Lab 00 — Roofline on three accelerators¶
The same matrix multiplication on CPU, A100, and H100. You will see how the ceiling changes, how your kernel changes, and how far you are from peak on each one.
Goal¶
For the same FP16 matmul kernel (\(M = N = K = 8192\)), measure achieved FLOPS and HBM/RAM bandwidth on three distinct accelerators. Plot all three on a unified roofline. Explain the gap to peak on each.
This is the direct generalization of Phase 01's roofline lab to GPU hardware. The mental model is the same; the peak compute numbers are roughly 2,000× to 6,000× larger (~0.17 TF/s on the CPU versus 312 and 989 TF/s on the two GPUs).
Prerequisites¶
- Phase 01 roofline lab completed (experiments/01-roofline/ exists with i5-8250U numbers).
- A runpod.io (or lambda.ai, or vast.ai) account.
- PyTorch 2.4+ installed locally (for the CPU baseline) and available in the rental images.
Hardware targets¶
| Target | Provider | Suggested SKU | Hourly cost (2025) | Run time | Total cost |
|---|---|---|---|---|---|
| Intel i5-8250U | Local (Borja's laptop) | — | — | already done in Phase 01 | $0 |
| NVIDIA A100 80 GB SXM4 | RunPod | "A100 SXM 80GB" (Community Cloud) | ~$1.50/h | ~30 min | ~$0.80 |
| NVIDIA H100 80 GB SXM5 | RunPod | "H100 SXM 80GB" (Community Cloud) | ~$3.00/h | ~30 min | ~$1.50 |
**Total cloud cost: ~$2.30** (with 1 h budget headroom: ~$5 ceiling).
Alternative providers (use whichever has best availability):
- lambda.ai: A100 ~$1.10/h on-demand, H100 ~$2.49/h.
- vast.ai: varies, often cheapest spot.
Setup script (for the cloud pod)¶
After you SSH into the pod:
# 1. Verify hardware.
nvidia-smi
# Expect: "NVIDIA A100-SXM4-80GB" or "NVIDIA H100 80GB HBM3", with about 80 GB of memory on either SKU (141 GB would indicate an H200).
# 2. Verify PyTorch CUDA build.
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.get_device_name(0))"
# 3. Clone your work.
git clone <your-fork-of-lynx-cortex>
cd lynx-cortex
uv sync
The kernel under test¶
Single FP16 dense matmul. Same shape on every accelerator for direct comparability.
# experiments/x4-roofline/matmul_bench.py (reference structure; learner writes the body)
import torch, time, json, sys
from pathlib import Path
def bench_matmul(M=8192, N=8192, K=8192, dtype=torch.float16, device="cuda", warmup=5, iters=20):
    """
    Returns (achieved_TFLOPS, achieved_GBps_one_way) for a single matmul.

    Notes for learner:
    - Time GPU work with torch.cuda.Event plus synchronize. CUDA launches are asynchronous,
      so time.perf_counter without a synchronize only measures the launch.
    - Total FLOPs for matmul: 2 * M * N * K (one multiply and one add per inner-product term,
      K terms for each element of C).
    - Bytes moved per op (lower bound): (M*K + K*N + M*N) * dtype.itemsize. This is the
      ideal-cache estimate; real HBM traffic can be higher.
    - Arithmetic intensity = 2*M*N*K / bytes_moved.
    """
    # TODO (learner): implement.
    ...
The learner writes the body. The autograded test in tests/extension/X4/test_lab00_bench.py checks that the function returns sensible numbers (within a tolerance band) on whatever CUDA device runs the test.
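The timing loop itself can be structured independently of the device. The sketch below is one possible shape, not the reference solution: `median_seconds`, `workload`, and `sync` are hypothetical names introduced here. On a CUDA device `sync` would typically be `torch.cuda.synchronize` (or the loop would be timed with a pair of `torch.cuda.Event(enable_timing=True)` objects instead); on CPU it can stay a no-op.

```python
import statistics
import time
from typing import Callable


def median_seconds(
    workload: Callable[[], object],
    sync: Callable[[], None] = lambda: None,
    warmup: int = 5,
    iters: int = 20,
) -> float:
    """Median wall-clock time of `workload` over `iters` timed runs.

    `sync` must block until all queued device work has finished. Without it,
    an asynchronous GPU launch returns immediately and the timer only sees
    the launch overhead.
    """
    # Warmup runs absorb one-time costs (lazy initialization, autotuning, cold caches).
    for _ in range(warmup):
        workload()
    sync()

    samples = []
    for _ in range(iters):
        sync()                       # make sure nothing is left over from the previous run
        start = time.perf_counter()
        workload()
        sync()                       # wait for the work to actually complete
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# CPU stand-in workload so the sketch runs without a GPU.
elapsed = median_seconds(lambda: sum(i * i for i in range(100_000)))
print(f"median: {elapsed * 1e3:.3f} ms")
```

The median is used instead of the mean so that a single slow iteration (a clock ramp or a noisy neighbour on a shared host) does not skew the result.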
What to measure¶
For each device, record:
| Metric | How |
|---|---|
| Peak FP16 TFLOPS (vendor) | From the datasheet (see theory/01 table) |
| Peak HBM/RAM bandwidth (vendor) | From the datasheet |
| Achieved TFLOPS (your kernel) | 2*M*N*K / median_time |
| Achieved bandwidth | bytes_moved / median_time |
| Arithmetic intensity | 2*M*N*K / bytes_moved (= 8192/3 ≈ 2731 FLOP/byte for the 8192 cube in FP16, well above any roofline ridge) |
| MFU (model FLOPS utilization) | achieved / peak FP16 |
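The rows of this table reduce to a few lines of arithmetic once a median time is available. The sketch below assumes a hypothetical helper `derive_metrics` and a made-up median of 4.4 ms on an A100; neither comes from the lab's reference code, and the timing is illustrative, not a measured result.

```python
from dataclasses import dataclass


@dataclass
class MatmulMetrics:
    tflops: float      # achieved TFLOP/s
    gbps: float        # achieved GB/s, using the ideal-cache byte count
    intensity: float   # FLOP per byte
    mfu: float         # achieved / vendor peak, as a fraction


def derive_metrics(m, n, k, itemsize, median_time_s, peak_tflops):
    flops = 2 * m * n * k                              # one multiply + one add per (i, j, k)
    bytes_moved = (m * k + k * n + m * n) * itemsize   # read A and B once, write C once
    tflops = flops / median_time_s / 1e12
    return MatmulMetrics(
        tflops=tflops,
        gbps=bytes_moved / median_time_s / 1e9,
        intensity=flops / bytes_moved,
        mfu=tflops / peak_tflops,
    )


# Illustrative input only: 4.4 ms median, FP16 (2 bytes per element), A100 peak of 312 TF/s.
metrics = derive_metrics(8192, 8192, 8192, itemsize=2, median_time_s=4.4e-3, peak_tflops=312.0)
print(metrics)
```

With the three-matrix byte count, the intensity evaluates to 8192/3 ≈ 2731 FLOP/byte, and the illustrative 4.4 ms corresponds to about 250 TFLOP/s, or roughly 80% MFU. The achieved bandwidth comes out near 92 GB/s, far below the HBM peak, which is what a compute-bound kernel should show.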
Expected results (sanity table)¶
These are the numbers you should land within ±20% of. If you are way off, suspect a benchmarking bug (no warmup, no torch.cuda.synchronize, wrong FLOP count).
| Device | Peak FP16 (TF/s) | Peak BW (TB/s) | Expected achieved TF/s | Expected MFU |
|---|---|---|---|---|
| i5-8250U (CPU, FP32 via MKL) | ~0.17 | ~0.019 | ~0.10-0.15 | 60-85% |
| A100 80 GB (FP16) | 312 | 2.0 | 230-280 | 75-90% |
| H100 80 GB (FP16) | 989 | 3.35 | 700-850 | 70-85% |
[source: NVIDIA cuBLAS performance guide 2023; MLPerf inference matmul microbenchmarks]
Why these MFU values:
- CPU (BLIS / MKL gemm) is extremely well-tuned; MKL on Kaby Lake delivers ~80% of peak.
- A100 with cuBLAS hits ~85% on well-shaped matmuls.
- H100 is slightly harder to saturate at FP16 without using FP8 + Transformer Engine; cuBLAS gets ~75-85% on this size.
If you want to push H100 further, switch to FP8 via transformer_engine.pytorch.Linear — you should see ~1500 TF/s (still below the 1979 peak, but closer).
Roofline plot¶
Plot all three devices on one log-log plot:
- x-axis: arithmetic intensity (FLOP/byte).
- y-axis: performance (FLOP/s).
- For each device, draw the memory line (slope = BW) and the compute ceiling (horizontal at peak FLOPS).
- Place your kernel as one point per device.
For an \(8192^3\) FP16 matmul, the arithmetic intensity is far above every device's ridge, so all three points should land near the compute ceiling. If yours land below, that's a story to tell: which knob is missing?
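The claim that the kernel sits right of every ridge can be checked numerically. The sketch below takes the peak numbers from the sanity table above; the function names `ridge_point` and `attainable_tflops` are introduced here for illustration.

```python
# Peak numbers copied from the sanity table: (peak TFLOP/s, peak TB/s).
DEVICES = {
    "i5-8250U": (0.17, 0.019),
    "A100 80 GB": (312.0, 2.0),
    "H100 80 GB": (989.0, 3.35),
}


def ridge_point(peak_tflops, peak_tbps):
    """Arithmetic intensity (FLOP/byte) where the memory line meets the compute ceiling."""
    return peak_tflops / peak_tbps   # the 1e12 factors cancel


def attainable_tflops(intensity, peak_tflops, peak_tbps):
    """Roofline bound: the lower of the bandwidth line and the compute ceiling."""
    return min(peak_tflops, intensity * peak_tbps)


# FP16 cube with the ideal-cache byte count (three matrices, 2 bytes per element).
kernel_intensity = (2 * 8192**3) / (3 * 8192**2 * 2)

for name, (peak_tflops, peak_tbps) in DEVICES.items():
    ridge = ridge_point(peak_tflops, peak_tbps)
    bound = attainable_tflops(kernel_intensity, peak_tflops, peak_tbps)
    print(f"{name}: ridge {ridge:.1f} FLOP/byte, roofline bound {bound:.2f} TFLOP/s")
```

The ridges come out at roughly 9, 156, and 295 FLOP/byte, all far below the kernel's intensity, so the bound on each device is its compute ceiling. The CPU run uses FP32, which halves the intensity, but that is still two orders of magnitude above its ridge. The same `attainable_tflops` function evaluated over a log-spaced range of intensities gives the roofline curve to draw for each device.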
Gap explanation (the actual learning)¶
For each device, write 3-5 sentences explaining the gap between achieved and peak. Expected answers:
- CPU: AVX2-FMA throughput limit; the L3 → register path is saturated; MKL's block-packing leaves a few % on the table.
- A100: cuBLAS launch overhead per call; the kernel is not using TF32 (FP16 is the spec); HBM2e write-back of C is unavoidable.
- H100: cuBLAS at FP16 does not use the new FP8 Transformer Engine path; the new asynchronous WGMMA instruction needs explicit programming to fully exploit; FP16 is "slow mode" on H100.
The interview-grade insight: gaps are normal and tell you which knob to turn. Moving CPU from naive → MKL is 100×. Moving H100 from FP16 cuBLAS → FP8 Transformer Engine is another 1.5-2×.
Deliverables¶
- experiments/x4-roofline/manifest.json: pinned versions, seeds, SKUs, hourly cost.
- experiments/x4-roofline/results.csv: one row per (device, dtype, shape, achieved_tflops, achieved_bw_gbps, mfu).
- experiments/x4-roofline/roofline.png: the three-device plot.
- experiments/x4-roofline/REPORT.md: 1 page narrating the gaps.
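A minimal sketch of writing the two machine-readable deliverables with the standard library follows. The CSV columns are the ones listed above; the manifest keys and the helper name `write_results` are assumptions made for illustration, and every value in the example row is a placeholder, not a measured result.

```python
import csv
import json
from pathlib import Path

COLUMNS = ["device", "dtype", "shape", "achieved_tflops", "achieved_bw_gbps", "mfu"]


def write_results(out_dir, rows, manifest):
    """Write results.csv and manifest.json into out_dir, creating it if needed."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(out / "results.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
    (out / "manifest.json").write_text(json.dumps(manifest, indent=2))


# Placeholder values only; replace them with your own measurements.
example_row = {
    "device": "A100 80 GB",
    "dtype": "float16",
    "shape": "8192x8192x8192",
    "achieved_tflops": 0.0,
    "achieved_bw_gbps": 0.0,
    "mfu": 0.0,
}
# Hypothetical manifest layout covering versions, seeds, SKUs, and hourly cost.
example_manifest = {
    "torch_version": "2.4.0",
    "seed": 0,
    "sku": "A100 SXM 80GB",
    "hourly_cost_usd": 1.50,
}

# write_results("experiments/x4-roofline", [example_row], example_manifest)
```

Using `csv.DictWriter` with a fixed column list keeps the header order stable across devices, so rows from separate pods can be concatenated without realignment.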
Definition of Done¶
- All three devices measured.
- Achieved FP16 TFLOPS on H100 ≥ 600 (i.e. MFU ≥ 60%).
- Achieved FP16 TFLOPS on A100 ≥ 200 (i.e. MFU ≥ 64%).
- Roofline plot is in the artifacts directory and rendered in the report.
- Gap explanation paragraph is concrete (cites the missing optimization), not generic.
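The numeric items in this checklist can be verified mechanically before writing the report. The sketch below is a hypothetical self-check, separate from the autograded test; the device labels used as dictionary keys are an assumption of this example.

```python
# Minimum achieved FP16 TFLOP/s taken from the Definition of Done.
THRESHOLDS_TFLOPS = {"H100": 600.0, "A100": 200.0}


def check_done(results):
    """Return a list of failure messages; an empty list means the numeric checks pass.

    `results` maps a device label to achieved TFLOP/s. The labels are an
    assumption of this sketch, not a format required by the lab.
    """
    failures = []
    if len(results) < 3:
        failures.append(f"expected 3 devices, got {len(results)}")
    for device, minimum in THRESHOLDS_TFLOPS.items():
        achieved = results.get(device)
        if achieved is None:
            failures.append(f"{device}: not measured")
        elif achieved < minimum:
            failures.append(f"{device}: {achieved:.0f} TFLOP/s is below {minimum:.0f}")
    return failures
```

The plot and the gap explanation still need a human read; this only covers the thresholds and the device count.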
Cross-links¶
- Phase 01 — Hardware Substrate: the CPU half of this lab.
- theory/02-h100-and-h200.md: the H100 spec table.
- theory/04-datacenter-economics.md: why MFU matters as a cost lever.
References¶
- Williams S., Waterman A., Patterson D. 2009, Roofline: An Insightful Visual Performance Model for Multicore Architectures, CACM.
- NVIDIA cuBLAS Performance Guide, 2023.
- NVIDIA H100 Tuning Guide, 2023.