
Lab 00 — Roofline on three accelerators

The same matrix multiplication on CPU, A100, and H100. You will see how the ceiling changes, how your kernel changes, and how far you are from peak on each.

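Goal

For the same FP16 matmul kernel ($M = N = K = 8192$), measure achieved FLOPS and HBM/RAM bandwidth on three devices: a laptop CPU, an A100, and an H100 (the CPU baseline from Phase 01 runs in FP32 via MKL). Plot all three on a unified roofline. Explain the gap to peak on each.

This is the direct generalization of Phase 01's roofline lab to GPU hardware. The mental model is the same; going by the vendor peaks in the sanity table, compute is roughly 2,000-6,000× higher and memory bandwidth roughly 100-200× higher than on the laptop CPU.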

Prerequisites

Phase 01 roofline lab completed (experiments/01-roofline/ exists with i5-8250U numbers).
A runpod.io (or lambda.ai, or vast.ai) account.
PyTorch 2.4+ installed locally (for the CPU baseline) and available in the rental images.

Hardware targets

Target	Provider	Suggested SKU	Hourly cost (2025)	Run time	Total cost
Intel i5-8250U	Local (Borja's laptop)	—	—	already done in Phase 01	$0
NVIDIA A100 80 GB SXM4	RunPod	"A100 SXM 80GB" (Community Cloud)	~$1.50/h	~30 min	~$0.80
NVIDIA H100 80 GB SXM5	RunPod	"H100 SXM 80GB" (Community Cloud)	~$3.00/h	~30 min	~$1.50

**Total cloud cost: ~$2.30** (with 1 h budget headroom: ~$5 ceiling).

Alternative providers (use whichever has best availability):

lambda.ai — A100 ~$1.10/h on-demand, H100 ~$2.49/h.
vast.ai — varies, often cheapest spot.

Setup script (for the cloud pod)

After SSH into the pod:

# 1. Verify hardware.
nvidia-smi
# Expect: "NVIDIA A100-SXM4-80GB" or "NVIDIA H100 80GB HBM3", each with ~80 GB of memory (~141 GB would indicate an H200, not an H100).

# 2. Verify PyTorch CUDA build.
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.get_device_name(0))"

# 3. Clone your work.
git clone <your-fork-of-lynx-cortex>
cd lynx-cortex
uv sync

The kernel under test

Single FP16 dense matmul. Same shape on every accelerator for direct comparability.

# experiments/x4-roofline/matmul_bench.py  (reference structure; learner writes the body)
import torch, time, json, sys
from pathlib import Path

def bench_matmul(M=8192, N=8192, K=8192, dtype=torch.float16, device="cuda", warmup=5, iters=20):
    """
    Returns (achieved_TFLOPS, achieved_GBps_one_way) for a single matmul.

    Notes for learner:
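    - Use torch.cuda.Event pairs plus torch.cuda.synchronize() for GPU timing; kernel launches are
      asynchronous, so time.perf_counter alone measures launch time, not execution time.
    - Total FLOPs for matmul: 2 * M * N * K (one multiply + one add for each of the K terms in
      every element of C).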
    - Bytes moved per op (lower bound): (M*K + K*N + M*N) * dtype.itemsize. This is the
      ideal-cache estimate; real HBM traffic can be higher.
    - Arithmetic intensity = 2*M*N*K / bytes_moved.
    """
    # TODO (learner): implement.
    ...

The learner writes the body. The autograded test in tests/extension/X4/test_lab00_bench.py checks that the function returns sensible numbers (within a tolerance band) on whatever CUDA device runs the test.
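The timing discipline the docstring asks for (warm up, synchronize, then take a median) can be sketched without any GPU dependency. This is a minimal sketch under stated assumptions, not the lab's reference solution: `median_seconds` and its `sync` parameter are hypothetical names introduced here. On a CUDA device you would pass `torch.cuda.synchronize` as `sync`, or swap the host timer for `torch.cuda.Event(enable_timing=True)` pairs as the docstring recommends.

```python
import statistics
import time


def median_seconds(fn, sync=lambda: None, warmup=5, iters=20):
    """Median wall-clock seconds per call of fn (hypothetical helper).

    sync is called to drain any asynchronous work; the default is a no-op,
    which is only correct for synchronous (CPU) callables.
    """
    for _ in range(warmup):
        fn()  # warmup calls are not timed
    sync()  # make sure warmup work has finished before timing starts
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        sync()  # wait for the work itself, not just its launch
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# CPU-only demonstration with a stand-in workload.
cpu_demo_seconds = median_seconds(lambda: sum(range(100_000)), warmup=2, iters=5)
```

The median is used rather than the mean so that a single slow iteration (clock ramp-up, a noisy neighbour on a shared pod) does not skew the result.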

What to measure

For each device, record:

Metric	How
Peak FP16 TFLOPS (vendor)	From the datasheet (see `theory/01` table)
Peak HBM/RAM bandwidth (vendor)	From the datasheet
Achieved TFLOPS (your kernel)	`2*M*N*K / median_time`
Achieved bandwidth	`bytes_moved / median_time`
Arithmetic intensity	`2*M*N*K / bytes_moved` (≈ 2731 FLOP/byte for the FP16 8192 cube, i.e. $N/3$; well above any roofline ridge)
MFU (model FLOPS utilization)	achieved / peak FP16
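The derived rows of this table are plain arithmetic on one measured number, the median time. The sketch below works them through for the lab's shape; `matmul_metrics` is a hypothetical helper name, and the 4.4 ms median is an illustrative input, not a measurement.

```python
def matmul_metrics(M, N, K, itemsize, median_time_s, peak_tflops):
    """Derive the table's metrics from one median timing (hypothetical helper)."""
    flops = 2 * M * N * K                             # K multiply-adds per element of C
    bytes_moved = (M * K + K * N + M * N) * itemsize  # ideal-cache lower bound
    achieved_tflops = flops / median_time_s / 1e12
    return {
        "achieved_tflops": achieved_tflops,
        "achieved_gbps": bytes_moved / median_time_s / 1e9,
        "arithmetic_intensity": flops / bytes_moved,  # FLOP per byte
        "mfu": achieved_tflops / peak_tflops,
    }


# Illustrative input only: an assumed 4.4 ms median against the A100's 312 TF/s peak.
example = matmul_metrics(8192, 8192, 8192, itemsize=2,
                         median_time_s=4.4e-3, peak_tflops=312)
```

With these inputs the arithmetic intensity comes out at 8192/3 ≈ 2731 FLOP/byte, the throughput at roughly 250 TF/s, and the ideal-cache bandwidth at roughly 92 GB/s. The bandwidth figure sits far below the HBM peak, which is what a compute-bound kernel should look like.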

Expected results (sanity table)

These are the numbers you should land within ±20% of. If you are far off, suspect a benchmarking bug (no warmup, missing torch.cuda.synchronize(), wrong FLOP count).

Device	Peak compute (TF/s)	Peak BW (TB/s)	Expected achieved TF/s	Expected MFU
i5-8250U (CPU, FP32 via MKL)	~0.17	~0.019	~0.10-0.15	60-85%
A100 80 GB (FP16)	312	2.0	230-280	75-90%
H100 80 GB (FP16)	989	3.35	700-850	70-85%

[source: NVIDIA cuBLAS performance guide 2023; MLPerf inference matmul microbenchmarks]

Why these MFU values:

CPU (BLIS / MKL gemm) is extremely well-tuned; MKL on Kaby Lake delivers ~80% of peak.
A100 with cuBLAS hits ~85% on well-shaped matmuls.
H100 is slightly harder to saturate at FP16 without using FP8 + Transformer Engine; cuBLAS gets ~75-85% on this size.

If you want to push H100 further, switch to FP8 via transformer_engine.pytorch.Linear: you should see ~1500 TF/s, about 76% of the 1979 TF/s dense FP8 peak.

Roofline plot

Plot all three devices on one log-log plot:

x-axis: arithmetic intensity (FLOP/byte).
y-axis: performance (FLOP/s).
For each device, draw the memory-bound line $y = \text{BW} \times \text{AI}$ (slope 1 on log-log axes, with its offset set by BW) and the compute ceiling (horizontal at peak FLOPS).
Place your kernel as one point per device.
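The plotting recipe above can be sketched as two small functions. This is one possible layout, not the lab's reference plot: `roofline_series` and `plot_roofline` are hypothetical names, and matplotlib is assumed to be installed (it is imported inside the plotting function so the series helper works without it).

```python
def roofline_series(peak_flops, bw_bytes_per_s, ai_min=0.1, ai_max=1e4, points=50):
    """Sample attainable FLOP/s = min(peak, BW * AI) at log-spaced intensities."""
    xs = [ai_min * (ai_max / ai_min) ** (i / (points - 1)) for i in range(points)]
    ys = [min(peak_flops, bw_bytes_per_s * x) for x in xs]
    return xs, ys


def plot_roofline(devices, kernel_points, path="roofline.png"):
    """devices: name -> (peak FLOP/s, bytes/s); kernel_points: name -> (AI, FLOP/s)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display needed on a cloud pod
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for name, (peak, bw) in devices.items():
        xs, ys = roofline_series(peak, bw)
        (line,) = ax.loglog(xs, ys, label=name)
        if name in kernel_points:
            ai, achieved = kernel_points[name]
            ax.scatter([ai], [achieved], color=line.get_color(), zorder=3)
    ax.set_xlabel("Arithmetic intensity (FLOP/byte)")
    ax.set_ylabel("Performance (FLOP/s)")
    ax.legend()
    fig.savefig(path, dpi=150)
    plt.close(fig)


# Vendor peaks taken from the sanity table, converted to FLOP/s and bytes/s.
DEVICES = {
    "i5-8250U": (0.17e12, 0.019e12),
    "A100 80 GB": (312e12, 2.0e12),
    "H100 80 GB": (989e12, 3.35e12),
}
```

Calling `plot_roofline(DEVICES, your_points, "experiments/x4-roofline/roofline.png")` with your own measured `(AI, FLOP/s)` pairs would produce the three-device figure.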

For an $8192^3$ FP16 matmul, the arithmetic intensity is far above every device's ridge, so all three points should land near the compute ceiling. If yours land below, that's a story to tell: which knob is missing?
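The claim that the kernel sits far above every ridge can be checked numerically: the ridge point is peak FLOP/s divided by peak bandwidth. The sketch below uses the vendor peaks from the sanity table; the variable names are illustrative.

```python
# name -> (peak FLOP/s, peak bytes/s), from the sanity table.
DEVICE_PEAKS = {
    "i5-8250U": (0.17e12, 0.019e12),
    "A100 80 GB": (312e12, 2.0e12),
    "H100 80 GB": (989e12, 3.35e12),
}

N = 8192
# FP16 kernel: 2*N^3 FLOPs over 3 matrices of N^2 two-byte elements.
# (The FP32 CPU baseline moves 4-byte elements, which halves this value.)
KERNEL_AI = (2 * N**3) / (3 * N**2 * 2)

# Ridge point: the intensity at which the memory line meets the compute ceiling.
ridge = {name: peak / bw for name, (peak, bw) in DEVICE_PEAKS.items()}

for name, r in ridge.items():
    print(f"{name}: ridge at {r:.0f} FLOP/byte, kernel is {KERNEL_AI / r:.0f}x above it")
```

The ridges come out at about 9, 156, and 295 FLOP/byte respectively, so even the H100 leaves roughly a 9× margin before this matmul would become memory-bound.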

Gap explanation (the actual learning)

For each device, write 3-5 sentences explaining the gap between achieved and peak. Expected answers:

CPU: AVX2-FMA throughput limit; the L3 → register path is saturated; MKL's block-packing leaves a few % on the table.
A100: cuBLAS launch overhead per call; the kernel is not using TF32 (FP16 is the spec); HBM2e write-back of C is unavoidable.
H100: cuBLAS at FP16 does not use the new FP8 Transformer Engine path; the new asynchronous WGMMA instruction needs explicit programming to fully exploit; FP16 is "slow mode" on H100.

The interview-grade insight: gaps are normal and tell you which knob to turn. Moving CPU from naive → MKL is 100×. Moving H100 from FP16 cuBLAS → FP8 Transformer Engine is another 1.5-2×.

Deliverables

experiments/x4-roofline/manifest.json — pinned versions, seeds, SKUs, hourly cost.
experiments/x4-roofline/results.csv — one row per device, with columns device, dtype, shape, achieved_tflops, achieved_bw_gbps, mfu.
experiments/x4-roofline/roofline.png — the three-device plot.
experiments/x4-roofline/REPORT.md — 1 page narrating the gaps.
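One way to emit the two machine-readable deliverables with only the standard library is sketched below. The CSV columns follow the list above; the manifest keys and the helper name `write_artifacts` are assumptions, since the lab does not fix a schema, and the example row holds placeholder values rather than measurements.

```python
import csv
import json
import tempfile
from pathlib import Path

FIELDS = ["device", "dtype", "shape", "achieved_tflops", "achieved_bw_gbps", "mfu"]


def write_artifacts(out_dir, rows, manifest):
    """Write results.csv and manifest.json into out_dir (hypothetical helper)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(out / "results.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    (out / "manifest.json").write_text(json.dumps(manifest, indent=2))


# Placeholder values for illustration only; replace with your own measurements.
example_row = {"device": "A100", "dtype": "float16", "shape": "8192x8192x8192",
               "achieved_tflops": 0.0, "achieved_bw_gbps": 0.0, "mfu": 0.0}
# Assumed manifest keys covering versions, seed, SKU, and hourly cost.
example_manifest = {"torch_version": "2.4", "seed": 0,
                    "sku": "A100 SXM 80GB", "hourly_cost_usd": 1.50}

with tempfile.TemporaryDirectory() as tmp:
    write_artifacts(tmp, [example_row], example_manifest)
    written = sorted(p.name for p in Path(tmp).iterdir())
```

In the lab you would point `out_dir` at `experiments/x4-roofline/` instead of a temporary directory.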

Definition of Done

All three devices measured.
Achieved FP16 TFLOPS on H100 ≥ 600 (i.e. MFU ≥ 60%).
Achieved FP16 TFLOPS on A100 ≥ 200 (i.e. MFU ≥ 64%).
Roofline plot is in the artifacts directory and rendered in the report.
Gap explanation paragraph is concrete (cites the missing optimization), not generic.
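The two numeric criteria can be checked mechanically before you write the report. A minimal sketch, assuming your measurements are held in a dict keyed by short device names; `check_done` and the key names are hypothetical, and the qualitative criteria (plot, gap explanation) still need a human read.

```python
# Minimum achieved FP16 TFLOPS per device, from the Definition of Done.
THRESHOLDS_TFLOPS = {"H100": 600.0, "A100": 200.0}


def check_done(achieved_tflops_by_device):
    """Return a list of failure messages; an empty list means both thresholds pass."""
    failures = []
    for device, minimum in THRESHOLDS_TFLOPS.items():
        got = achieved_tflops_by_device.get(device)
        if got is None:
            failures.append(f"{device}: not measured")
        elif got < minimum:
            failures.append(f"{device}: {got:.0f} TFLOPS is below the {minimum:.0f} minimum")
    return failures
```

For example, `check_done({"H100": 550.0})` reports two failures: the H100 is under its threshold and the A100 is missing.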

Cross-links

Phase 01 — Hardware Substrate: the CPU half of this lab.
theory/02-h100-and-h200.md: the H100 spec table.
theory/04-datacenter-economics.md: why MFU matters as a cost lever.

References

Williams S., Waterman A., Patterson D. 2009, Roofline: An Insightful Visual Performance Model for Multicore Architectures, CACM.
NVIDIA cuBLAS Performance Guide, 2023.
NVIDIA H100 Tuning Guide, 2023.