Lab 00 — Roofline on three accelerators

The same matrix multiplication on a CPU, an A100, and an H100. You will see how the ceiling changes, how your kernel changes, and how far you are from peak on each.

Goal

For the same dense matmul shape (\(M = N = K = 8192\)), measure achieved FLOPS and HBM/RAM bandwidth on three distinct devices: FP16 on the two GPUs, and the FP32 MKL baseline from Phase 01 on the CPU. Plot all three on a unified roofline. Explain the gap to peak on each.

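This is the direct generalization of Phase 01's roofline lab to GPU hardware. The mental model is the same; the compute peaks are roughly 2,000× to 6,000× larger (~0.17 TF/s on the CPU versus 312 and 989 TF/s on the GPUs) and the memory bandwidth roughly 100× to 180× larger.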

Prerequisites

  • Phase 01 roofline lab completed (experiments/01-roofline/ exists with i5-8250U numbers).
  • A runpod.io (or lambda.ai, or vast.ai) account.
  • PyTorch 2.4+ installed locally (for the CPU baseline) and available in the rental images.

Hardware targets

| Target | Provider | Suggested SKU | Hourly cost (2025) | Run time | Total cost |
|---|---|---|---|---|---|
| Intel i5-8250U | Local (Borja's laptop) | - | - | already done in Phase 01 | $0 |
| NVIDIA A100 80 GB SXM4 | RunPod | "A100 SXM 80GB" (Community Cloud) | ~$1.50/h | ~30 min | ~$0.80 |
| NVIDIA H100 80 GB SXM5 | RunPod | "H100 SXM 80GB" (Community Cloud) | ~$3.00/h | ~30 min | ~$1.50 |

**Total cloud cost: ~$2.30** (with 1 h budget headroom: ~$5 ceiling).
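The budget arithmetic can be reproduced with a few lines. This is only a sketch: the prices and run times are copied from the table above, while the dictionary layout and the `total_cost` name are illustrative and not part of the lab repository.

```python
# Hourly prices and run times copied from the hardware-targets table.
# The structure and names here are illustrative assumptions.
RENTALS = {
    "A100 SXM 80GB": {"usd_per_hour": 1.50, "hours": 0.5},
    "H100 SXM 80GB": {"usd_per_hour": 3.00, "hours": 0.5},
}


def total_cost(rentals, hours_override=None):
    """Sum hourly price times rented hours over all pods.

    If hours_override is given, every pod is billed for that many hours.
    """
    return sum(
        r["usd_per_hour"] * (hours_override if hours_override is not None else r["hours"])
        for r in rentals.values()
    )


planned = total_cost(RENTALS)                      # 30 min per pod
ceiling = total_cost(RENTALS, hours_override=1.0)  # a full hour per pod
print(f"planned: ${planned:.2f}, one-hour-each ceiling: ${ceiling:.2f}")
```

The planned figure comes out at $2.25; the table rounds the A100 line up to ~$0.80, which gives the ~$2.30 quoted above. Billing a full hour on each pod gives $4.50, which is where the ~$5 ceiling comes from.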

Alternative providers (use whichever has best availability):

  • lambda.ai: A100 ~$1.10/h on-demand, H100 ~$2.49/h.
  • vast.ai — varies, often cheapest spot.

Setup script (for the cloud pod)

After you SSH into the pod:

```bash
# 1. Verify hardware.
nvidia-smi
# Expect: "NVIDIA A100-SXM4-80GB" or "NVIDIA H100 80GB HBM3", each reporting ~80 GB of memory.
# (~141 GB would indicate an H200, not the H100 80 GB SKU targeted here.)

# 2. Verify PyTorch CUDA build.
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.get_device_name(0))"

# 3. Clone your work.
git clone <your-fork-of-lynx-cortex>
cd lynx-cortex
uv sync
```

The kernel under test

Single FP16 dense matmul. Same shape on every accelerator for direct comparability.

```python
# experiments/x4-roofline/matmul_bench.py  (reference structure; learner writes the body)
import torch, time, json, sys
from pathlib import Path

def bench_matmul(M=8192, N=8192, K=8192, dtype=torch.float16, device="cuda", warmup=5, iters=20):
    """
    Returns (achieved_TFLOPS, achieved_GBps_one_way) for a single matmul.

    Notes for learner:
    - Use torch.cuda.Event(enable_timing=True) plus a synchronize for GPU timing. Kernel
      launches are asynchronous, so time.perf_counter without a synchronize measures
      only the launch, not the kernel.
    - Total FLOPs for matmul: 2 * M * N * K (one multiply + one add per (i, j, k) term,
      i.e. K multiply-adds per element of C).
    - Bytes moved per op (lower bound): (M*K + K*N + M*N) * dtype.itemsize. This is the
      ideal-cache estimate; real HBM traffic can be higher.
    - Arithmetic intensity = 2*M*N*K / bytes_moved.
    """
    # TODO (learner): implement.
    ...
```

The learner writes the body. The autograded test in tests/extension/X4/test_lab00_bench.py checks that the function returns sensible numbers (within a tolerance band) on whatever CUDA device runs the test.
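Before timing anything, it can help to work out the cost model from the docstring by hand. The sketch below evaluates the FLOP count, the ideal-cache byte count, and the resulting arithmetic intensity for the lab's shape. The function name `matmul_cost` and its `itemsize` parameter are illustrative assumptions, not part of the reference file.

```python
def matmul_cost(M=8192, N=8192, K=8192, itemsize=2):
    """Ideal-cache cost model for C = A @ B.

    itemsize is bytes per element (2 for FP16, 4 for FP32).
    Returns (flops, bytes_moved, arithmetic_intensity).
    """
    # One multiply and one add for every (i, j, k) triple.
    flops = 2 * M * N * K
    # Read A (M x K) and B (K x N) once, write C (M x N) once.
    bytes_moved = (M * K + K * N + M * N) * itemsize
    return flops, bytes_moved, flops / bytes_moved


flops, bytes_moved, intensity = matmul_cost()
print(f"{flops / 1e12:.2f} TFLOP, {bytes_moved / 1e6:.0f} MB, {intensity:.0f} FLOP/byte")
# prints: 1.10 TFLOP, 403 MB, 2731 FLOP/byte
```

For a square matmul of side \(n\) with 2-byte elements the intensity simplifies to \(2n^3 / (3n^2 \cdot 2) = n/3\), so it grows linearly with the matrix size.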

What to measure

For each device, record:

| Metric | How |
|---|---|
| Peak FP16 TFLOPS (vendor) | From the datasheet (see theory/01 table) |
| Peak HBM/RAM bandwidth (vendor) | From the datasheet |
| Achieved TFLOPS (your kernel) | `2*M*N*K / median_time` |
| Achieved bandwidth | `bytes_moved / median_time` |
| Arithmetic intensity | `2*M*N*K / bytes_moved` (≈ 2731 FLOP/byte for the 8192 cube in FP16, well above any roofline ridge) |
| MFU (model FLOPS utilization) | achieved / peak FP16 |
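The derived metrics all follow from one median time per device. Here is a minimal sketch of that reduction, assuming you already have a list of per-iteration times in seconds. The `summarize` helper is hypothetical, and the timings in the example are made-up placeholders, not measurements.

```python
import statistics


def summarize(times_s, flops, bytes_moved, peak_tflops):
    """Reduce per-iteration times (seconds) to the metrics in the table.

    Returns (achieved_tflops, achieved_gbps, mfu), with mfu as a fraction.
    """
    t = statistics.median(times_s)        # median is robust to a slow outlier
    achieved_tflops = flops / t / 1e12
    achieved_gbps = bytes_moved / t / 1e9
    mfu = achieved_tflops / peak_tflops
    return achieved_tflops, achieved_gbps, mfu


# Placeholder timings for illustration only; the last one mimics a cold iteration.
times = [0.0045, 0.0044, 0.0046, 0.0044, 0.0090]
flops = 2 * 8192 ** 3
bytes_moved = 3 * 8192 ** 2 * 2           # FP16, ideal-cache estimate

tflops, gbps, mfu = summarize(times, flops, bytes_moved, peak_tflops=312.0)
print(f"{tflops:.1f} TF/s, {gbps:.1f} GB/s, MFU {mfu:.1%}")
```

Note that the achieved bandwidth computed this way is far below the device's peak bandwidth. That is expected for a compute-bound kernel: the matmul finishes its arithmetic long after the minimum data movement would have completed.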

Expected results (sanity table)

Your results should land within ±20% of these numbers. If you are way off, suspect a benchmarking bug (no warmup, no torch.cuda.synchronize(), wrong FLOP count).

| Device | Peak FP16 (TF/s) | Peak BW (TB/s) | Expected achieved TF/s | Expected MFU |
|---|---|---|---|---|
| i5-8250U (CPU, FP32 via MKL) | ~0.17 | ~0.019 | ~0.10-0.15 | 60-85% |
| A100 80 GB (FP16) | 312 | 2.0 | 230-280 | 75-90% |
| H100 80 GB (FP16) | 989 | 3.35 | 700-850 | 70-85% |

[source: NVIDIA cuBLAS performance guide 2023; MLPerf inference matmul microbenchmarks]

Why these MFU values:

  • CPU (BLIS / MKL gemm) is extremely well-tuned; MKL on Kaby Lake delivers ~80% of peak.
  • A100 with cuBLAS hits ~85% on well-shaped matmuls.
  • H100 is slightly harder to saturate at FP16 without using FP8 + Transformer Engine; cuBLAS gets ~75-85% on this size.

If you want to push H100 further, switch to FP8 via transformer_engine.pytorch.Linear. You should see ~1500 TF/s: still below the 1979 TF/s dense FP8 peak (about 76% MFU), but roughly twice the FP16 result.

Roofline plot

Plot all three devices on one log-log plot:

  • x-axis: arithmetic intensity (FLOP/byte).
  • y-axis: performance (FLOP/s).
  • For each device, draw the memory line (performance = BW × intensity, which is a slope-1 line on log-log axes whose vertical offset is set by BW) and the compute ceiling (horizontal at peak FLOPS).
  • Place your kernel as one point per device.

For an \(8192^3\) FP16 matmul, the arithmetic intensity is far above every device's ridge, so all three points should land near the compute ceiling. If yours land below, that's a story to tell: which knob is missing?
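The claim that the kernel sits above every ridge can be checked numerically. The sketch below computes each device's ridge point (peak FLOPS divided by peak bandwidth) and the roofline-attainable performance at the kernel's intensity, using the vendor peaks from the sanity table. The function names and the `DEVICES` dictionary are illustrative assumptions.

```python
# (peak TFLOP/s, peak TB/s) taken from the sanity table in this lab.
DEVICES = {
    "i5-8250U": (0.17, 0.019),
    "A100 80 GB": (312.0, 2.0),
    "H100 80 GB": (989.0, 3.35),
}


def ridge_point(peak_tflops, peak_tbps):
    """Intensity (FLOP/byte) where the memory line meets the compute ceiling."""
    return peak_tflops / peak_tbps        # (TFLOP/s) / (TB/s) = FLOP/byte


def attainable_tflops(intensity, peak_tflops, peak_tbps):
    """Roofline model: the lower of the memory line and the compute ceiling."""
    return min(peak_tflops, peak_tbps * intensity)


intensity = 8192 / 3                       # FP16 cube, ideal-cache estimate
for name, (peak, bw) in DEVICES.items():
    ridge = ridge_point(peak, bw)
    roof = attainable_tflops(intensity, peak, bw)
    bound = "compute" if intensity >= ridge else "memory"
    print(f"{name}: ridge ~{ridge:.0f} FLOP/byte, roof {roof} TF/s, {bound}-bound")
```

With these peaks the ridges come out near 9, 156, and 295 FLOP/byte, so an intensity of about 2731 FLOP/byte is compute-bound on all three. The CPU baseline runs in FP32, which halves its intensity to about 1365 FLOP/byte, still far above its ridge.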

Gap explanation (the actual learning)

For each device, write 3-5 sentences explaining the gap between achieved and peak. Expected answers:

  • CPU: AVX2-FMA throughput limit; the L3 → register path is saturated; MKL's block-packing leaves a few % on the table.
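  • A100: cuBLAS launch overhead per call; the kernel runs FP16 by specification, so TF32 is not in play (and would not help: the A100's TF32 tensor-core peak is 156 TF/s, half the FP16 rate); HBM2e write-back of C is unavoidable.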
  • H100: cuBLAS at FP16 does not use the new FP8 Transformer Engine path; the new asynchronous WGMMA instruction needs explicit programming to fully exploit; FP16 is "slow mode" on H100.

The interview-grade insight: gaps are normal and tell you which knob to turn. Moving CPU from naive → MKL is 100×. Moving H100 from FP16 cuBLAS → FP8 Transformer Engine is another 1.5-2×.

Deliverables

  • experiments/x4-roofline/manifest.json — pinned versions, seeds, SKUs, hourly cost.
  • experiments/x4-roofline/results.csv — one row per (device, dtype, shape, achieved_tflops, achieved_bw_gbps, mfu).
  • experiments/x4-roofline/roofline.png — the three-device plot.
  • experiments/x4-roofline/REPORT.md — 1 page narrating the gaps.
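One possible way to produce results.csv with the columns listed above is the standard-library `csv` module. This is a sketch, not the required implementation: the `write_results` helper, the `shape` string format, and the device label are assumptions, and the zeros in the example row are placeholders to be replaced by your measurements.

```python
import csv
import io

# Column order taken from the results.csv deliverable description.
FIELDS = ["device", "dtype", "shape", "achieved_tflops", "achieved_bw_gbps", "mfu"]


def write_results(rows, fh):
    """Write one CSV row per measurement to an open text file handle."""
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Placeholder row: the zeros stand in for values you measure yourself.
example_rows = [
    {
        "device": "A100-SXM4-80GB",
        "dtype": "float16",
        "shape": "8192x8192x8192",
        "achieved_tflops": 0.0,
        "achieved_bw_gbps": 0.0,
        "mfu": 0.0,
    }
]

buf = io.StringIO()                        # swap for open(path, "w", newline="")
write_results(example_rows, buf)
print(buf.getvalue())
```

Writing to a file handle rather than a path keeps the helper easy to test; when writing a real file, pass `newline=""` to `open` as the `csv` documentation recommends.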

Definition of Done

  • All three devices measured.
  • Achieved FP16 TFLOPS on H100 ≥ 600 (i.e. MFU ≥ 60%).
  • Achieved FP16 TFLOPS on A100 ≥ 200 (i.e. MFU ≥ 64%).
  • Roofline plot is in the artifacts directory and rendered in the report.
  • Gap explanation paragraph is concrete (cites the missing optimization), not generic.

References

  • Williams S., Waterman A., Patterson D. 2009, Roofline: An Insightful Visual Performance Model for Multicore Architectures, CACM.
  • NVIDIA cuBLAS Performance Guide, 2023.
  • NVIDIA H100 Tuning Guide, 2023.