Lab 03 — GPU Roofline: Plot and Operator Placement

Goal: plot the multi-dtype roofline for your rented GPU using measured peaks (from labs 01 and 02, plus a fresh cuBLAS GEMM benchmark), then place Phase-22's decode operators on it.

Estimated time: 90–120 minutes.

Prereq: labs 00, 01, 02 complete. Phase-22 MiniGPT working on CPU (with cache).


What you produce

experiments/23-roofline-gpu/:

  • peak_flops.py — cuBLAS GEMM benchmark across dtypes.
  • peak_flops.json — measured peak TFLOPS for fp32, fp16, bf16 (and fp8 if supported).
  • decode_attn_gpu.py — port of the Phase-22 MiniGPT decode-attention (single step, attention only) to cupy arrays; measures latency and computes arithmetic intensity and attained TFLOPS.
  • roofline.png — the plot.
  • roofline.json — all data feeding the plot.
  • interpretation.md — 3 paragraphs.

The benchmarks

Block A — measured peak FLOPS via cuBLAS GEMM

For each dtype in {fp32, fp16, bf16} (and fp8 if compute capability ≥ 8.9, i.e. Ada, Hopper, or newer):

  • Run cupy.matmul(A, B) for square matrices 4096 × 4096 (large enough to be compute-bound).
  • Warm up 3 iters, then 100 iters timed.
  • Record measured TFLOPS (2·N³ FLOPs per N × N GEMM, divided by time per iteration). Compare to the vendor peak from device_query.json. Expect 70–90% of vendor peak; that is cuBLAS doing real work near the hardware limit.

If measured < 50% of vendor peak: investigate. Likely culprits:

  • Matrix too small (try 8192 × 8192 if HBM allows).
  • Boost clock not engaged (load GPU with sustained work and re-measure).
  • Wrong cuBLAS path (Tensor Core vs CUDA core); cuBLAS chooses by heuristic and may pick wrong.
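
The Block A procedure could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the reference solution: the names `gemm_tflops` and `bench_gemm` are hypothetical, a square GEMM is counted as 2·n³ FLOPs, and wall-clock timing with a device synchronize on both sides stands in for CUDA events. CuPy is imported lazily so the arithmetic helper can be checked on a machine without a GPU.

```python
import time


def gemm_tflops(n, seconds_per_iter):
    """TFLOPS for one n x n x n GEMM, counted as 2*n^3 FLOPs (multiply + add)."""
    return 2.0 * n ** 3 / seconds_per_iter / 1e12


def bench_gemm(n=4096, dtype="float16", warmup=3, iters=100):
    """Time cupy.matmul on square matrices and return measured TFLOPS."""
    import cupy as cp  # lazy import: only needed on the GPU instance

    # Random inputs, cast to the dtype under test (both operands share it).
    a = cp.random.random((n, n)).astype(dtype)
    b = cp.random.random((n, n)).astype(dtype)

    for _ in range(warmup):
        cp.matmul(a, b)
    cp.cuda.Device().synchronize()  # drain warm-up work before starting the clock

    t0 = time.perf_counter()
    for _ in range(iters):
        cp.matmul(a, b)
    cp.cuda.Device().synchronize()  # wait for every timed kernel to finish
    seconds_per_iter = (time.perf_counter() - t0) / iters

    return gemm_tflops(n, seconds_per_iter)
```

On the GPU instance, something like `bench_gemm(dtype="float32")` and `bench_gemm(dtype="float16")` would give the two numbers to compare against the vendor peak. The sketch is only meant for dtype strings NumPy defines; how bf16 and fp8 arrays are expressed depends on your CuPy version and is left open here.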

Block B — Phase-22 decode-attention on GPU

  • Load Phase-17 MiniGPT weights from disk to host, then to device (cupy.asarray).
  • Port the attention forward path: cache K, V in cupy arrays; query is a single token's hidden state.
  • For S in {128, 256, 512, 1024}:
      • Pre-fill the cache with S random K, V entries.
      • Time 100 decode steps (single-token attention against a cache of size S).
      • Synchronize before and after timing.
      • Compute per-step latency (ms), FLOPs per step and bytes per step (both from the theory in phase-22/theory/03), then intensity = FLOPs / bytes and attained TFLOPS = FLOPs / (latency in seconds × 10¹²).
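
One way the timing loop above might look is sketched here. It is an illustration under assumptions, not the Phase-22 port itself: the function names, the cache layout `(n_heads, S, head_dim)`, the default head counts, and the use of random stand-ins for the query are all hypothetical. Each step is timed individually with CUDA events so a median can be reported, and the cache is only read, never appended to.

```python
import math
import statistics


def summarize_ms(samples_ms):
    """Median and mean of per-step latencies; the median resists outliers better."""
    return {"median_ms": statistics.median(samples_ms),
            "mean_ms": statistics.mean(samples_ms)}


def time_decode_attention(S, n_heads=8, head_dim=64, dtype="float32",
                          steps=100, warmup=10):
    """Time single-token attention against a pre-filled cache of S entries."""
    import cupy as cp  # lazy import: only needed on the GPU instance

    # Random stand-ins for the cache and the query (shapes are an assumption).
    K = cp.random.standard_normal((n_heads, S, head_dim)).astype(dtype)
    V = cp.random.standard_normal((n_heads, S, head_dim)).astype(dtype)
    q = cp.random.standard_normal((n_heads, head_dim, 1)).astype(dtype)
    scale = 1.0 / math.sqrt(head_dim)

    def step():
        scores = cp.matmul(K, q) * scale                     # (H, S, 1)
        scores = scores - scores.max(axis=1, keepdims=True)  # stabilize exp
        probs = cp.exp(scores)
        probs = probs / probs.sum(axis=1, keepdims=True)     # softmax over S
        return cp.matmul(probs.transpose(0, 2, 1), V)        # (H, 1, d)

    for _ in range(warmup):
        step()

    start, end = cp.cuda.Event(), cp.cuda.Event()
    samples = []
    for _ in range(steps):
        cp.cuda.Device().synchronize()  # nothing else in flight when timing starts
        start.record()
        step()
        end.record()
        end.synchronize()               # wait until the step has finished
        samples.append(cp.cuda.get_elapsed_time(start, end))  # milliseconds
    return summarize_ms(samples)
```

Because the step is written as several separate CuPy calls, the measured latency includes the launch overhead of each kernel; at small S that overhead may dominate the memory traffic.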

Phase-22 said intensity ≈ 0.5–1.0 FLOPs/byte (dtype-dependent). On the GPU with an fp32 cache, I = 0.5 FLOPs/byte. With an fp16 cache (if you cast K and V to fp16 first), I = 1.0 FLOPs/byte, because the same FLOPs move half as many bytes.
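
The 0.5 and 1.0 figures can be reproduced with a simple cost model. The accounting below is an assumption, since the theory note in phase-22/theory/03 is not reproduced here: it counts only the two matrix-vector products (q·Kᵀ and probs·V, 2·S·d FLOPs each per head) and one read of K and V, and ignores the softmax, the query, and the output. The function names are hypothetical.

```python
def decode_attention_cost(S, head_dim, n_heads, bytes_per_elem):
    """FLOPs and bytes for one decode step under the simplified model."""
    flops = 4 * S * head_dim * n_heads                          # q.K^T plus probs.V
    bytes_moved = 2 * S * head_dim * n_heads * bytes_per_elem   # read K and V once
    return flops, bytes_moved


def intensity(flops, bytes_moved):
    """Arithmetic intensity in FLOPs per byte."""
    return flops / bytes_moved


def attained_tflops(flops, latency_ms):
    """Attained TFLOPS from a FLOP count and a per-step latency in milliseconds."""
    return flops / (latency_ms * 1e-3) / 1e12


# Head count and head size below are placeholders, not MiniGPT's real config.
for name, width in (("fp32", 4), ("fp16", 2)):
    for S in (128, 256, 512, 1024):
        f, b = decode_attention_cost(S, head_dim=64, n_heads=8, bytes_per_elem=width)
        print(name, S, intensity(f, b))
```

Under this model the intensity is 2 / bytes_per_elem regardless of S, head count, or head size, which is why the dots for all four S values share one x-coordinate per dtype.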

Block C — plot

  • x-axis: arithmetic intensity in FLOPs/byte (log scale).
  • y-axis: TFLOPS (log scale).
  • Ceiling lines:
      • fp32 CUDA core peak (measured, not datasheet).
      • fp16 Tensor Core peak (measured).
      • bf16 Tensor Core peak (measured).
      • HBM bandwidth slope (measured D2D from lab 02).
  • Operator dots:
      • cuBLAS GEMM at 4096×4096 fp16 (measured peak; intensity ≈ 2048 FLOPs/byte counting only the reads of A and B, ≈ 1365 if the write of C is counted too).
      • cuBLAS GEMM at 4096×4096 fp32 (measured; half the fp16 intensity, since each element is twice as wide).
      • Phase-22 decode-attention fp32 (intensity 0.5, attained from Block B).
      • Phase-22 decode-attention fp16 (intensity 1.0, attained).
  • Annotate the fraction of peak near each operator dot.
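
A possible shape for the plotting code is sketched below. Everything here is illustrative: the function names, the argument layout (`ceilings` as label → measured TFLOPS, `dots` as label → (intensity, attained TFLOPS, ceiling label)), and the choice of matplotlib are assumptions, not the lab's required interface. The GEMM intensity helper makes the byte-counting convention explicit: counting two matrices gives the 2048 quoted above for fp16 at n = 4096, counting three gives about 1365.

```python
def gemm_intensity(n, bytes_per_elem, matrices_moved=3):
    """FLOPs/byte for an n x n GEMM if `matrices_moved` n x n matrices cross HBM once."""
    return 2.0 * n ** 3 / (matrices_moved * n * n * bytes_per_elem)


def roofline_tflops(intensity, peak_tflops, bandwidth_tbs):
    """Attainable TFLOPS: the lower of the compute ceiling and bandwidth * intensity."""
    return min(peak_tflops, bandwidth_tbs * intensity)


def plot_roofline(ceilings, bandwidth_tbs, dots, path="roofline.png"):
    """Draw one roofline per dtype ceiling plus annotated operator dots."""
    import matplotlib
    matplotlib.use("Agg")  # render to file; no display needed on a cloud box
    import matplotlib.pyplot as plt

    xs = [2.0 ** (k / 4) for k in range(-8, 57)]  # 0.25 .. 16384 FLOPs/byte
    fig, ax = plt.subplots()
    for label, peak in ceilings.items():
        ys = [roofline_tflops(x, peak, bandwidth_tbs) for x in xs]
        ax.plot(xs, ys, label=label)
    for label, (x, y, ceiling) in dots.items():
        ax.scatter([x], [y])
        pct = 100.0 * y / ceilings[ceiling]
        ax.annotate(f"{label} ({pct:.1f}% of {ceiling})", (x, y))
    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.set_xlabel("Arithmetic intensity (FLOPs/byte)")
    ax.set_ylabel("TFLOPS")
    ax.legend()
    fig.savefig(path, dpi=150)
```

With bandwidth in TB/s and intensity in FLOPs/byte, their product is already in TFLOPS, so no unit conversion is needed between the slope and the ceilings.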

Block D — interpret in interpretation.md

Three paragraphs:

  1. Where do the cuBLAS dots land? Right at the corresponding ceiling? Within 10–20% of ceiling? This validates that the roofline diagnosis is real — cuBLAS at large sizes saturates the dtype's peak.
  2. Where do the decode-attention dots land? They should sit deep on the memory-bandwidth slope, as predicted by Phase-22 theory. What fraction of the fp16 Tensor Core peak are you achieving? It should be roughly 0.3–1%, which is exactly the "decode is 1% of peak" diagnosis.
  3. Compare to the CPU roofline from Phase-1. What is the ratio of attained TFLOPS on the same operator (decode-attention)? Is it the ratio of HBM-to-DRAM bandwidths (about 100×)? Or smaller (constant-factor overhead from PCIe, kernel launch, etc.)?
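
For the second paragraph, the expected fraction of peak follows directly from the roofline: a memory-bound operator can reach at most its intensity divided by the ridge intensity (peak / bandwidth). The sketch below works that arithmetic through. The function names are hypothetical, and the peak and bandwidth values are round placeholders chosen for illustration, not measurements of any particular GPU; substitute your own numbers from Blocks A and lab 02.

```python
def ridge_intensity(peak_tflops, bandwidth_tbs):
    """Intensity (FLOPs/byte) where the bandwidth slope meets the compute ceiling."""
    return peak_tflops / bandwidth_tbs


def memory_bound_fraction(intensity, peak_tflops, bandwidth_tbs):
    """Best-case fraction of peak for an operator at the given intensity."""
    return min(1.0, intensity / ridge_intensity(peak_tflops, bandwidth_tbs))


# Placeholder values, NOT measurements: replace with your measured numbers.
PEAK_TFLOPS = 200.0
BANDWIDTH_TBS = 2.0

for label, i in (("fp32 cache", 0.5), ("fp16 cache", 1.0)):
    frac = memory_bound_fraction(i, PEAK_TFLOPS, BANDWIDTH_TBS)
    print(f"{label}: at most {100 * frac:.2f}% of peak")
```

With these placeholders the ridge sits at 100 FLOPs/byte, so the bound is 0.50% for the fp32 cache and 1.00% for the fp16 cache. A measured fraction well below the bound points at overhead other than HBM traffic, such as kernel launches.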

TODOs (consolidated)

  • Block A: peak_flops.py, peak_flops.json, comparison to vendor peak.
  • Block B: decode_attn_gpu.py, latency measurements at S in {128, 256, 512, 1024}.
  • Block C: roofline.png with all ceilings + dots + annotations.
  • Block D: interpretation.md with 3 paragraphs.
  • Manifest at experiments/23-roofline-gpu/manifest.json.

Constraints

  • cupy only. Not PyTorch yet. (Plan §7.g.)
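  • Time only after a device synchronize (for example cupy.cuda.runtime.deviceSynchronize()), both before and after the timed block. Kernel launches are asynchronous, so an unsynchronized timer measures launch cost, not execution.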
  • Use measured peaks, not datasheet peaks, for the ceiling lines.
  • Don't include the cache write time in the decode-attention measurement. That's a separate operator; we're measuring the attention only.

Stop conditions

Done when:

  1. cuBLAS GEMM in 3+ dtypes measured, all within 30% of vendor peak.
  2. Decode-attention measured at 4+ S values, plotted on the roofline.
  3. roofline.png committed with all ceilings and dots.
  4. interpretation.md answers all three questions.
  5. manifest.json complete.

Pitfalls

  • cupy.matmul may pick an fp32 path for fp16 inputs depending on the cuBLAS version. Make sure both inputs have the same dtype, and check which kernel actually ran (via ncu or nsys if available; nvprof does not support GPUs of compute capability 8.0 and higher). Otherwise trust the throughput number.
  • Each decode-attention step is very short, so individual timings are noisy. If 100 steps give unstable numbers, run 1000 iterations, and report the median, not the mean.
  • Sizes determine which memory tier you measure: a working set larger than L2 streams from HBM, while a smaller one can be served from L2. Pick sizes where you're confident about which tier you're testing.
  • Cost watch. This lab can chew through 1–2 cloud-GPU hours. Watch the budget.

When to consult solutions/

After all stop conditions are met. The reference at solutions/03-gpu-roofline-ref.md shows expected numbers for the reference GPU.


Next: PHASE_23_REPORT.md. The phase is done after report + reflection + a final "instance terminated, total cost recorded" item on the checkpoint.