04 — GPU Roofline¶
The roofline equation (perf = min(π, I × β)) is the same as in Phase 1, but now with multiple ceilings (one per dtype: fp64, fp32, fp16/bf16/fp8) and a machine balance between 50 and 300 FLOPs/byte. The cached attention in Phase 22's decode is still memory-bound here, even more deeply than on the CPU.
This page is the GPU version of docs/phase-01-hardware-substrate/theory/03-roofline-model.md. The equation is identical; the constants are not; the operator placements you'll do in lab will use this plot.
The equation, unchanged¶
\[ \text{perf} = \min(\pi,\ I \times \beta) \]
- \(\pi\): peak FLOPS (the compute ceiling).
- \(\beta\): peak bandwidth (the memory ceiling slope).
- \(I = F / B\): arithmetic intensity (FLOPs per byte).
Same as Phase 1. Only difference: on a GPU, \(\pi\) is dtype-dependent in a much bigger way than on CPU.
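As a minimal sketch, the equation is a one-line function. The function name `attainable_flops` and the constant names are illustrative, and the two constants are assumed to be the A100 fp16 Tensor Core peak and HBM bandwidth quoted on this page.

```python
def attainable_flops(intensity, peak_flops, peak_bandwidth):
    """Roofline bound in FLOP/s: min of the compute ceiling and the memory slope."""
    return min(peak_flops, intensity * peak_bandwidth)


# Assumed constants: A100 fp16 Tensor Core peak and HBM bandwidth as used on this page.
PEAK_FP16_TC = 312e12   # pi, FLOP/s
HBM_BANDWIDTH = 2e12    # beta, bytes/s

low = attainable_flops(1.0, PEAK_FP16_TC, HBM_BANDWIDTH)      # limited by I * beta
high = attainable_flops(1000.0, PEAK_FP16_TC, HBM_BANDWIDTH)  # limited by pi
print(f"I=1:    {low / 1e12:.1f} TFLOPS")
print(f"I=1000: {high / 1e12:.1f} TFLOPS")
```

Swapping in a different \(\pi\) for each dtype while holding \(\beta\) fixed gives the family of ceilings discussed next.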
Multiple ceilings, one plot¶
A modern GPU has separate peak throughput numbers for:
- fp64 (CUDA cores)
- fp32 (CUDA cores)
- TF32 (Tensor Cores; ~8× fp32)
- fp16 / bf16 (Tensor Cores; ~16× fp32)
- fp8 (Tensor Cores; ~32× fp32, H100+)
- int8 (Tensor Cores; same as fp8 throughput-wise)
- int4 (Tensor Cores; supported on A100)
For example, an A100:
| Dtype | Peak TFLOPS (Tensor Cores) | Peak TFLOPS (CUDA cores) |
|---|---|---|
| fp64 | 19.5 (dense) | 9.7 |
| fp32 / TF32 | 156 (TF32) | 19.5 |
| fp16 / bf16 | 312 | 78 |
| fp8 | N/A (H100+) | N/A |
| int8 | 624 | 312 |
HBM bandwidth: ~1.55 TB/s (A100 40GB, PCIe or SXM4), ~1.94 TB/s (A100 PCIe 80GB), ~2.0 TB/s (A100 SXM4 80GB). The rest of this page uses \(\beta = 2\) TB/s.
Each dtype gives a different compute ceiling line on the roofline. They all share the same memory ceiling slope (HBM bandwidth is dtype-agnostic — it moves bytes, not FLOPs). So the plot looks like:
TFLOPS (log)
^
│ ╭───────── ← int8 Tensor Cores (624; fp8 is H100+)
│ ╱
│ ╭───────────── ← fp16/bf16 TC (312)
│ ╱
│ ╭───────────────── ← TF32 TC (156)
│ ╱
│ ╭──────────────────── ← fp32 CUDA cores (19.5)
│ ╱
│ ╭────────────────────────── ← fp64 (9.7)
│ ╱
│ memory ceiling slope = β = 2 TB/s
│ (same for every dtype)
│ ╱
│ ╱
└──────────────────────────────────────────→
I (FLOPs/byte, log)
The corners (\(I_\text{crit}\) for each dtype) shift right as the compute ceiling rises:
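- int8 (the top ceiling on A100, which has no fp8): \(I_\text{crit} = 624 / 2 = 312\) ops/byte
- fp16/bf16: \(I_\text{crit} = 312 / 2 = 156\) FLOPs/byte
- TF32: \(I_\text{crit} = 156 / 2 = 78\)
- fp32 (CUDA core): \(I_\text{crit} = 19.5 / 2 = 9.75\)
- fp64 (CUDA core): \(I_\text{crit} = 9.7 / 2 = 4.85\)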
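The corners can be recomputed from the table in a few lines. This is a sketch using the A100 peaks listed above and \(\beta = 2\) TB/s; the dictionary keys and the helper name `critical_intensity` are illustrative, not part of the lab code.

```python
BETA_TBS = 2.0  # HBM bandwidth in TB/s, the value this page uses for A100

# Peak throughput per ceiling in TFLOPS (TOPS for int8), from the A100 table above.
PEAKS = {
    "int8 TC": 624.0,
    "fp16/bf16 TC": 312.0,
    "TF32 TC": 156.0,
    "fp32 CUDA": 19.5,
    "fp64 CUDA": 9.7,
}


def critical_intensity(peak_tflops, beta_tbs):
    """Intensity at which the memory slope meets the compute ceiling (FLOPs/byte)."""
    return peak_tflops / beta_tbs


corners = {name: critical_intensity(peak, BETA_TBS) for name, peak in PEAKS.items()}
for name, corner in corners.items():
    print(f"{name:>13}: I_crit = {corner:g} FLOPs/byte")
```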
A kernel at \(I = 1\) FLOP/byte (decode attention fp16) is on the memory slope for every dtype — far below all the corners. Memory-bound regardless of dtype.
A kernel at \(I = 20\) FLOPs/byte is:
- memory-bound for fp16 Tensor Cores (below \(I_\text{crit} = 156\)).
- compute-bound for fp32 CUDA cores (above \(I_\text{crit} = 9.75\)).
Same kernel, different regime under different precision. This is why "use fp16" is not a uniform win. For a kernel well inside the compute-bound regime, halving the memory traffic buys nothing; any gain has to come from a higher compute ceiling, which only happens if the kernel actually runs on Tensor Cores. For a kernel in the memory-bound regime, fp16 doubles intensity (halves bytes per FLOP) and gives a near-2× speedup.
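The regime argument can be checked numerically. The sketch below models kernel time as the larger of compute time and memory time, which is the roofline assumption; `regime`, `kernel_time`, and the FLOP and byte counts are made-up illustrative values, not measurements.

```python
BETA = 2e12            # bytes/s, A100 HBM as used on this page
PEAK_FP16_TC = 312e12  # FLOP/s
PEAK_FP32_CUDA = 19.5e12


def regime(intensity, peak, beta=BETA):
    """Classify a kernel by comparing its intensity with the corner peak / beta."""
    return "compute-bound" if intensity >= peak / beta else "memory-bound"


def kernel_time(flops, nbytes, peak, beta=BETA):
    """Roofline time model: the slower of the compute phase and the memory phase."""
    return max(flops / peak, nbytes / beta)


# The I = 20 kernel lands in different regimes under different ceilings.
print(regime(20, PEAK_FP16_TC), "/", regime(20, PEAK_FP32_CUDA))

# Halving bytes at a fixed ceiling: large gain if memory-bound, none if compute-bound.
mem_speedup = kernel_time(1e9, 1e9, PEAK_FP16_TC) / kernel_time(1e9, 0.5e9, PEAK_FP16_TC)
cmp_speedup = kernel_time(1e12, 1e9, PEAK_FP16_TC) / kernel_time(1e12, 0.5e9, PEAK_FP16_TC)
print(f"memory-bound kernel:  {mem_speedup:.2f}x from halving bytes")
print(f"compute-bound kernel: {cmp_speedup:.2f}x from halving bytes")
```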
Re-placing the Phase-22 operators on the GPU roofline¶
Now the payoff. The same operators we placed on the CPU roofline in Phase 22 land here on the GPU plot:
Prefill attention (\(P \times P\))¶
- FLOPs per layer: \(4 P^2 d\) (\(2 P^2 d\) for \(QK^\top\) and \(2 P^2 d\) for the attention-weighted sum over \(V\)). For Llama-2-7B fp16 (\(d=4096\)), \(P=2048\): \(\approx 7 \cdot 10^{10}\) FLOPs/layer × 32 layers = \(2.2 \cdot 10^{12}\) FLOPs.
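- Bytes (no Flash-Attention, materializing the \(P \times P\) matrix for each of the \(h = 32\) heads): \(\sim 4 h P^2 \cdot 2\) bytes of fp16 attention-matrix traffic (scores and probabilities are each written and read back), plus the Q, K, V reads. The working set is large and spills to HBM.
- Intensity: dominated by attention-matrix traffic, \(I \approx 4 P^2 d / (8 h P^2) = d / (2h) = 64\) FLOPs/byte, independent of \(P\). That is below the fp16 TC corner (\(I_\text{crit} = 156\)), so the unfused kernel is memory-bound and cannot reach the fp16 TC ceiling.
- With Flash-Attention (Phase 24 lab): the attention matrix is never materialized and the working set stays in SMEM, so HBM traffic drops to roughly the Q, K, V reads and the output write (\(\sim 4 P d \cdot 2\) bytes) and intensity rises to \(\sim P/2 \approx 1024\) for \(P = 2048\), which is deeply compute-bound. Same compute, fewer bytes, same ceiling, but now the kernel can actually approach it (Flash-Attention reaches ~75% of fp16 TC peak in practice).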
Decode-step attention (\(1 \times S\) per layer)¶
- FLOPs per layer per step: \(\approx 4 S d\). For Llama-2-7B, \(S=4096\): \(4 \cdot 4096 \cdot 4096 = 6.7 \cdot 10^7\) FLOPs/layer/step. Times 32 layers = \(2.1 \cdot 10^9\) FLOPs/step.
- Bytes per layer per step: \(2 S d \cdot s\) for cache read. \(S=4096\), \(s=2\): \(6.7 \cdot 10^7\) bytes/layer/step. Times 32 layers = \(2.1 \cdot 10^9\) bytes/step.
- Intensity: \(2.1 \cdot 10^9 / 2.1 \cdot 10^9 = 1\) FLOP/byte. As derived in Phase 22.
- Place on roofline: \(I = 1\), far left of every corner. Memory-bound at \(\text{perf} = 1 \cdot 2\text{ TB/s} = 2\) TFLOPS. Fraction of fp16 TC peak: 2 / 312 = 0.6%. The FPUs sit at 99.4% idle during decode attention. Same diagnosis as on CPU, with even harsher absolute numbers.
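The decode-attention placement above can be reproduced with plain arithmetic. This sketch assumes the Llama-2-7B shape used in the bullets (\(d = 4096\), 32 layers), fp16 cache entries, and the A100 constants from this page; variable names are illustrative.

```python
d, S, layers = 4096, 4096, 32
s = 2                        # bytes per cache element (fp16)
PEAK_FP16_TC = 312e12        # FLOP/s
BETA = 2e12                  # bytes/s

flops_per_layer = 4 * S * d        # QK^T plus the weighted sum over V, one query token
bytes_per_layer = 2 * S * d * s    # read K and V from the cache

flops_per_step = flops_per_layer * layers
bytes_per_step = bytes_per_layer * layers
intensity = flops_per_step / bytes_per_step

perf = min(PEAK_FP16_TC, intensity * BETA)   # roofline bound
fraction_of_peak = perf / PEAK_FP16_TC

print(f"intensity      = {intensity:.2f} FLOP/byte")
print(f"attainable     = {perf / 1e12:.1f} TFLOPS")
print(f"share of peak  = {100 * fraction_of_peak:.2f}%")
```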
Decode-step FFN¶
- FLOPs per layer per step: \(24 d^2\) (2 FLOPs per weight over the \(\approx 12 d^2\) weights in a layer; this count covers the attention projections as well as the FFN matmuls). For Llama-2-7B: \(24 \cdot 4096^2 = 4 \cdot 10^8\) FLOPs/layer/step. Times 32 = \(1.3 \cdot 10^{10}\) FLOPs/step.
- Bytes per layer per step: \(\approx 12 d^2 s\) (weight read). \(12 \cdot 4096^2 \cdot 2 = 4 \cdot 10^8\) bytes/layer/step. Times 32 = \(1.3 \cdot 10^{10}\) bytes/step ≈ 12.9 GB (12 GiB) read per token.
- Intensity: \(1.3 \cdot 10^{10} / 1.3 \cdot 10^{10} = 1\) FLOP/byte. Same as decode attention. Memory-bound at 2 TFLOPS.
- Time at 2 TB/s HBM: 12.9 GB / 2000 GB/s ≈ 6.4 ms / token. This is the floor for single-stream decode on A100. Real measurements come in at 10–15 ms, roughly 1.5–2.3× the bound.
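The per-token floor follows directly from the weight bytes and the bandwidth. A minimal sketch, assuming the \(12 d^2\) weights-per-layer count used above, fp16 weights, and \(\beta = 2\) TB/s; it ignores KV-cache reads and kernel launch overhead, so it is a lower bound only.

```python
d, layers = 4096, 32
s = 2          # bytes per weight (fp16)
BETA = 2e12    # bytes/s, A100 HBM as used on this page

weight_bytes = 12 * d * d * s * layers   # every weight is read once per decoded token
floor_seconds = weight_bytes / BETA
max_tokens_per_second = 1.0 / floor_seconds

print(f"weights read per token: {weight_bytes / 1e9:.1f} GB ({weight_bytes / 2**30:.1f} GiB)")
print(f"bandwidth floor:        {floor_seconds * 1e3:.2f} ms/token")
print(f"single-stream ceiling:  {max_tokens_per_second:.0f} tokens/s")
```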
Decode-step FFN, batched (\(B\) sequences)¶
- FLOPs per layer per step: \(24 B d^2\).
- Bytes per layer per step: \(\approx 12 d^2 s\) (weights read once, applied to \(B\) rows).
- Intensity: \(2B / s = B\) (for fp16). At \(B=16\): \(I = 16\). Still memory-bound for fp16 TC (\(I_\text{crit} = 156\)). Need \(B \geq 156\) to become compute-bound for fp16 TC FFN, which is impractical (KV cache for 156 concurrent sequences = 312 GiB at 4k context — way over single-GPU). Realistic decode batching (\(B \in [16, 64]\)) stays memory-bound but gets a \(B\)-fold throughput improvement.
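Both halves of the batching argument, the intensity gain and the KV-cache cost, fit in a short sketch. The helper names are hypothetical, and the KV-cache formula assumes full multi-head attention with fp16 entries (K and V each of size \(S \times d\) per layer), as in the decode-attention section.

```python
def batched_ffn_intensity(batch, bytes_per_weight=2):
    """24*B*d^2 FLOPs over 12*d^2*s weight bytes simplifies to 2*B/s."""
    return 2 * batch / bytes_per_weight


def kv_cache_bytes(batch, context, d=4096, layers=32, bytes_per_elem=2):
    """K and V (2 tensors) of shape context x d, per layer, per sequence."""
    return 2 * context * d * bytes_per_elem * layers * batch


FP16_TC_CORNER = 156  # FLOPs/byte, A100 fp16 Tensor Core corner from this page

for batch in (1, 16, 64, 156):
    intensity = batched_ffn_intensity(batch)
    cache_gib = kv_cache_bytes(batch, context=4096) / 2**30
    bound = "compute-bound" if intensity >= FP16_TC_CORNER else "memory-bound"
    print(f"B={batch:>3}: I={intensity:>5.0f}  {bound:<13}  KV cache={cache_gib:.0f} GiB")
```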
Prefill GEMM (the large FFN matmul in prefill)¶
- \((P \times d) \cdot (d \times 4d)\). FLOPs: \(8 P d^2\). Bytes, if you naively charge every token its own read of the weights: \(\sim 4 P d^2 s\). Intensity: \(\sim 2/s = 1\) for fp16. Wait, that's the same as decode. Why isn't prefill memory-bound too?
- Answer: at prefill, you process all \(P\) tokens at once; the matmul is \(P\)-dimensional, like decode-with-batch-\(P\). Effective intensity = \(\sim 2 P / s = P\). For \(P = 2048\) fp16: \(I = 2048\) — deep compute-bound. The cost goes back to "matmul FLOPs in fp16 TC", peak ~312 TFLOPS.
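The gap between the naive and the effective intensity can be made concrete. This sketch takes the single \((P \times d)(d \times 4d)\) matmul, counts weight bytes either once per token (the naive accounting) or once in total, and adds a third variant that also counts the activation read and output write; that third variant is an extra assumption beyond the text.

```python
P, d = 2048, 4096
s = 2                      # bytes per element (fp16)
FP16_TC_CORNER = 156       # FLOPs/byte, A100 fp16 Tensor Core corner from this page

flops = 8 * P * d * d                      # 2 * P * d * 4d
weight_bytes = 4 * d * d * s               # the d x 4d weight matrix
activation_bytes = (P * d + P * 4 * d) * s # read X, write the P x 4d output

naive = flops / (P * weight_bytes)                    # weights re-read for every token
weights_once = flops / weight_bytes                   # weights read once for all P tokens
with_activations = flops / (weight_bytes + activation_bytes)

print(f"naive per-token accounting: I = {naive:.0f}")
print(f"weights read once:          I = {weights_once:.0f}")
print(f"plus activation traffic:    I = {with_activations:.0f}")
print("compute-bound for fp16 TC:", with_activations > FP16_TC_CORNER)
```

Even with activation traffic included, the intensity stays far to the right of the fp16 Tensor Core corner, so the regime does not change.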
This is why prefill is fast (compute-bound, hitting near-peak fp16 TC) and decode is slow (memory-bound, ~1% of compute peak). The asymmetry on the GPU is more extreme than on CPU because the GPU's compute-to-bandwidth ratio is so high.
The plot you'll commit at the end of lab¶
The GPU roofline plot (experiments/23-roofline-gpu/roofline.png) should look like the schematic above, with:
- Five compute ceilings (one per relevant dtype).
- One memory ceiling (sloped).
- At least four dots: decode-attention fp16 (I=1), decode-FFN fp16 single-stream (I=1), decode-FFN fp16 batched-16 (I=16), prefill-FFN fp16 (I≈P).
- A horizontal "your measured cuBLAS GEMM" line indicating real attainable peak (typically 70–90% of vendor peak).
This single plot is the operator map for the rest of the inference work. Phase 24 will move dots up the slope (kernel optimization) or right (precision lowering). Phase 27 will reorganize memory layout to move bytes around. Phase 33 will batch to raise effective intensity. Every move has a roofline-vocabulary description.
How this relates to "vendor specs lie"¶
When NVIDIA quotes "989 TFLOPS for H100 fp16", they mean dense fp16 Tensor Core peak. That number assumes:
- Tensor Cores fully fed (not CUDA cores).
- No dependency stalls.
- All 132 SMs active.
- No memory-bound segments.
A real model achieves 30–70% of this number, depending on the operator. cuBLAS GEMM at large sizes (e.g., 8192×8192 matmul) hits ~80%. Anything memory-bound hits <5%.
When you read "H100 is 6× faster than A100", decompose the claim with the roofline: \(\pi_\text{H100} / \pi_\text{A100} = 989/312 \approx 3.2\times\) for fp16 dense, and bandwidth-wise \(\beta_\text{H100} / \beta_\text{A100} \approx 1.6\times\). For compute-bound work, you get ~3.2×. For memory-bound work (decode), you get ~1.6×. Most LLM inference is memory-bound, so the practical speedup is closer to 1.6× per GPU, not 6×. A figure near 6× needs a precision change as well: H100 fp8 dense (~1979 TFLOPS) against A100 fp16 is \(1979/312 \approx 6.3\times\).
A user looking at a vendor benchmark that shows "H100 = 6× A100 for LLM inference" should ask: which operator are they measuring, and at which precision? If it's prefill / training (compute-bound), expect ~3.2× at matched fp16, and ~6× only when the H100 side runs fp8. If it's decode (memory-bound), the 6× also requires batching very hard (so weight reads amortize, raising effective intensity into the compute-bound regime).
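The decomposition can be expressed as a speedup that depends on intensity. A sketch at matched fp16 precision, using the peaks and bandwidths quoted on this page (A100 at \(\beta = 2\) TB/s, H100 at 3.35 TB/s); the `CHIPS` table and function names are illustrative.

```python
CHIPS = {
    "A100": {"peak": 312e12, "beta": 2.0e12},    # fp16 dense TC, HBM bandwidth
    "H100": {"peak": 989e12, "beta": 3.35e12},
}


def attainable(intensity, chip):
    """Roofline bound in FLOP/s for one chip."""
    spec = CHIPS[chip]
    return min(spec["peak"], intensity * spec["beta"])


def h100_over_a100(intensity):
    """Predicted speedup for a kernel of the given intensity, same dtype on both chips."""
    return attainable(intensity, "H100") / attainable(intensity, "A100")


for intensity in (1, 16, 200, 2048):
    print(f"I={intensity:>5}: H100/A100 = {h100_over_a100(intensity):.2f}x")
```

Low-intensity kernels see only the bandwidth ratio, high-intensity kernels see the compute ratio, and kernels between the two corners land in between.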
This is the deepest take-away of the roofline: performance is operator-shaped, not chip-shaped.
Drill problems¶
- Compute machine balance for fp16 Tensor Cores on H100 (\(\pi \approx 989\) TFLOPS, \(\beta \approx 3.35\) TB/s).
- The Phase-22 decode attention at fp16 on H100: place on the H100 fp16 TC roofline. Attainable performance? Fraction of peak?
- A custom Flash-decoding kernel raises decode attention's effective intensity (by keeping the working set in SMEM longer) by ~4×. Where does the dot move? Speedup factor?
- Same kernel, quantized cache (int8 instead of fp16). Bytes halve; FLOPs unchanged. New intensity? New attainable performance?
What you should now be able to do¶
- Sketch the multi-dtype GPU roofline from memory.
- Place any of Phase 22's operators on it correctly.
- Predict speedup from a proposed optimization (Flash, quantization, batching) by computing intensity change.
- Read a vendor claim ("X is 5× faster than Y") and decompose it into compute and memory components.
What this page does NOT cover¶
- Power / TDP ceilings. Real silicon throttles at sustained peak; the rooflines here are nominal.
- Sparse Tensor Cores (2:4 sparsity). They double peak throughput for structured-sparse weights. The quantization survey in Phase 26 mentions them; the deep dive comes in Phase 27+.
- Multi-GPU rooflines (NVLink as a third ceiling). Phase 35.
Next: lab/00-provision-cloud-gpu.md. The mental model is built; time to rent the hardware.