English · Español

03 — Decode is Memory-Bound (and Why That Reshapes Everything)¶

🇪🇸 La intensidad aritmética del paso de decode con caché es ~0.5 FLOPs/byte. Eso lo sitúa muy por debajo del machine balance de cualquier GPU moderna (típicamente 50–300 FLOPs/byte). Conclusión: el cuello de botella no es cómputo, es leer la caché de memoria. Todo lo demás que oirás sobre inferencia eficiente sigue de ahí.

This page does for the decode step what phase-01/theory/03-roofline-model.md did for matmul: place the operator on the roofline and read off the regime.

Setup¶

A single decode step, single sequence, single layer, all \(H\) heads at once. The cache is currently length \(S\).

The attention work per head is:

\(q \in \mathbb{R}^{1 \times d_h}\) (current token's query for this head).
\(K \in \mathbb{R}^{S \times d_h}\), \(V \in \mathbb{R}^{S \times d_h}\) (cached keys and values for this head).
\(\text{scores} = q K^\top / \sqrt{d_h} \in \mathbb{R}^{1 \times S}\).
\(\text{weights} = \text{softmax}(\text{scores}) \in \mathbb{R}^{1 \times S}\).
\(\text{out} = \text{weights} \cdot V \in \mathbb{R}^{1 \times d_h}\).

Repeat for all \(H\) heads independently.

FLOPs¶

Per head: - \(q K^\top\): \(S \cdot d_h\) multiplies + adds = \(2 S d_h\) FLOPs. - softmax: ~\(5 S\) FLOPs (exp + sum + divide + subtract max), but constant factor small. Ignore for asymptotics. - \(\text{weights} \cdot V\): \(2 S d_h\) FLOPs.

Per head, per layer, per step: \(\approx 4 S d_h\) FLOPs (rounding scaling and softmax overhead).

All heads, per layer, per step: \(4 S \cdot d_h \cdot H = 4 S d\) FLOPs.

All layers: \(4 L S d\) FLOPs.

Bytes moved¶

This is the load-bearing calculation. Per head, the operator must read:

\(q\): \(d_h\) elements. Tiny.
\(K\): \(S \cdot d_h\) elements. Linear in \(S\).
\(V\): \(S \cdot d_h\) elements. Linear in \(S\).
Writes: \(\text{out}\) at \(d_h\), plus \(k_\text{new}, v_\text{new}\) at \(d_h\) each. Tiny.

Reads per head: \(\approx 2 S d_h\) elements, dominated by K and V.

Across \(H\) heads in one layer: \(2 S d_h H = 2 S d\) elements.

Across \(L\) layers: \(2 L S d\) elements. At \(s\) bytes/element: \(\boxed{B_\text{decode-cache} = 2 L S d s}\) bytes.

(We're ignoring weight bytes here because we already accounted for them in 01-prefill-vs-decode.md and they dominate a different operator — the FFN. Here we focus on the attention operator only, to isolate why the cache is the bottleneck.)

Arithmetic intensity¶

\[I_\text{attn-decode} = \frac{F}{B} = \frac{4 L S d}{2 L S d s} = \frac{2}{s}\]

dtype	\(s\)	\(I_\text{attn-decode}\)
fp32	4	0.5 FLOPs/byte
fp16	2	1.0 FLOPs/byte
int8	1	2.0 FLOPs/byte

Notice what isn't in the formula. No \(S\). No \(L\). No \(d\). The arithmetic intensity of the decode-attention operator is a constant — determined only by your cache dtype.

That constant is 0.5–2.0. For comparison, an A100 has machine balance \(\pi / \beta \approx 312 / 2 \approx 156\) FLOPs/byte. An H100 sits around 280. An i5-8250U (Phase 1 numbers) sits around 10.

On any modern GPU, decode attention runs at 0.3–1% of peak FLOPS, with the FPUs idle waiting on HBM. On the i5-8250U, it runs at maybe 5% of peak. The decoder is memory-bound everywhere.

The "weight read" companion fact¶

The attention operator above is the cache portion only. But each decode step also re-reads the model weights to do the FFN, the QKV projection, and the output projection. The weight matrices are the same across all sequence positions:

QKV projection: \(3 d^2\) elements per layer.
FFN: \(8 d^2\) elements per layer.
Output projection: \(d^2\) elements per layer.

Per layer per step: \(\approx 12 d^2\) weight elements. Across \(L\) layers: \(12 L d^2\) elements. At \(s\) bytes/element: \(12 L d^2 s\) bytes — per decode step.

For Llama-2-7B, fp16: \(12 \cdot 32 \cdot 4096^2 \cdot 2 \approx 12.9 \cdot 10^9\) bytes ≈ 13 GiB read per token.

On an A100 with 2 TB/s HBM, that's \(13 / 2000 \approx 6.5\) ms of pure memory time per token, lower bound. Real numbers are 10–15 ms/token — close to the bound. The model is decoding as fast as the HBM can spit out weights, regardless of how many FPUs are sitting around.

The roofline diagram of decode¶

Plot the decode attention operator on a GPU roofline:

GFLOPS (log)
   ^
 π ┤                              ╭─────────────  ← compute ceiling (e.g. 312 TFLOPS A100)
   │                             ╱
   │                            ╱
   │                           ╱
   │                          ╱
   │   • decode attn (fp16)  ╱  ← machine balance at ~156 FLOPs/byte
   │     I=1.0, far left    ╱
   │                       ╱
   │  • decode attn (fp32)╱
   │    I=0.5            ╱
   │   ↓               ╱
   │  on the slope    ╱
   └─────────────────┴─────────────────────────→
                    I_crit            arithmetic intensity (log)

The two decode-attention dots sit deeply on the memory-bound slope, far below the corner. To move them right (raise intensity):

Quantize the cache. fp16 → int8 doubles intensity. int8 → int4 doubles again. (Phase 26.)
Reduce cache rows. GQA: share K, V across head groups, reducing the \(K, V\) bytes read per query head. The math changes from \(2 S d\) to \(2 S d \cdot (H_{KV} / H)\). (Phase 27.)
Reduce cache columns. Sliding-window attention: read only the last \(W \ll S\) tokens. Bytes drop to \(2 L W d s\), constant in \(S\) once \(S > W\). (Phase 27.)
Restructure to read the cache once per group of queries. Speculative decoding: \(k\) candidate tokens validated in parallel use the same cache read. Effective intensity rises \(k\)-fold. (Phase 36.)
Restructure to fuse softmax with mat-mul (Flash-decoding). Stream the cache through SRAM, never materialize the softmax intermediate. Intensity unchanged but \(\beta\) effectively rises because you stay in SRAM longer. (Phase 24.)

Every one of those is a different attack on the same diagnosis: the dot is too far left on the roofline. Re-read this list when you read the optimization names in later phases — none of them is mysterious once you see them as movements on this plot.

The batching escape hatch¶

Batching has a special role in the decode-memory story.

Imagine \(B\) users all decoding concurrently. The model weights are shared across users — read once per step, used by \(B\) queries. Cache is per-user (still \(2 L S d s\) per user; total \(2 L B S d s\)).

Weight bytes per step: \(12 L d^2 s\) (unchanged — read once, applied to \(B\) queries).

Total FLOPs per step (FFN dominates): \(\approx 24 L B d^2\).

Weight intensity: \(24 L B d^2 / (12 L d^2 s) = 2 B / s\).

So batching by \(B\) raises the FFN's arithmetic intensity by a factor of \(B\). At \(B = 16\) fp16: \(I = 16\). At \(B = 64\) fp16: \(I = 64\). That brings the FFN dot up toward the corner. The cache-attention dot, however, is unaffected by batching — each user reads their own cache.

This is why serving systems push batch sizes hard (the FFN is the heavy expense and benefits) but PagedAttention is necessary (the per-user cache is what blocks batching at high \(B\) with variable-length sequences).

Drill problems¶

Phase-1 i5-8250U roofline (from docs/phase-01.../theory/03-roofline-model.md): \(\pi \approx 200\) GFLOPS, \(\beta \approx 20\) GB/s. Decode attention at fp32: \(I = 0.5\). Attainable performance? At fp16 (\(I = 1.0\))?
On an H100 (\(\pi \approx 990\) TFLOPS fp16, \(\beta \approx 3.35\) TB/s): machine balance? Where does decode-attention-fp16 sit? What fraction of peak?
Why doesn't lowering \(d_h\) (smaller heads) help the intensity? (Hint: both FLOPs and bytes scale linearly in \(d_h\) — the ratio is invariant.)
Grammar MiniGPT decodes "Yesterday I worked" at fp32 on Borja's i5-8250U. Predict the per-token decode attention latency. At our scale this is unmeasurable noise — Python interpreter overhead dwarfs it. Explain why the formula is still the right tool: how do you extrapolate from MiniGPT-scale measurements to predict Llama-2-7B decode latency?

Why this page is the page for the rest of the curriculum¶

Phase 22 is the first time you'll measure an operator that turns out to be memory-bound. It will not be the last. From Phase 22 forward, every time someone proposes "speed up inference", the question to ask is:

Is this operator currently compute-bound or memory-bound? If memory-bound, what does this optimization do to its arithmetic intensity?

That question is sometimes embarrassingly clarifying. Many proposed "kernel fusions" turn out to be no-ops because they don't change the bytes moved. Many "quantization wins" turn out to compound exactly because they directly raise intensity. Once the diagnostic is sharp, the field becomes legible.

What this page does NOT cover¶

The HBM↔SRAM rearrangement that Flash-Attention/Flash-Decoding actually does. Named here; mechanism is Phase 24 / 27.
PagedAttention's block allocator. Sketched in theory/04-toward-paged-attention.md; derived in Phase 27.
Quantized-cache arithmetic. Phase 26.
Speculative decoding math. Phase 36.
CUDA-level measurement. This page argues from intensity; Phase 24 measures from kernels.

Next: theory/04-toward-paged-attention.md — a short preview of why dense pre-allocated caches break under realistic serving workloads.