English · Español

05 — KV-cache memory math, worked end-to-end on the Phase-17 mini-GPT¶

🇪🇸 Aplicamos la fórmula bytes = 2 · L · H · d_h · S · B · s al mini-GPT exacto que construiste en la Fase 17. Vemos que el caché ocupa kilobytes, no gigabytes — y por qué incluso a esta escala las cuentas valen la pena: te dan el hábito que necesitas en producción.

This page extends theory/02-memory-cost.md with a concrete numerical worked example using the Phase-17 mini-GPT exactly as specified. The arithmetic is small enough to do by hand; the methodology scales to GPT-3.

Recap of the Phase-17 mini-GPT spec¶

From docs/phase-17-mini-gpt/README.md:

\(d_\text{model} = 64\)
\(n_\text{heads} = H = 4\)
\(d_h = d_\text{model} / n_\text{heads} = 16\)
\(n_\text{layers} = L = 2\)
\(d_\text{ff} = 256\)
Vocab (after BPE on §A13 corpus): \(V \approx 512\)
Max sequence length (per Phase 16): \(S_\text{max} = 128\)

For inference we use full-precision fp32 (\(s = 4\) bytes/element) unless noted.

The formula, instantiated¶

\[\text{bytes}_\text{cache} = 2 \cdot L \cdot H \cdot d_h \cdot S \cdot B \cdot s\]

Plug in fixed parts:

\[\text{bytes} = 2 \cdot 2 \cdot 4 \cdot 16 \cdot S \cdot B \cdot s = 256 \cdot S \cdot B \cdot s\]

For fp32 (\(s = 4\)):

\[\text{bytes} = 1024 \cdot S \cdot B = 1\text{ KiB} \cdot S \cdot B\]

For fp16 / bf16 (\(s = 2\)):

\[\text{bytes} = 512 \cdot S \cdot B = 0.5\text{ KiB} \cdot S \cdot B\]

Per-token, per-sequence increment (with \(B = 1\), fp32): 1 KiB per generated token.

Concrete cases¶

Case A: single-sequence decode, full max-context¶

\(B = 1\), \(S = S_\text{max} = 128\), fp32:

\[\text{bytes} = 1024 \cdot 128 \cdot 1 = 131{,}072 \text{ bytes} = 128 \text{ KiB}\]

Trivial. Fits in L1 cache 32× over on the i5-8250U (which has 32 KiB L1d per core, so the cache spills to L2 — also fine at 256 KiB).

Case B: small batch decode, full context¶

\(B = 8\), \(S = 128\), fp32:

\[\text{bytes} = 1024 \cdot 128 \cdot 8 = 1{,}048{,}576 \text{ bytes} = 1 \text{ MiB}\]

Still trivial. Fits in i5-8250U L3 (6 MiB).

Case C: max-batch decode, fp16¶

If we pushed to \(B = 64\) (which exceeds what the §A13 corpus is meaningful at, but useful for the scaling exercise), \(S = 128\), fp16:

\[\text{bytes} = 512 \cdot 128 \cdot 64 = 4 \text{ MiB}\]

Still in L3.

Case D: comparison with model weights¶

Mini-GPT weight count (rough): \(V \cdot d_\text{model} + L \cdot (4 \cdot d_\text{model}^2 + 2 \cdot d_\text{model} \cdot d_\text{ff} + 2 \cdot d_\text{model}^2_\text{FFN})\) — approximated as ~50k weights, plus ~50k for embeddings and LM head. Call it ~100k weights × 4 bytes = 400 KiB.

So at \(B = 1, S = 128\) fp32, the KV cache (128 KiB) is ~32% the size of the weights. The KV cache is non-trivial as a fraction, but the absolute size is small.

This is the §A13 lesson: cache vs weight ratio at our scale is similar to GPT-3 at long context. The arithmetic scales.

Comparison with GPT-2-small¶

Sanity-check the methodology on a familiar number:

GPT-2 small: \(L = 12, H = 12, d_h = 64, S_\text{max} = 1024\), fp16, \(B = 1\):

\[\text{bytes} = 2 \cdot 12 \cdot 12 \cdot 64 \cdot 1024 \cdot 1 \cdot 2 = 37{,}748{,}736 \approx 36 \text{ MiB}\]

This matches the cited 36 MiB figure in the literature for GPT-2-small inference cache. The methodology checks out.

Per-token marginal cost: §A13 mini-GPT vs GPT-3¶

Model	Per-token cache cost (fp16)
§A13 mini-GPT	\(2 \cdot 2 \cdot 4 \cdot 16 \cdot 2 = 512\) bytes / token
GPT-2 small	\(2 \cdot 12 \cdot 12 \cdot 64 \cdot 2 = 36{,}864\) bytes / token = 36 KiB / token
GPT-3 175B	\(2 \cdot 96 \cdot 96 \cdot 128 \cdot 2 = 4{,}718{,}592\) bytes / token = 4.5 MiB / token

The per-token cost grows by 4 orders of magnitude across these models. At GPT-3 scale, generating 100 tokens uses 450 MiB of cache per sequence. At §A13 scale, the same generation uses 50 KiB. Same formula, same exponent structure, vastly different absolute scale.

Implications for prefill vs decode¶

In theory/01-prefill-vs-decode.md we saw that decode is memory-bound because the model loads its entire weights (\(\sim\) 400 KiB) to produce one new token, while the FLOPs needed are tiny (a few thousand). The bytes-per-token is \(\sim 400\) KiB regardless of \(S\), while the cache adds \(\sim 1\) KiB per generated token. At long \(S\), the cache moves the model's effective memory footprint up but doesn't change the byte-per-token transfer cost meaningfully — the weights dominate.

This is why at §A13 scale, the KV cache is a pedagogical lever, not a performance lever. We implement it correctly so the methodology is in your fingers when you encounter GPT-3-scale models in Phase 23+. The mini-GPT itself doesn't need the cache for speed.

The "off-by-one in position index" silent corruption¶

The most common KV-cache bug is forgetting that when generating token \(t+1\), the cache holds K, V for positions \(0\) through \(t\), so the new K, V slot goes at index \(t\) — not \(t+1\). With an off-by-one, the new K, V overwrite position \(t-1\)'s slot, and the attention at the next step queries a corrupted history. The result: decode loss looks fine for short sequences but degrades on long ones, because the corruption compounds.

Phase 22's break exercise (break/00-...) demonstrates this exactly.

Sanity table for memorization¶

For the §A13 mini-GPT in fp32, memorize these numbers:

Per-layer per-token K size: \(H \cdot d_h \cdot s = 4 \cdot 16 \cdot 4 = 256\) bytes.
Same for V. Total per-layer per-token: 512 bytes.
× 2 layers: 1 KiB per token per sequence. Memorize this — the "1 KiB / token" mnemonic.
For B = 8 and S = 128: \(1\text{ KiB} \cdot 128 \cdot 8 = 1\) MiB.

These three numbers are enough to estimate any cache configuration in your head.

Citation¶

Pope, R. et al. (2023). Efficiently Scaling Transformer Inference. https://arxiv.org/abs/2211.05102 — sections 2-3 derive the KV cache memory formula and analyze prefill / decode trade-offs at scale. Their formula matches ours (we just dropped their group-attention factor since vanilla mini-GPT is full multi-head).

One-paragraph recap¶

The Phase-17 mini-GPT's KV cache costs \(1024 \cdot S \cdot B\) bytes in fp32, or \(1\) KiB per (sequence-position, batch) cell. For \(B = 1, S = 128\) this is 128 KiB — about 32% of the weight size. The methodology generalizes: applying the same formula to GPT-3-175B gives 4.5 MiB per generated token per sequence. At §A13 scale, the cache is correct, not fast — its purpose is to give you the methodology and the bugs (off-by-one, dtype mistakes, batch-dim flips) before they cost you wall-clock at production scale.

Cross-refs: theory/02-memory-cost.md (the formula derivation), theory/03-decode-as-memory-bound.md (why cache helps less than you'd think on mini-GPT), Phase 17 README.md (the model spec being cached).