English · Español

Phase 22 — KV Cache: From Math to Memory¶

Requires: 15 — Attention from Scratch · 21 — Inference Internals & Sampling Teaches: kv-cache · prefill · decode · memory-bound · arithmetic-intensity Jump to any chapter from the phase reference index.

Chapter map¶

Pre-written per §A12. This phase entry exists before Borja begins study. Theory and lab problem statements are stable drafts; solutions are written just-in-time at phase open.

Topic: English verb grammar per §A13. Cache demos use prompts like "Yesterday I" extended to "Yesterday I worked" — the autoregressive decode the cache exists to accelerate.

🇪🇸 Sin caché KV, cada token nuevo recalcula la atención sobre todos los anteriores: O(n³) total. Con caché, cada token mira una sola fila contra las claves ya guardadas: O(n²) total, a cambio de memoria que crece linealmente. Esta fase es donde la inferencia deja de ser una broma y empieza a ser un sistema.

Goal¶

Build a dense KV cache in NumPy that:

Turns autoregressive generation of Phase 17's MiniGPT (verb-grammar-trained) from quadratic-per-token compute into linear-per-token compute.
Produces byte-identical outputs to the no-cache path (every divergence is a bug, not a feature).
Has a memory footprint Borja can derive on paper before measuring.

This is the second-most-important derivation phase after Phase 15 (attention itself). The theory pages are dense and worth slow reading.

Read order¶

theory/00-motivation.md — why a cache exists at all; the prefill/decode dichotomy via the worked example "Yesterday I worked".
theory/01-prefill-vs-decode.md — the two phases of inference; FLOPs accounting on grammar-corpus prompts.
theory/02-memory-cost.md — the size formula 2 · L · H · d_h · S · B · s; per-token marginal cost; scaling to real models.
theory/03-decode-as-memory-bound.md — arithmetic intensity of single-token attention; why decode is bandwidth-bound; preview of Flash-decoding (Phase 24) and PagedAttention (Phase 27).
theory/04-toward-paged-attention.md — short preview of why dense pre-allocation breaks down under variable-length batched serving; problem statement only, solution deferred to Phase 27.
lab/00-derive-cache-size.md — paper exercise: derive memory cost for grammar MiniGPT, Llama-2-7B, GPT-3-175B at three context lengths. No code.
lab/01-implement-cache.md — implement src/minicache/cache.py. Failing tests live in tests/test_minicache.py.
lab/02-correctness-test.md — property test: with-cache equals without-cache, byte-identical, over 50 grammar prompts × 64 tokens; flagship slot-level dump of "Yesterday I" → "worked".
lab/03-cost-curves.md — measure memory + latency vs context length; overlay theory curves; commit experiments/22-kv-cost/.

solutions/ is empty during pre-write — populated at phase open after Borja's Phase-15 attention API is confirmed.

Definition of Done¶

See PHASE_22_PLAN.md §6. Briefly:

src/minicache/{cache,layout}.py implemented; mypy-strict clean; pytest green.
experiments/22-kv-cost/ shows memory and latency curves within 10% of theoretical predictions.
experiments/22-cache-correctness/ confirms byte-identical generation with vs without cache over 50 × 64 tokens.
experiments/22-yesterday-worked/ shows the K, V slot for "I" matches between cached decode and full recompute.
Borja can derive the cache-size formula on a whiteboard from (L, H, d_h, S, B, s) without looking, and project it to Llama-2-7B and GPT-3.

What this phase intentionally does NOT cover¶

GPU KV cache. Phase 22 stays CPU + NumPy. The cache moves to GPU memory in Phase 24, and the layout changes (HBM vs L2 vs SMEM) — but the math derived here carries over verbatim. The hardware-specific layout is Phase 24's job.
PagedAttention implementation. Phase 22 previews the problem (fragmentation under variable-length batching). The implementation is Phase 27.
Continuous batching. Phase 33 (serving infrastructure). Phase 22 is single-stream.
Speculative decoding. Phase 36 — requires multiple model heads or a draft model.
Cache quantization. Phase 26 keeps the cache in fp32 (matching the rest of MiniGPT) and only computes what int8 / fp16 would save.
Sliding-window or LRU eviction. Phase 27 territory. Phase 22's cache is unbounded — pre-allocated to S_max and that's it.
PyTorch. First imported in Phase 24. Phase 22 is NumPy only, per CLAUDE.md §0.4.

Phase 22's scope is: one model, one stream, one cache, exact math, no eviction. Everything else is later.

Next phase preview: docs/phase-23-gpu-fundamentals/ — the cloud-GPU pivot. The mental model built here (memory-bound decode, cache as a linear-growth artifact) carries over; only the hardware changes.