English · Español
Phase 22 — KV Cache: From Math to Memory¶
Requires: 15 — Attention from Scratch · 21 — Inference Internals & Sampling Teaches:
kv-cache·prefill·decode·memory-bound·arithmetic-intensityJump to any chapter from the phase reference index.
Chapter map¶
Pre-written per §A12. This phase entry exists before Borja begins study. Theory and lab problem statements are stable drafts; solutions are written just-in-time at phase open.
Topic: English verb grammar per §A13. Cache demos use prompts like
"Yesterday I"extended to"Yesterday I worked"— the autoregressive decode the cache exists to accelerate.🇪🇸 Sin caché KV, cada token nuevo recalcula la atención sobre todos los anteriores: O(n³) total. Con caché, cada token mira una sola fila contra las claves ya guardadas: O(n²) total, a cambio de memoria que crece linealmente. Esta fase es donde la inferencia deja de ser una broma y empieza a ser un sistema.
Goal¶
Build a dense KV cache in NumPy that:
- Turns autoregressive generation of Phase 17's MiniGPT (verb-grammar-trained) from quadratic-per-token compute into linear-per-token compute.
- Produces byte-identical outputs to the no-cache path (every divergence is a bug, not a feature).
- Has a memory footprint Borja can derive on paper before measuring.
This is the second-most-important derivation phase after Phase 15 (attention itself). The theory pages are dense and worth slow reading.
Read order¶
theory/00-motivation.md— why a cache exists at all; the prefill/decode dichotomy via the worked example"Yesterday I worked".theory/01-prefill-vs-decode.md— the two phases of inference; FLOPs accounting on grammar-corpus prompts.theory/02-memory-cost.md— the size formula2 · L · H · d_h · S · B · s; per-token marginal cost; scaling to real models.theory/03-decode-as-memory-bound.md— arithmetic intensity of single-token attention; why decode is bandwidth-bound; preview of Flash-decoding (Phase 24) and PagedAttention (Phase 27).theory/04-toward-paged-attention.md— short preview of why dense pre-allocation breaks down under variable-length batched serving; problem statement only, solution deferred to Phase 27.lab/00-derive-cache-size.md— paper exercise: derive memory cost for grammar MiniGPT, Llama-2-7B, GPT-3-175B at three context lengths. No code.lab/01-implement-cache.md— implementsrc/minicache/cache.py. Failing tests live intests/test_minicache.py.lab/02-correctness-test.md— property test: with-cache equals without-cache, byte-identical, over 50 grammar prompts × 64 tokens; flagship slot-level dump of"Yesterday I"→"worked".lab/03-cost-curves.md— measure memory + latency vs context length; overlay theory curves; commitexperiments/22-kv-cost/.
solutions/ is empty during pre-write — populated at phase open after Borja's Phase-15 attention API is confirmed.
Definition of Done¶
See PHASE_22_PLAN.md §6. Briefly:
src/minicache/{cache,layout}.pyimplemented; mypy-strict clean; pytest green.experiments/22-kv-cost/shows memory and latency curves within 10% of theoretical predictions.experiments/22-cache-correctness/confirms byte-identical generation with vs without cache over 50 × 64 tokens.experiments/22-yesterday-worked/shows the K, V slot for"I"matches between cached decode and full recompute.- Borja can derive the cache-size formula on a whiteboard from
(L, H, d_h, S, B, s)without looking, and project it to Llama-2-7B and GPT-3.
What this phase intentionally does NOT cover¶
- GPU KV cache. Phase 22 stays CPU + NumPy. The cache moves to GPU memory in Phase 24, and the layout changes (HBM vs L2 vs SMEM) — but the math derived here carries over verbatim. The hardware-specific layout is Phase 24's job.
- PagedAttention implementation. Phase 22 previews the problem (fragmentation under variable-length batching). The implementation is Phase 27.
- Continuous batching. Phase 33 (serving infrastructure). Phase 22 is single-stream.
- Speculative decoding. Phase 36 — requires multiple model heads or a draft model.
- Cache quantization. Phase 26 keeps the cache in fp32 (matching the rest of MiniGPT) and only computes what int8 / fp16 would save.
- Sliding-window or LRU eviction. Phase 27 territory. Phase 22's cache is unbounded — pre-allocated to
S_maxand that's it. - PyTorch. First imported in Phase 24. Phase 22 is NumPy only, per
CLAUDE.md§0.4.
Phase 22's scope is: one model, one stream, one cache, exact math, no eviction. Everything else is later.
Next phase preview: docs/phase-23-gpu-fundamentals/ — the cloud-GPU pivot. The mental model built here (memory-bound decode, cache as a linear-growth artifact) carries over; only the hardware changes.
Further reading¶
Optional — enrichment, not required to pass the phase.
- 📄 Efficiently Scaling Transformer Inference — Pope et al. · 2022. the arithmetic of KV cache and decode cost.