English · Español
04 — Toward PagedAttention (Preview)¶
🇪🇸 La caché densa con pre-asignación a
S_maxfunciona perfectamente para una sola secuencia, pero se rompe cuando hay muchos usuarios concurrentes con prompts de longitud distinta y duración impredecible. El problema es fragmentación de memoria, no rendimiento. PagedAttention (Fase 27) lo arregla copiando la idea de paginación del sistema operativo.
This page exists to plant a problem statement, not to solve it. Phase 22 implements the dense cache; Phase 27 implements PagedAttention. Knowing why the dense cache breaks down is what makes Phase 27 motivated rather than arbitrary.
What works in Phase 22¶
A single sequence. Pre-allocate \((B, H, S_\text{max}, d_h)\) tensors per layer. Append into slices. Done.
No fragmentation. No copies. No reallocation. Memory cost: \(S_\text{max}\) rows even if we only used \(S_\text{current}\) — but at Phase 22's scale, that overhead is negligible. The grammar MiniGPT decoding "Yesterday I worked" wastes ~60 unused row-slots × 64 bytes × 4 layers ≈ 16 KiB. Below noise.
This is fine for:
- One user at a time.
- Predictable max context.
- Long-running batch jobs (where wasted bytes are amortized).
- Toy experiments.
What breaks at production scale¶
A real serving system handles many concurrent sequences of unpredictable length, arriving and departing at unpredictable times. Three things go wrong with the dense pre-allocated approach:
Problem 1: Memory waste per under-used sequence¶
Imagine \(B = 32\) concurrent users, each with their own cache pre-allocated to S_max = 8192. Total cache budget: \(32 \cdot S_\text{max}\) rows per layer. If the average sequence is only 512 tokens long, 15/16 of the cache is wasted. On a 40 GB GPU with 26 GB available for cache, that's 24 GB of wasted memory you could have used for more concurrent sequences.
You could try to "right-size" each user's \(S_\text{max}\) at session start — but you don't know the eventual length. The user might generate 10 tokens or 10000. Predicting is a separate ML problem you don't want.
Problem 2: Defragmentation when sequences leave¶
User 7 finishes after generating 437 tokens. You free their cache slot. Now you have a hole the size of \(S_\text{max}\) rows in the middle of your dense allocation. If user 33 arrives wanting S_max = 8192, you can fill the hole. If user 33 wants S_max = 4096, you have a 4096-row hole left over that's smaller than any other request will want.
This is external fragmentation. The classical OS solution is virtual memory + paging: split memory into fixed-size pages and let an indirection table map "logical sequence positions" to "physical page locations". Pages can be freed and reused without leaving holes — there's only ever internal fragmentation (slack at the end of the last page of each sequence), which is bounded by page-size minus 1 per sequence.
Problem 3: Variable-length batching breaks SIMD¶
Suppose users A, B, C are concurrently decoding with current cache lengths \(S_A = 200\), \(S_B = 1500\), \(S_C = 60\). The GPU wants to do a batched attention: one giant kernel that processes A, B, C's tokens in parallel.
With a dense cache, the kernel either: - Pads to max length (1500): waste a factor of 1500/200 for A, 1500/60 for C. Total waste 7.6×. - Uses ragged batching: requires extra indexing logic in the kernel and breaks coalesced memory access.
Either way, batching efficiency drops sharply as the variance in sequence lengths grows. Modern serving workloads have enormous length variance (think 5-token completions vs 10000-token generations in the same batch).
The PagedAttention idea (one paragraph, Phase 27 derives it)¶
Borrow paging from OS virtual memory:
- Carve the KV cache into fixed-size blocks of \(B_\text{block}\) tokens each (typical \(B_\text{block} = 16\)).
- Maintain, per sequence, a block table: a list of physical block indices that map "the i-th block of this sequence's logical sequence" to "physical block X in GPU memory".
- To extend a sequence by one token: if the current last block has room, write into it. Otherwise, allocate a fresh physical block and update the block table.
- To free a sequence: free its physical blocks (they go back to a free list).
- The attention kernel takes the block table as an argument; it does the gathers internally.
Properties:
- No external fragmentation. All blocks are the same size; free blocks are interchangeable.
- Internal fragmentation bounded. Each sequence wastes at most \(B_\text{block} - 1\) rows in its last block. With \(B_\text{block} = 16\), that's at most 15 rows per sequence — negligible compared to dense's \(S_\text{max} - S_\text{current}\).
- Variable-length batching becomes natural. Each sequence's blocks are independent; the kernel just walks each sequence's block table.
- Prefix sharing for free. Two sequences with the same prompt can share their early blocks via reference counting — major win for batched eval and beam search.
The kernel-level details (how does a CUDA block index into a gather-list of pointers efficiently? what's the memory-coalescing story?) are Phase 27 and 24 material.
What you should take from Phase 22 about Phase 27¶
When you read "PagedAttention" later:
- You should not be surprised it exists. Dense cache + variable-length serving = fragmentation. The fix is paging. This is mechanical.
- You should understand that its FLOPs and arithmetic intensity are identical to the dense version — paging is a memory layout optimization, not a compute optimization. It enables batching, which raises intensity via the FFN amortization derived in
03-decode-as-memory-bound.md. - You should expect a constant-factor overhead from the block-table indirection (extra gather per kernel) traded against a 2–4× throughput win from higher achievable batch sizes. The vLLM paper reports this exact tradeoff.
What Phase 22 builds toward Phase 27¶
Concretely:
- The
KVCacheclass insrc/minicache/cache.pyhas acursorand aS_maxbound. Phase 27 will subclass this intoPagedKVCachewith ablock_tableinstead of a cursor. The base class's API surface (append,read,reset) should be designed so the subclass is a drop-in replacement. - The attention API change in Phase 15 (taking an optional
cachearg) keeps PagedAttention as a future API surface, not a fork.
This is exactly the kind of forward-pointing API decision the BLUEPRINT.md flags as an open question — see src/minicache/BLUEPRINT.md §Open Questions.
What this page does NOT cover¶
- The actual PagedAttention algorithm or kernel. Phase 27.
- Continuous batching scheduling. Phase 33.
- Prefix sharing reference-counting details. Phase 27 / 36.
Forward references¶
Concepts named here, defined elsewhere:
- PagedAttention math → docs/phase-27-modern-attention/theory/
- CUDA gather kernels → docs/phase-24-cuda-triton/theory/
- Continuous batching → docs/phase-33-inference-serving/
- Prefix sharing / beam search → docs/phase-36-frontier-architectures/
Next: done with theory. Move to lab/00-derive-cache-size.md — a paper exercise to confirm the formula has landed before any code is written.