English · Español
03 — PagedAttention and Sliding Window¶
🇪🇸 Flash optimiza el kernel. Paged optimiza el almacenamiento del KV cache entre peticiones. Sliding-window cambia la complejidad de O(N²) a O(N·W). Tres optimizaciones independientes, todas argumentos de roofline desde perspectivas distintas.
PagedAttention: KV cache as virtual memory¶
The problem PagedAttention solves¶
During autoregressive generation, each step appends one new K, V row per layer per attention head. For batch=1 this is fine — KV grows linearly. For a server batching N requests of varying lengths:
- Pre-allocating
max_seq_len × n_layers × n_heads × dper slot wastes most of the memory (most requests are short). - Allocating dynamically and resizing causes fragmentation: even if total free memory is sufficient, no contiguous block of the required size exists.
Concrete numbers from the vLLM paper: a 13B-parameter model with max_seq_len=2048 on an A100 (80GB) can serve ~10 concurrent requests with naive pre-allocation. With PagedAttention, the same hardware serves ~30 concurrent requests. 3× throughput from memory management alone, no kernel changes.
The mechanism¶
vLLM treats the KV cache as virtual memory:
- Physical memory is divided into fixed-size pages, each holding
block_sizetokens worth of K (or V) for one layer one head. Defaultblock_size = 16. - Each sequence has a page table mapping its logical positions to physical pages.
- Attention kernels follow page table indirections: instead of
K[t, :](flat), they computeK[page_table[t / block_size]][t % block_size, :].
Properties this gives:
- No external fragmentation. All pages are the same size; the allocator never has to find a contiguous block.
- Copy-on-write for prefix caching. Two sequences sharing a prefix (e.g., the same system prompt) share physical pages; on divergence, only the diverging pages are copied.
- Cheap append. Adding a token allocates one page (if the current page is full) or writes to an existing page. No re-allocation of the whole KV tensor.
The cost¶
Page-table indirection per attention access. On GPUs, this is a non-coalesced memory access pattern — slower than a flat array. The vLLM kernel mitigates by:
- Loading the relevant pages into shared memory once per query block.
- Using
block_size = 16(large enough that the page-table lookups amortize over multiple memory accesses).
Real-world overhead: ~5% per attention call, dwarfed by the throughput gain from better utilization.
Composition with Flash¶
Flash Attention computes one head's output by tiling K, V. PagedAttention stores K, V in pages. Composition: the Flash kernel's "load a B_c × d tile of K" step becomes "look up the page indices, load the pages' rows".
vLLM's actual implementation has a custom kernel (paged_attention_v2) that fuses both ideas. We don't re-implement it in Phase 27 — we read it.
Sliding-window attention¶
The complexity claim¶
Standard attention is O(N²) in time and (for naive implementations) O(N²) in memory. For long contexts (N=32k, 100k, 1M), this is brutal even on modern hardware.
Sliding-window attention restricts each token to attend only to the last W tokens (typically W=4096 for Mistral 7B). Complexity drops to O(N·W) — linear in N, which is the only way long-context inference is tractable.
What changes mechanically¶
In the attention computation, the mask is no longer "causal" (lower triangle) but "causal within a band" (lower triangle clipped to width W). Concretely: position i can only attend to positions j ∈ [max(0, i-W+1), i].
In the Flash kernel: the inner loop over K, V tiles only needs to process tiles within the window. The outer loop is still over Q tiles. For Q_tile i at row i*B_r to (i+1)*B_r - 1, the valid K, V tiles span columns max(0, i*B_r - W) to (i+1)*B_r - 1. Skip the rest.
Composition with Flash¶
Trivially composable. The masking is computed at the tile level (one tile fully inside the window → no per-element mask; one tile straddling the boundary → mask just that boundary). In the kernel: an extra constexpr SLIDING_WINDOW: int parameter and a conditional on which tiles to enter.
Lab 02 (Triton kernel) has sliding window as a stretch goal — not in the DoD, but mentioned as a natural extension.
Prefix caching¶
A specific case where PagedAttention shines: many requests share the same system prompt or few-shot examples. The KV for those tokens is computed once, stored in pages, and reused across requests via the page table.
Mechanism: at request submission, hash the prefix tokens. Look up in a global cache (LRU). If hit, the request's page table starts with pointers to the cached pages; only the suffix tokens trigger new KV computation.
For a server with many short queries sharing a long system prompt, prefix caching can give 5–10× throughput gain on top of PagedAttention's 3×. We mention but don't implement.
Chunked prefill¶
The prefill phase (computing KV for the initial prompt) is the most compute-intensive phase of inference. Chunked prefill processes the prefill in chunks of C tokens at a time, interleaved with decoding steps from other requests in the batch.
The win: while one request is in the slow prefill phase, the GPU isn't idle for other requests' fast decode steps. Throughput rises significantly for mixed batches.
The cost: more complex scheduling; chunking too fine introduces overhead. vLLM tunes C automatically.
Reading exercise: vLLM's block_manager.py¶
Lab 03 asks Borja to read block_manager.py (the file that allocates and frees KV pages) and annotate it. Key questions to answer in the annotations:
- What data structures track free vs. allocated pages?
- How are page tables represented? Per-sequence list of physical block IDs?
- What happens when a sequence needs to grow but no free pages remain? (Hint: eviction.)
- How is copy-on-write triggered? (Hint: when two sequences sharing a page diverge.)
- What's the lifecycle of a page across one request's lifetime?
The reading is graded by the annotation quality, not by re-implementing the file.
Roofline interpretation¶
These three ideas — PagedAttention, sliding window, prefix caching — don't change a single kernel's roofline dot. They change how many requests' attention kernels can run on the same hardware at once (PagedAttention), how big N effectively is per kernel (sliding window), and how often a kernel runs at all (prefix caching).
In other words, they're roofline arguments at the system level, not the kernel level:
- PagedAttention: more requests per GPU → higher aggregate throughput.
- Sliding window: each kernel processes less work → lower per-kernel latency.
- Prefix caching: fewer kernel calls → fewer FLOPs total.
Each is a different "axis" of optimization. Flash + Paged + Sliding + Prefix Cache compose; vLLM does all four.
Drill problems¶
Solutions at phase open in solutions/03-paged-sliding-ref.md.
- A server runs LLaMA-13B with N=2048 max length. Page size = 16 tokens. KV per token (fp16, 40 layers, 40 heads, d=128) =
2 × 40 × 40 × 128 × 2 = 819 KiB. Page size in bytes? How many pages fit in 70 GiB? - Sliding window W=4096 on N=32k. Per-query attention complexity drops from
NtoW. By what factor does the total attention compute for the whole sequence drop? (Hint:N × NtoN × W.) - Prefix caching: 100 requests all share a 500-token system prompt. Without prefix caching, KV is recomputed 100 times. With prefix caching, once. What's the savings in attention prefill compute?
- PagedAttention's page-table lookup adds an extra memory access per attention element. Estimate the latency overhead in cycles assuming the page table fits in L1 cache. Argue whether it matters.
One-paragraph recap¶
PagedAttention treats the KV cache as virtual memory: fixed-size pages, per-sequence page tables, copy-on-write for shared prefixes. This eliminates fragmentation across requests, raising server throughput ~3× without changing any kernel. Sliding-window attention restricts the receptive field to W recent tokens, dropping per-step complexity from O(N) to O(W) — the only way million-token contexts are tractable. Prefix caching reuses KV across requests sharing a prompt. All three compose with Flash Attention at the kernel level; together they're what vLLM does. For Phase 27 we read PagedAttention's allocator in vLLM's block_manager.py rather than re-implement — the algorithmic content is small but the engineering surface is large. The next theory file looks at GQA, MQA, and MLA — orthogonal KV-reduction tricks.
Next: theory/04-gqa-mqa-mla.md.