English · Español

03 — PagedAttention and Sliding Window¶

🇪🇸 Flash optimiza el kernel. Paged optimiza el almacenamiento del KV cache entre peticiones. Sliding-window cambia la complejidad de O(N²) a O(N·W). Tres optimizaciones independientes, todas argumentos de roofline desde perspectivas distintas.

PagedAttention: KV cache as virtual memory¶

The problem PagedAttention solves¶

During autoregressive generation, each step appends one new K, V row per layer per attention head. For batch=1 this is fine — KV grows linearly. For a server batching N requests of varying lengths:

Pre-allocating max_seq_len × n_layers × n_heads × d per slot wastes most of the memory (most requests are short).
Allocating dynamically and resizing causes fragmentation: even if total free memory is sufficient, no contiguous block of the required size exists.

Concrete numbers from the vLLM paper: a 13B-parameter model with max_seq_len=2048 on an A100 (80GB) can serve ~10 concurrent requests with naive pre-allocation. With PagedAttention, the same hardware serves ~30 concurrent requests. 3× throughput from memory management alone, no kernel changes.

The mechanism¶

vLLM treats the KV cache as virtual memory:

Physical memory is divided into fixed-size pages, each holding block_size tokens worth of K (or V) for one layer one head. Default block_size = 16.
Each sequence has a page table mapping its logical positions to physical pages.
Attention kernels follow page table indirections: instead of K[t, :] (flat), they compute K[page_table[t / block_size]][t % block_size, :].

Properties this gives:

No external fragmentation. All pages are the same size; the allocator never has to find a contiguous block.
Copy-on-write for prefix caching. Two sequences sharing a prefix (e.g., the same system prompt) share physical pages; on divergence, only the diverging pages are copied.
Cheap append. Adding a token allocates one page (if the current page is full) or writes to an existing page. No re-allocation of the whole KV tensor.

The cost¶

Page-table indirection per attention access. On GPUs, this is a non-coalesced memory access pattern — slower than a flat array. The vLLM kernel mitigates by:

Loading the relevant pages into shared memory once per query block.
Using block_size = 16 (large enough that the page-table lookups amortize over multiple memory accesses).

Real-world overhead: ~5% per attention call, dwarfed by the throughput gain from better utilization.

Composition with Flash¶

Flash Attention computes one head's output by tiling K, V. PagedAttention stores K, V in pages. Composition: the Flash kernel's "load a B_c × d tile of K" step becomes "look up the page indices, load the pages' rows".

vLLM's actual implementation has a custom kernel (paged_attention_v2) that fuses both ideas. We don't re-implement it in Phase 27 — we read it.

Sliding-window attention¶

The complexity claim¶

Standard attention is O(N²) in time and (for naive implementations) O(N²) in memory. For long contexts (N=32k, 100k, 1M), this is brutal even on modern hardware.

Sliding-window attention restricts each token to attend only to the last W tokens (typically W=4096 for Mistral 7B). Complexity drops to O(N·W) — linear in N, which is the only way long-context inference is tractable.

What changes mechanically¶

In the attention computation, the mask is no longer "causal" (lower triangle) but "causal within a band" (lower triangle clipped to width W). Concretely: position i can only attend to positions j ∈ [max(0, i-W+1), i].

In the Flash kernel: the inner loop over K, V tiles only needs to process tiles within the window. The outer loop is still over Q tiles. For Q_tile i at row i*B_r to (i+1)*B_r - 1, the valid K, V tiles span columns max(0, i*B_r - W) to (i+1)*B_r - 1. Skip the rest.

Composition with Flash¶

Trivially composable. The masking is computed at the tile level (one tile fully inside the window → no per-element mask; one tile straddling the boundary → mask just that boundary). In the kernel: an extra constexpr SLIDING_WINDOW: int parameter and a conditional on which tiles to enter.

Lab 02 (Triton kernel) has sliding window as a stretch goal — not in the DoD, but mentioned as a natural extension.

Prefix caching¶

A specific case where PagedAttention shines: many requests share the same system prompt or few-shot examples. The KV for those tokens is computed once, stored in pages, and reused across requests via the page table.

Mechanism: at request submission, hash the prefix tokens. Look up in a global cache (LRU). If hit, the request's page table starts with pointers to the cached pages; only the suffix tokens trigger new KV computation.

For a server with many short queries sharing a long system prompt, prefix caching can give 5–10× throughput gain on top of PagedAttention's 3×. We mention but don't implement.

Chunked prefill¶

The prefill phase (computing KV for the initial prompt) is the most compute-intensive phase of inference. Chunked prefill processes the prefill in chunks of C tokens at a time, interleaved with decoding steps from other requests in the batch.

The win: while one request is in the slow prefill phase, the GPU isn't idle for other requests' fast decode steps. Throughput rises significantly for mixed batches.

The cost: more complex scheduling; chunking too fine introduces overhead. vLLM tunes C automatically.

Reading exercise: vLLM's `block_manager.py`¶

Lab 03 asks Borja to read block_manager.py (the file that allocates and frees KV pages) and annotate it. Key questions to answer in the annotations:

What data structures track free vs. allocated pages?
How are page tables represented? Per-sequence list of physical block IDs?
What happens when a sequence needs to grow but no free pages remain? (Hint: eviction.)
How is copy-on-write triggered? (Hint: when two sequences sharing a page diverge.)
What's the lifecycle of a page across one request's lifetime?

The reading is graded by the annotation quality, not by re-implementing the file.

Roofline interpretation¶

These three ideas — PagedAttention, sliding window, prefix caching — don't change a single kernel's roofline dot. They change how many requests' attention kernels can run on the same hardware at once (PagedAttention), how big N effectively is per kernel (sliding window), and how often a kernel runs at all (prefix caching).

In other words, they're roofline arguments at the system level, not the kernel level:

PagedAttention: more requests per GPU → higher aggregate throughput.
Sliding window: each kernel processes less work → lower per-kernel latency.
Prefix caching: fewer kernel calls → fewer FLOPs total.

Each is a different "axis" of optimization. Flash + Paged + Sliding + Prefix Cache compose; vLLM does all four.

Drill problems¶

Solutions at phase open in solutions/03-paged-sliding-ref.md.

A server runs LLaMA-13B with N=2048 max length. Page size = 16 tokens. KV per token (fp16, 40 layers, 40 heads, d=128) = 2 × 40 × 40 × 128 × 2 = 819 KiB. Page size in bytes? How many pages fit in 70 GiB?
Sliding window W=4096 on N=32k. Per-query attention complexity drops from N to W. By what factor does the total attention compute for the whole sequence drop? (Hint: N × N to N × W.)
Prefix caching: 100 requests all share a 500-token system prompt. Without prefix caching, KV is recomputed 100 times. With prefix caching, once. What's the savings in attention prefill compute?
PagedAttention's page-table lookup adds an extra memory access per attention element. Estimate the latency overhead in cycles assuming the page table fits in L1 cache. Argue whether it matters.

One-paragraph recap¶

PagedAttention treats the KV cache as virtual memory: fixed-size pages, per-sequence page tables, copy-on-write for shared prefixes. This eliminates fragmentation across requests, raising server throughput ~3× without changing any kernel. Sliding-window attention restricts the receptive field to W recent tokens, dropping per-step complexity from O(N) to O(W) — the only way million-token contexts are tractable. Prefix caching reuses KV across requests sharing a prompt. All three compose with Flash Attention at the kernel level; together they're what vLLM does. For Phase 27 we read PagedAttention's allocator in vLLM's block_manager.py rather than re-implement — the algorithmic content is small but the engineering surface is large. The next theory file looks at GQA, MQA, and MLA — orthogonal KV-reduction tricks.

Next: theory/04-gqa-mqa-mla.md.

03 — PagedAttention and Sliding Window¶

PagedAttention: KV cache as virtual memory¶

The problem PagedAttention solves¶

The mechanism¶

The cost¶

Composition with Flash¶

Sliding-window attention¶

The complexity claim¶

What changes mechanically¶

Composition with Flash¶

Prefix caching¶

Chunked prefill¶

Reading exercise: vLLM's block_manager.py¶

Roofline interpretation¶

Drill problems¶

One-paragraph recap¶

Reading exercise: vLLM's `block_manager.py`¶