Phase 27 — Modern Attention Optimizations
Requires: 15 — Attention from Scratch · 22 — KV Cache: From Math to Memory · 26 — Quantization Deep Dive
Teaches: flash-attention · online-softmax · paged-attention · gqa · mqa
Jump to any chapter from the phase reference index.
Chapter map
Pre-written per A12. Theory and lab problem statements are stable drafts; solutions are written just-in-time at phase open.
Modern attention: Flash, Paged, GQA/MQA, sliding window. The unifying idea is the Phase 1 roofline: all of them optimize the byte count, not the FLOP count.
Goal
Re-derive every "modern attention" optimization as a roofline manipulation: same FLOPs, fewer bytes moved, higher arithmetic intensity. This is the single conceptual lens for the whole phase. By the end Borja can predict — from a one-line description of a new attention variant — where its dot will land on his Phase 1 roofline plot.
The phase produces one custom Triton kernel (Flash Attention forward), one set of MQA/GQA variants, and one annotated reading of vLLM's KV-block allocator. Everything else is derivation and measurement.
Topic context (§A13): all measurements anchor at the verb corpus's 64-token sequence length; FlashAttention's win is invisible at that scale (the whole working set fits in L2), so we also scale Q/K/V tensors to N=2048 synthetically to make the win visible. The two anchor points let Borja see both "small-model reality" and "production-scale stress" on the same roofline.
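To make the two anchor points concrete, the sketch below estimates arithmetic intensity (FLOPs per byte) for one attention head under a deliberately simplified traffic model. It assumes FP16 (2 bytes per element) and a head dimension of 64, which is an illustrative value and not taken from the MiniGPT config. It ignores softmax FLOPs and treats the fused kernel as reading Q, K, V and writing O exactly once. Real Flash kernels re-read K/V tiles, and at N=64 the caches hide most of the naive traffic, so the numbers show the shape of the argument rather than a prediction.

```python
def attention_traffic(n, d, bytes_per_elem=2):
    """Return (flops, naive_bytes, fused_bytes) for one head, forward only.

    Simplified model (assumption): no caching, softmax FLOPs ignored.
    """
    # 2*n*n*d FLOPs for Q @ K^T plus 2*n*n*d FLOPs for P @ V.
    flops = 4 * n * n * d
    # Read Q, K, V and write O once: four n-by-d tensors.
    io_qkvo = 4 * n * d * bytes_per_elem
    # Naive: scores S are written then read by softmax,
    # and probabilities P are written then read by P @ V.
    naive_bytes = io_qkvo + 4 * n * n * bytes_per_elem
    # Idealized fused kernel: the n-by-n intermediates never leave on-chip memory.
    fused_bytes = io_qkvo
    return flops, naive_bytes, fused_bytes


if __name__ == "__main__":
    for n in (64, 2048):
        flops, naive_b, fused_b = attention_traffic(n, d=64)
        print(f"N={n}: naive {flops / naive_b:.1f} FLOP/B, fused {flops / fused_b:.1f} FLOP/B")
```

Under this model the naive intensity barely moves with N (16 FLOP/B at N=64, about 31 at N=2048) because the N×N score traffic grows as fast as the FLOPs, while the fused estimate grows linearly with N (from 32 to 1024). That is the horizontal shift the roofline overlay is meant to show.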
Module placement: Phase 27 extends src/minimodel/ (canonical layout §2) rather than introducing a new top-level module. Attention is a minimodel concern; optimizing attention belongs there.
Read order
1. theory/00-motivation.md: why attention dominates inference time; the roofline argument.
2. theory/01-online-softmax.md: the algebraic identity that makes Flash possible. Read until you can re-derive it from scratch.
3. theory/02-flash-attention.md: the centrepiece. Derive Flash as a tiling strategy on top of online softmax. Compute the byte-count delta vs naive. The most important theory page in this phase.
4. theory/03-paged-and-sliding.md: PagedAttention's KV-cache paging; sliding-window attention; how they compose with Flash.
5. theory/04-gqa-mqa-mla.md: grouped-query, multi-query, and multi-head latent attention. Three independent KV-reduction tricks.
6. lab/00-online-softmax.md: implement online softmax in pure Python; verify it matches batched softmax.
7. lab/01-flash-bytes.md: derive and measure the bytes-moved delta, symbolically and empirically.
8. lab/02-flash-triton.md: implement Flash Attention forward in Triton.
9. lab/03-paged-attn-reading.md: annotated read of vLLM's block_manager.py.
10. lab/04-mqa-gqa.md: implement MQA/GQA variants of MiniGPT attention; measure KV-cache reduction.
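As a preview of the identity that the online-softmax theory page and lab build on, here is a minimal pure-Python sketch. It is not the lab solution, and the function names are illustrative. It assumes finite inputs and a non-empty list. The single pass keeps a running maximum `m` and a running sum `l` of `exp(x - m)`, rescaling `l` whenever the maximum changes. The `merge_stats` helper shows why this matters for tiling: statistics from two blocks combine without revisiting either block.

```python
import math


def softmax_two_pass(xs):
    """Reference softmax: one pass for the max, one for the sum."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def online_softmax_stats(xs):
    """Single pass over xs: return (running max m, running sum l of exp(x - m))."""
    m = float("-inf")
    l = 0.0
    for x in xs:
        m_new = max(m, x)
        # Rescale the old sum to the new max, then add the new term.
        l = l * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return m, l


def merge_stats(a, b):
    """Combine (m, l) statistics computed independently on two blocks."""
    (m_a, l_a), (m_b, l_b) = a, b
    m = max(m_a, m_b)
    return m, l_a * math.exp(m_a - m) + l_b * math.exp(m_b - m)


def online_softmax(xs):
    """Softmax using the single-pass statistics."""
    m, l = online_softmax_stats(xs)
    return [math.exp(x - m) / l for x in xs]


if __name__ == "__main__":
    scores = [0.5, -1.2, 3.0, 2.9, -0.3, 7.1]
    ref = softmax_two_pass(scores)
    out = online_softmax(scores)
    print(all(math.isclose(r, o, rel_tol=1e-12) for r, o in zip(ref, out)))
```

A tiled attention kernel applies the same rescaling to the partial output accumulator as well as to `l`, which is the step the Flash derivation adds on top of this identity.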
solutions/ is empty during pre-write — populated at phase open after Phase 24's Triton conventions are visible.
Definition of Done
See PHASE_27_PLAN.md §6. Briefly:
- Triton Flash forward kernel matches PyTorch reference attention to 1e-3 at FP16.
- Roofline overlay with naive/Flash/MQA/Paged dots committed at both N=64 (verb sequence) and N=2048 (stress).
- KV-cache size reduction measured for MHA vs GQA vs MQA on the verb sequence.
- Annotated read of vLLM's block manager committed.
- src/minimodel/attention_flash.py (Triton Flash forward) implemented (Borja).
- src/minimodel/attention_mqa_gqa.py (MQA / GQA variants) implemented (Borja).
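For the KV-cache measurement, a back-of-the-envelope estimate is useful as a sanity check on the measured numbers. The sketch below assumes the cache stores one K and one V tensor per layer, each of shape (batch, n_kv_heads, seq_len, head_dim), in FP16. The model configuration (4 layers, 8 query heads, head dimension 32) is hypothetical and is not the MiniGPT config; only the 64-token sequence length comes from this phase.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Estimated KV-cache size in bytes under the layout assumed above."""
    # Factor 2: one K tensor and one V tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


if __name__ == "__main__":
    n_q_heads = 8  # hypothetical; the number of query heads is the same in all three variants
    variants = [("MHA", 8), ("GQA", 2), ("MQA", 1)]  # (name, number of KV heads)
    for name, n_kv in variants:
        size = kv_cache_bytes(n_layers=4, n_kv_heads=n_kv, head_dim=32, seq_len=64)
        print(f"{name}: {size} bytes, {n_q_heads // n_kv} query head(s) per KV head")
```

With these assumed values the estimate is 262144 bytes for MHA, 65536 for GQA with 2 KV heads, and 32768 for MQA: the reduction factor equals the number of query heads sharing each KV head.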
What this phase intentionally does NOT cover
- Flash backward. Backward needs recomputation; substantially harder and out of scope. Defer to a future phase (likely 33).
- vLLM as a deployable service. Phase 31 covers inference engines; this phase reads vLLM as a reference for the paging idea, not as a thing to run.
- Distributed attention (ring, sequence parallelism). Phase 35.
- Mamba / SSMs / linear-attention variants. Not "modern attention" in the §4 sense; potentially a side-quest at Phase 38+.
- Attention-free architectures. Out of scope.
- Multi-head Latent Attention (MLA, DeepSeek). Covered conceptually in theory 04; not implemented.
Phase 27's scope is to understand why modern attention is fast, implement Flash forward, and read PagedAttention. Nothing more.
Further reading
Optional — enrichment, not required to pass the phase.
- 📄 FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Dao et al., 2022. The tiled-softmax roofline win you derive.
- 📄 Efficient Memory Management for LLM Serving with PagedAttention. Kwon et al., 2023. The KV-cache paging behind vLLM.
- 📄 GQA: Training Generalized Multi-Query Transformer Models. Ainslie et al., 2023. The KV-head sharing modern models ship.