Phase 27 — Modern Attention Optimizations

Requires: 15 — Attention from Scratch · 22 — KV Cache: From Math to Memory · 26 — Quantization Deep Dive

Teaches: flash-attention · online-softmax · paged-attention · gqa · mqa

Jump to any chapter from the phase reference index.

Chapter map

Pre-written per A12. Theory and lab problem statements are stable drafts; solutions are written just-in-time at phase open.

Modern attention: Flash, Paged, GQA/MQA, sliding window. The unifying thread is the Phase 1 roofline: every one of them optimizes the byte count, not the FLOP count.

Goal

Re-derive every "modern attention" optimization as a roofline manipulation: same FLOPs, fewer bytes moved, higher arithmetic intensity. This is the single conceptual lens for the whole phase. By the end Borja can predict — from a one-line description of a new attention variant — where its dot will land on his Phase 1 roofline plot.
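As a rough illustration of the "same FLOPs, fewer bytes" lens, the sketch below estimates arithmetic intensity for one attention head with and without materializing the N×N score matrix. It is a back-of-envelope model, not a measurement: the function name `attention_intensity`, the head dimension `d=64`, and the traffic accounting (Q, K, V read once, O written once, softmax FLOPs ignored, no tile re-reads) are all assumptions made for illustration.

```python
def attention_intensity(n, d, bytes_per_el=2, materialize_scores=True):
    """Return (flops, bytes_moved, flops_per_byte) for one attention head.

    Idealised model: assumes Q, K, V are each read once and O written once.
    """
    # Two matmuls, Q @ K^T and P @ V, each cost 2*n*n*d FLOPs.
    flops = 4 * n * n * d
    # Q, K, V reads plus the O write: 4 * n * d elements.
    elements = 4 * n * d
    if materialize_scores:
        # Naive path: write S, read S, write P, read P (n*n elements each).
        elements += 4 * n * n
    bytes_moved = elements * bytes_per_el
    return flops, bytes_moved, flops / bytes_moved


for n in (64, 2048):
    naive = attention_intensity(n, d=64, materialize_scores=True)
    fused = attention_intensity(n, d=64, materialize_scores=False)
    print(f"N={n:5d}  naive {naive[2]:7.1f} FLOP/B   fused {fused[2]:7.1f} FLOP/B")
```

Under these assumptions the FLOP count is identical in both paths and only the denominator changes, which is why the dot moves right on the roofline rather than up.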

The phase produces one custom Triton kernel (Flash Attention forward), one set of MQA/GQA variants, and one annotated reading of vLLM's KV-block allocator. Everything else is derivation and measurement.

Topic context (§A13): all measurements anchor at the verb corpus's 64-token sequence length; FlashAttention's win is invisible at that scale (the whole working set fits in L2), so we also scale Q/K/V tensors to N=2048 synthetically to make the win visible. The two anchor points let Borja see both "small-model reality" and "production-scale stress" on the same roofline.
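To make the two anchor points concrete, the following sketch sizes the per-head working set at both sequence lengths. The head dimension of 64, the FP16 element size, and especially `ASSUMED_L2_BYTES` are placeholders chosen for illustration, not values taken from the verb-corpus model or from any specific GPU.

```python
def attention_working_set_bytes(n, d, bytes_per_el=2):
    """Return (qkvo_bytes, score_bytes) for a single attention head."""
    qkvo = 4 * n * d * bytes_per_el   # Q, K, V and the output O
    scores = n * n * bytes_per_el     # the full N x N score matrix
    return qkvo, scores


ASSUMED_L2_BYTES = 4 * 1024 * 1024  # hypothetical L2 capacity, not measured

for n in (64, 2048):
    qkvo, scores = attention_working_set_bytes(n, d=64)
    total = qkvo + scores
    fits = total <= ASSUMED_L2_BYTES
    print(f"N={n:5d}  Q/K/V/O {qkvo / 1024:8.0f} KiB  "
          f"scores {scores / 1024:8.0f} KiB  fits in assumed L2: {fits}")
```

With these numbers the N=64 case totals 40 KiB per head, while at N=2048 the score matrix alone is 8 MiB, so the naive path has to spill to main memory.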

Module placement: Phase 27 extends src/minimodel/ (canonical layout §2) rather than introducing a new top-level module. Attention is a minimodel concern; optimizing attention belongs there.

Read order

theory/00-motivation.md — why attention dominates inference time; the roofline argument.
theory/01-online-softmax.md — the algebraic identity that makes Flash possible. Read until you can re-derive it from scratch.
theory/02-flash-attention.md — the centrepiece. Derive Flash as a tiling strategy on top of online softmax. Compute the byte-count delta vs naive. The most important theory page in this phase.
theory/03-paged-and-sliding.md — PagedAttention's KV-cache paging; sliding-window attention; how they compose with Flash.
theory/04-gqa-mqa-mla.md — grouped-query, multi-query, and multi-head latent attention. Three KV-reduction tricks; MQA is the single-KV-head limit of GQA.
lab/00-online-softmax.md — implement online softmax in pure Python; verify it matches batched softmax.
lab/01-flash-bytes.md — derive and measure the bytes-moved delta. Symbolic + empirical.
lab/02-flash-triton.md — implement Flash Attention forward in Triton.
lab/03-paged-attn-reading.md — annotated read of vLLM's block_manager.py.
lab/04-mqa-gqa.md — implement MQA/GQA variants of MiniGPT attention; measure KV-cache reduction.
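The online-softmax identity that theory/01 derives can be sketched in a few lines of pure Python. This is a minimal illustration that assumes finite scalar inputs; it is not the lab's reference solution, and the function names `online_softmax` and `two_pass_softmax` are chosen here for illustration.

```python
import math


def two_pass_softmax(xs):
    """Reference: find the max first, then exponentiate and normalise."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def online_softmax(xs):
    """Single pass over xs keeping a running max m and a rescaled sum s."""
    m = float("-inf")
    s = 0.0
    for x in xs:
        m_new = max(m, x)
        # Rescale the old sum to the new max, then add the new term.
        s = s * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    # Final normalisation uses the global max and the accumulated sum.
    return [math.exp(x - m) / s for x in xs]


scores = [1.0, 2.0, 3.0, 1000.0]
pairs = zip(online_softmax(scores), two_pass_softmax(scores))
print(all(abs(a - b) < 1e-12 for a, b in pairs))
```

The rescaling factor `exp(m - m_new)` is what lets the running sum be corrected after the fact, so the maximum never has to be known before the pass starts.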

solutions/ is empty during pre-write — populated at phase open after Phase 24's Triton conventions are visible.

Definition of Done

See PHASE_27_PLAN.md §6. Briefly:

Triton Flash forward kernel matches PyTorch reference attention to within 1e-3 at FP16.
Roofline overlay with naive/Flash/MQA/Paged dots committed at both N=64 (verb sequence) and N=2048 (stress).
KV-cache size reduction measured for MHA vs GQA vs MQA on the verb sequence.
Annotated read of vLLM's block manager committed.
src/minimodel/attention_flash.py (Triton Flash forward) implemented (Borja).
src/minimodel/attention_mqa_gqa.py (MQA / GQA variants) implemented (Borja).
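For the KV-cache item, the expected reduction can be estimated before measuring. The sketch below assumes a hypothetical configuration (4 layers, 8 query heads, head dimension 64, FP16) that is not taken from MiniGPT; only the 64-token sequence length comes from this phase. The helper name `kv_cache_bytes` is likewise illustrative.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_el=2):
    """Size of the K and V caches across all layers, in bytes."""
    # The leading factor 2 covers the separate K and V tensors.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_el


# Hypothetical model shape; only seq_len=64 comes from the phase description.
N_LAYERS, N_Q_HEADS, HEAD_DIM, SEQ_LEN = 4, 8, 64, 64

variants = {
    "MHA": N_Q_HEADS,  # one KV head per query head
    "GQA": 2,          # assumed: 2 KV groups shared by the 8 query heads
    "MQA": 1,          # a single KV head shared by every query head
}

baseline = kv_cache_bytes(N_LAYERS, variants["MHA"], HEAD_DIM, SEQ_LEN)
for name, kv_heads in variants.items():
    size = kv_cache_bytes(N_LAYERS, kv_heads, HEAD_DIM, SEQ_LEN)
    print(f"{name}: {size / 1024:6.0f} KiB  ({baseline / size:.0f}x smaller than MHA)")
```

The cache shrinks in direct proportion to the number of KV heads, so the measured ratio on the verb sequence should track the ratio of query heads to KV heads.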

What this phase intentionally does NOT cover

Flash backward. Backward needs recomputation; substantially harder and out of scope. Defer to a future phase (likely 33).
vLLM as a deployable service. Phase 31 covers inference engines; this phase reads vLLM as a reference for the paging idea, not as a thing to run.
Distributed attention (ring, sequence parallelism). Phase 35.
Mamba / SSMs / linear-attention variants. Not "modern attention" in the §4 sense; potentially a side-quest at Phase 38+.
Attention-free architectures. Out of scope.
Multi-head Latent Attention (MLA, DeepSeek). Covered conceptually in theory 04; not implemented.

Phase 27's scope is to understand why modern attention is fast, implement Flash forward, and read PagedAttention. Nothing more.