Phase 27 — Quizzes
Readable mirror of
data/quizzes/phase-27-modern-attention.yaml. Answers are hidden behind <details> blocks.
Source of truth: data/quizzes/phase-27-modern-attention.yaml.
q-27-01 — FlashAttention is exact, not approximate (free)
A teammate claims "FlashAttention is an approximation of softmax attention because it tiles the computation." Refute this in one sentence.
Answer
The online softmax recurrence is an algebraic **identity**, not an approximation. Flash's tile-by-tile execution computes the same output as monolithic softmax, up to floating-point round-off that is no worse than the naive implementation's.
q-27-02 — What stays in SRAM during the Flash inner loop
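As a companion to this question and to the exactness claim in q-27-01, the following is a minimal NumPy sketch of a tiled forward pass built on the online softmax recurrence. It is not the actual FlashAttention kernel: NumPy has no notion of SRAM or HBM, and the function names, the tile sizes `B_r` and `B_c`, and the random inputs are assumptions made for illustration. The sketch only shows which quantities the inner loop needs per tile, and that the result agrees with the monolithic computation.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Reference path: materializes the full (N, N) logits matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, B_r=4, B_c=3):
    """Tile-by-tile forward pass using the online softmax recurrence.

    B_r and B_c are illustrative tile sizes, not tuned values.
    """
    N, d = Q.shape
    O = np.zeros((N, V.shape[-1]))
    for i in range(0, N, B_r):
        Q_i = Q[i:i + B_r]
        rows = Q_i.shape[0]
        m_i = np.full(rows, -np.inf)          # running per-row max
        l_i = np.zeros(rows)                  # running per-row sum
        acc = np.zeros((rows, V.shape[-1]))   # unnormalized output tile
        for j in range(0, K.shape[0], B_c):
            # Only a (B_r, B_c) logits tile exists at any time.
            S_ij = Q_i @ K[j:j + B_c].T / np.sqrt(d)
            m_new = np.maximum(m_i, S_ij.max(axis=-1))
            scale = np.exp(m_i - m_new)       # rescales earlier tiles; 0 on the first tile
            P_ij = np.exp(S_ij - m_new[:, None])
            l_i = scale * l_i + P_ij.sum(axis=-1)
            acc = scale[:, None] * acc + P_ij @ V[j:j + B_c]
            m_i = m_new
        O[i:i + B_r] = acc / l_i[:, None]
    return O

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((10, 8)) for _ in range(3))
max_diff = np.abs(tiled_attention(Q, K, V) - naive_attention(Q, K, V)).max()
```

Here `max_diff` should sit at the level of floating-point round-off. The sequence length of 10 is deliberately not a multiple of either tile size, so the ragged final tiles are exercised as well.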
During Flash forward's inner loop on Q tile i and K/V tile j, which quantities live in SRAM (never cross the HBM boundary)?
- The (N, N) materialized logits matrix S
- The running per-row max m_i and sum l_i
- The partial logits tile S_ij of shape (B_r, B_c)
- The full pre-softmax matrix QK^T
Answer
**Choices 2 and 3.** Choices 1 and 4 both describe the full (N, N) matrix, which Flash never materializes; only the per-tile partial logits S_ij and the per-row running m_i and l_i vectors stay in SRAM.
q-27-03 — GQA KV-cache savings ratio
A model has n_heads = 32 query heads and kv_heads = 4. By what factor does GQA reduce the KV-cache bytes-per-token compared to the MHA-equivalent (kv_heads = n_heads)?
- 2×
- 4×
- 8×
- 32×
Answer
**Choice 3 (8×).** The KV cache shrinks to `kv_heads / n_heads = 4 / 32 = 1/8` of the MHA size, which is an 8× reduction.
q-27-04 — Why MQA hurts quality
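The cache arithmetic behind the GQA and MQA settings in q-27-03 and this question can be sketched in a few lines of Python. The helper `kv_cache_bytes_per_token` and the values `head_dim = 128`, `n_layers = 32`, and 2 bytes per element (fp16 or bf16) are hypothetical choices for illustration; only `n_heads = 32` and `kv_heads = 4` come from the quiz.

```python
def kv_cache_bytes_per_token(kv_heads, head_dim, n_layers, bytes_per_elem=2):
    """Bytes of KV cache one token occupies across all layers.

    Each layer stores two tensors (K and V), each holding
    kv_heads * head_dim elements per token.
    """
    return 2 * n_layers * kv_heads * head_dim * bytes_per_elem

n_heads = 32                  # from the quiz
head_dim, n_layers = 128, 32  # hypothetical model dimensions

mha = kv_cache_bytes_per_token(n_heads, head_dim, n_layers)  # kv_heads = n_heads
gqa = kv_cache_bytes_per_token(4, head_dim, n_layers)        # kv_heads = 4
mqa = kv_cache_bytes_per_token(1, head_dim, n_layers)        # kv_heads = 1

gqa_reduction = mha // gqa    # 8
mqa_reduction = mha // mqa    # 32
```

Because `head_dim`, `n_layers`, and the element size appear in both numerator and denominator, the reduction factor depends only on `n_heads / kv_heads`, whatever the assumed model dimensions are.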
MQA (kv_heads = 1) saves the most KV memory of any GQA setting. Why is it not the default for production models despite that win?
- Because MQA requires retraining from scratch.
- Because all query heads then attend to identical K, V positions, collapsing per-head pattern diversity and producing a measurable quality hit.
- Because MQA doubles the FLOPs at inference.
- Because MQA is incompatible with FlashAttention.
Answer
**Choice 2.** With a single shared K and V, the heads can differ only through their Q and W_O projections, so the diversity of per-head attention *patterns* collapses. Production models typically stop at `kv_heads = n_heads / 8`.
q-27-05 — Sliding window vs PagedAttention (free)
Both sliding-window attention and PagedAttention reduce KV-cache memory footprint. What is the qualitative difference between what they achieve?