Phase 27 — Quizzes

Readable mirror of `data/quizzes/phase-27-modern-attention.yaml`, which is the source of truth. Answers sit behind `<details>` blocks.

q-27-01 — FlashAttention is exact, not approximate (free)

A teammate claims "FlashAttention is an approximation of softmax attention because it tiles the computation." Refute this in one sentence.

Answer

The online softmax recurrence is an algebraic **identity**, not an approximation: Flash's tile-by-tile execution computes the same output as monolithic softmax, with floating-point round-off no worse than the naive implementation's.
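The identity can be checked numerically on a single attention row. The sketch below is a minimal pure-Python illustration, not Flash's actual kernel: it assumes scalar values per position, and the function names and the tile size of 4 are made up for the example. It keeps only a running max `m`, a running sum `l`, and an unnormalized accumulator, rescaling them whenever a new tile raises the max.

```python
import math

def monolithic_attention_row(scores, values):
    """Reference: softmax over the whole row, then a weighted sum of values."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e * v for e, v in zip(exps, values)) / total

def tiled_attention_row(scores, values, tile=4):
    """Online softmax: visit the row tile by tile, carrying only m, l, acc."""
    m = -math.inf   # running row max
    l = 0.0         # running sum of exp(score - m)
    acc = 0.0       # running unnormalized output
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        m_new = max(m, max(s_tile))
        # Rescale the old statistics to the new max (exp(-inf) == 0.0 on the first tile).
        scale = math.exp(m - m_new)
        p = [math.exp(s - m_new) for s in s_tile]
        l = l * scale + sum(p)
        acc = acc * scale + sum(pi * vi for pi, vi in zip(p, v_tile))
        m = m_new
    return acc / l

scores = [0.3, -1.2, 4.0, 2.5, 9.1, -3.3, 0.0, 7.7, 1.1, 5.5]
values = [1.0, 2.0, -1.0, 0.5, 3.0, 4.0, -2.0, 0.25, 1.5, -0.75]
print(monolithic_attention_row(scores, values))
print(tiled_attention_row(scores, values))
```

The two printed numbers should agree to within floating-point round-off, whatever tile size is chosen.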

q-27-02 — What stays in SRAM during the Flash inner loop

During Flash forward's inner loop on Q tile i and K/V tile j, which quantities live in SRAM (never cross the HBM boundary)?

1. The (N, N) materialized logits matrix S
2. The running per-row max m_i and sum l_i
3. The partial logits tile S_ij of shape (B_r, B_c)
4. The full pre-softmax matrix QK^T

Answer

**Choices 2 and 3.** The full (N, N) matrices never exist in Flash; only per-tile partial logits and per-row running m, l vectors stay in SRAM.
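To get a feel for the size gap, the following sketch counts elements only. The sizes `N = 4096` and `B_r = B_c = 64` are assumptions picked for illustration, the helper names are hypothetical, and the count ignores the Q, K, V tiles and the output accumulator that a real kernel also keeps on chip.

```python
def flash_tile_elements(b_r, b_c):
    """Elements in one S_ij tile plus the running m_i and l_i for its B_r rows."""
    return b_r * b_c + 2 * b_r

def full_logits_elements(n):
    """Elements in the (N, N) logits matrix that Flash never materializes."""
    return n * n

N, B_R, B_C = 4096, 64, 64  # illustrative sizes, not taken from the quiz
print(flash_tile_elements(B_R, B_C))   # 4224
print(full_logits_elements(N))         # 16777216
```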

q-27-03 — GQA KV-cache savings ratio

A model has n_heads = 32 query heads and kv_heads = 4. By what factor does GQA reduce the KV-cache bytes-per-token compared to the MHA-equivalent (kv_heads = n_heads)?

1. 2×
2. 4×
3. 8×
4. 32×

Answer

**Choice 3 (8×).** The GQA cache is exactly `kv_heads / n_heads = 4 / 32 = 1/8` the size of the MHA cache, so the reduction factor is `n_heads / kv_heads = 32 / 4 = 8`.
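The arithmetic can be spelled out as a short sketch. The helper name is hypothetical, and the layer count, head dimension, and 2-byte element size are assumptions chosen for illustration; they cancel in the ratio, which depends only on the head counts.

```python
def kv_cache_bytes_per_token(n_layers, kv_heads, head_dim, bytes_per_elem=2):
    """Bytes cached per token: one K and one V vector per KV head per layer."""
    return 2 * n_layers * kv_heads * head_dim * bytes_per_elem

# Assumed model shape (not from the quiz): 32 layers, head_dim 128, fp16.
N_LAYERS, HEAD_DIM = 32, 128
mha = kv_cache_bytes_per_token(N_LAYERS, kv_heads=32, head_dim=HEAD_DIM)
gqa = kv_cache_bytes_per_token(N_LAYERS, kv_heads=4, head_dim=HEAD_DIM)

print(mha)         # 524288
print(gqa)         # 65536
print(mha // gqa)  # 8
```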

q-27-04 — Why MQA hurts quality

MQA (kv_heads = 1) saves the most KV memory of any GQA setting. Why is it not the default for production models despite that win?

1. Because MQA requires retraining from scratch.
2. Because all query heads then attend to identical K, V positions, collapsing per-head pattern diversity and producing a measurable quality hit.
3. Because MQA doubles the FLOPs at inference.
4. Because MQA is incompatible with FlashAttention.

Answer

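**Choice 2.** With a single shared K, V, the heads can differ only in their Q and W_O projections; every head scores against the same keys and reads the same values, so the attention *patterns* collapse toward one another. Production models typically stop at a moderate grouping such as `kv_heads = n_heads / 8` rather than going all the way to 1.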
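The sharing pattern is easy to see by listing which KV head each query head reads. This sketch assumes the common contiguous grouping, where query head `h` uses KV head `h // (n_heads // kv_heads)`; the function name is hypothetical and real implementations may lay groups out differently.

```python
def kv_head_for_query_head(h, n_heads, kv_heads):
    """Map a query head index to the KV head it shares (contiguous groups assumed)."""
    if n_heads % kv_heads != 0:
        raise ValueError("n_heads must be a multiple of kv_heads")
    group_size = n_heads // kv_heads
    return h // group_size

n_heads = 32
gqa_map = [kv_head_for_query_head(h, n_heads, kv_heads=4) for h in range(n_heads)]
mqa_map = [kv_head_for_query_head(h, n_heads, kv_heads=1) for h in range(n_heads)]

print(sorted(set(gqa_map)))  # [0, 1, 2, 3]
print(sorted(set(mqa_map)))  # [0]
```

With `kv_heads = 4` there are still four distinct K, V sets for the heads to specialize on; with `kv_heads = 1` every query head reads the same one.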

q-27-05 — Sliding window vs PagedAttention (free)

Both sliding-window attention and PagedAttention reduce KV-cache memory footprint. What is the qualitative difference between what they achieve?

Answer

**Sliding window** *drops* history beyond `W` tokens — bounded context per layer, exact within the window. **PagedAttention** keeps all history but stores it in non-contiguous pages so memory is allocated lazily and can be shared across sequences — no information loss, just smarter allocation.
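The contrast can be sketched with two toy caches. This is an illustrative model only, not the API of any serving library: the class names, the page size of 4 tokens, and the shared `PagePool` are all assumptions, and the entries are plain Python objects standing in for K, V tensors.

```python
from collections import deque

PAGE_SIZE = 4  # tokens per page (illustrative)

class SlidingWindowCache:
    """Keeps only the last `window` entries; older history is discarded."""
    def __init__(self, window):
        self.entries = deque(maxlen=window)

    def append(self, kv):
        self.entries.append(kv)  # the oldest entry falls off once the window is full

class PagePool:
    """Shared pool of fixed-size pages, handed out on demand."""
    def __init__(self):
        self.pages = []

    def allocate(self):
        self.pages.append([])
        return len(self.pages) - 1

class PagedCache:
    """Keeps all history; a block table maps the sequence onto scattered pages."""
    def __init__(self, pool):
        self.pool = pool
        self.block_table = []
        self.length = 0

    def append(self, kv):
        if self.length % PAGE_SIZE == 0:  # last page is full (or no page yet)
            self.block_table.append(self.pool.allocate())
        self.pool.pages[self.block_table[-1]].append(kv)
        self.length += 1

    def tokens(self):
        return [kv for page_id in self.block_table for kv in self.pool.pages[page_id]]

window = SlidingWindowCache(window=4)
pool = PagePool()
seq_a, seq_b = PagedCache(pool), PagedCache(pool)
for t in range(6):
    window.append(t)
    seq_a.append(("a", t))
    seq_b.append(("b", t))

print(list(window.entries))  # [2, 3, 4, 5]
print(seq_a.block_table)     # [0, 2]
print(seq_b.block_table)     # [1, 3]
print(len(seq_a.tokens()))   # 6
```

The sliding-window cache has forgotten tokens 0 and 1, while each paged sequence still holds all six tokens, spread over non-adjacent pages that were allocated only when needed.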