Phase 15 — Quiz (human-readable mirror)
Human-readable mirror of the canonical quiz file.
Source: data/quizzes/phase-15-attention.yaml.
q-15-01 — Why divide by sqrt(d_k)? (single)
- To make the attention rows sum exactly to 1
- To keep the variance of the logits roughly constant as d_k grows ✓
- To compensate for floating-point underflow in fp16
- To match the original transformer's parameter count
Without scaling, logits grow with d_k and softmax saturates. Dividing by sqrt(d_k) preserves O(1) variance and healthy gradients.
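The growth of the logits with d_k can be checked empirically. The sketch below is a minimal illustration, assuming query and key entries drawn i.i.d. from N(0, 1); the helper name `logit_std` and the trial count are hypothetical choices, not part of the quiz source.

```python
import math
import random


def logit_std(d_k, scale, trials=1000, seed=0):
    """Empirical std of q.k for N(0,1) entries, optionally divided by sqrt(d_k)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dot = sum(a * b for a, b in zip(q, k))
        samples.append(dot / math.sqrt(d_k) if scale else dot)
    mean = sum(samples) / trials
    var = sum((s - mean) ** 2 for s in samples) / trials
    return math.sqrt(var)


# Unscaled std should track sqrt(d_k); scaled std should stay near 1.
for d_k in (4, 16, 64, 256):
    unscaled = logit_std(d_k, scale=False)
    scaled = logit_std(d_k, scale=True)
    print(d_k, round(unscaled, 2), round(scaled, 2))
```

The unscaled column should grow roughly like sqrt(d_k) while the scaled column stays close to 1, up to sampling noise.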
q-15-02 — Properties of the causal mask (multi)
- It is upper-triangular with -inf above the diagonal ✓
- It is symmetric so each token sees every other token
- It permits a token to attend to itself ✓
- It permits a token to attend to future tokens
Future positions are set to -inf (softmax → 0); the diagonal is kept so each token attends to itself.
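A small sketch of the mask described above, assuming an additive mask (0 where attention is allowed, -inf where it is not) applied to the logits before softmax. The function names and the uniform example scores are illustrative assumptions.

```python
import math

NEG_INF = float("-inf")


def causal_mask(n):
    """Additive mask: 0 on and below the diagonal, -inf strictly above it."""
    return [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]


def softmax(row):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention_weights(scores):
    """Add the causal mask to an n x n score matrix, then softmax each row."""
    n = len(scores)
    mask = causal_mask(n)
    return [softmax([s + m for s, m in zip(scores[i], mask[i])]) for i in range(n)]


scores = [[0.0] * 4 for _ in range(4)]  # uniform logits, purely illustrative
for row in masked_attention_weights(scores):
    print([round(w, 2) for w in row])
```

With uniform scores, row i spreads its weight evenly over positions 0..i and gives exactly zero weight to every later position, so the mask is neither symmetric nor forward-looking.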
q-15-03 — Time complexity of self-attention (free)
Expected to contain: n^2.
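Computing the N×N attention matrix costs O(N²·d) time. FlashAttention keeps that compute but avoids materializing the full N×N matrix, which cuts memory use to O(N); linear-attention variants approximate the softmax to reduce the time complexity itself.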
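The quadratic term can be made concrete by counting multiply-adds. This is a rough sketch for naive single-head attention, assuming the value dimension equals d and ignoring the softmax itself; `attention_multiply_adds` is a hypothetical helper.

```python
def attention_multiply_adds(n, d):
    """Multiply-adds for naive single-head self-attention (softmax cost ignored)."""
    scores = n * n * d    # Q @ K^T: n*n dot products, each of length d
    weighted = n * n * d  # weights @ V: n*n weights, each scaling a length-d value row
    return scores + weighted


# Doubling the sequence length n should quadruple the count.
for n in (512, 1024, 2048):
    print(n, attention_multiply_adds(n, d=64))
```

Doubling n quadruples the count, while doubling d only doubles it.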
q-15-04 — Derive: softmax variance under unscaled QK^T (single)
With the entries of Q and K drawn i.i.d. from N(0, 1) and d_k = 64, what is the std of an entry of Q @ K^T?
- 1
- sqrt(64) = 8 ✓
- 64
- log(64) ≈ 4.16
Each entry is a sum of d_k products of independent N(0,1) variables. Each product has mean 0 and variance 1, so Var = d_k = 64; std = 8. Logits with std 8 saturate softmax. Dividing by sqrt(d_k) restores std ≈ 1.
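Written out step by step, with s denoting one entry of Q @ K^T and q_i, k_i the components of the corresponding query and key (symbols introduced here for the derivation only):

```latex
\begin{aligned}
s &= \sum_{i=1}^{d_k} q_i k_i, \qquad q_i, k_i \sim \mathcal{N}(0, 1) \text{ i.i.d.} \\
\mathbb{E}[q_i k_i] &= \mathbb{E}[q_i]\,\mathbb{E}[k_i] = 0 \\
\operatorname{Var}(q_i k_i) &= \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] - 0^2 = 1 \\
\operatorname{Var}(s) &= \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i) = d_k = 64,
  \qquad \operatorname{std}(s) = \sqrt{64} = 8 \\
\operatorname{Var}\!\left(\frac{s}{\sqrt{d_k}}\right) &= \frac{d_k}{d_k} = 1
\end{aligned}
```

The fourth line uses the independence of the d_k products, so their variances add.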
q-15-05 — Find the bug: attention without sqrt(d_k) at d_k = 64 (free)
Expected to contain: satur.
Softmax saturates (one entry near 1, rest near 0). Gradients through attention vanish. At small d_k the bug is invisible. See break/00-break-no-sqrt-dk-scaling.md.
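The saturation can be observed directly by looking at the largest softmax weight per row. The sketch below is an illustration under assumed conditions (N(0, 1) queries and keys, 32 keys per query, a fixed seed); it does not reproduce the referenced break exercise, and `mean_max_weight` is a hypothetical helper.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def mean_max_weight(d_k, scale, n_keys=32, trials=100, seed=0):
    """Average largest attention weight over random queries, with or without 1/sqrt(d_k)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        logits = []
        for _ in range(n_keys):
            k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
            dot = sum(a * b for a, b in zip(q, k))
            logits.append(dot / math.sqrt(d_k) if scale else dot)
        total += max(softmax(logits))
    return total / trials


print("d_k=64 unscaled:", round(mean_max_weight(64, scale=False), 3))
print("d_k=64 scaled:  ", round(mean_max_weight(64, scale=True), 3))
print("d_k=2  unscaled:", round(mean_max_weight(2, scale=False), 3))
```

A mean maximum weight close to 1 indicates near one-hot rows. Since the softmax Jacobian entries are p_i(δ_ij − p_j), they shrink toward zero as a row approaches one-hot, which is why gradients vanish. The d_k = 2 run shows why the missing scale is easy to overlook at small d_k.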