
Phase 15 — Quiz (human-readable mirror)

Human-readable mirror of the canonical data/quizzes/phase-15-attention.yaml.

Source: data/quizzes/phase-15-attention.yaml.

q-15-01 — Why divide by sqrt(d_k)? (single)

To make the attention rows sum exactly to 1
To keep the variance of the logits roughly constant as d_k grows ✓
To compensate for floating-point underflow in fp16
To match the original transformer's parameter count

Without scaling, the variance of the logits grows in proportion to d_k and softmax saturates. Dividing by sqrt(d_k) keeps the variance at O(1) and the gradients healthy.
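The scaling argument can be checked empirically. The sketch below is illustrative only: it assumes Q and K have i.i.d. N(0, 1) entries, and the helper name `logit_std`, the sample size `n`, and the seed are hypothetical choices rather than anything taken from the quiz source.

```python
import numpy as np

def logit_std(d_k, n=256, scaled=False, seed=0):
    """Empirical std of the entries of Q @ K.T for Q, K with i.i.d. N(0, 1) entries."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n, d_k))
    k = rng.standard_normal((n, d_k))
    logits = q @ k.T
    if scaled:
        # The scaling under discussion: divide by sqrt(d_k).
        logits = logits / np.sqrt(d_k)
    return float(logits.std())

# Unscaled std should track sqrt(d_k); scaled std should stay near 1.
for d_k in (4, 64, 1024):
    print(d_k, round(logit_std(d_k), 2), round(logit_std(d_k, scaled=True), 2))
```

Under these assumptions the unscaled column should grow roughly like sqrt(d_k) while the scaled column stays close to 1.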

q-15-02 — Properties of the causal mask (multi)

It is upper-triangular with -inf above the diagonal ✓
It is symmetric so each token sees every other token
It permits a token to attend to itself ✓
It permits a token to attend to future tokens

Future positions are set to -inf (softmax → 0); the diagonal is kept so each token attends to itself.
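A minimal NumPy sketch of such a mask, assuming an additive formulation (0 where attention is allowed, -inf where it is not). The names `causal_mask` and `softmax` are hypothetical helpers, and the all-zero scores are chosen only to make the effect easy to read.

```python
import numpy as np

def causal_mask(n):
    """Additive mask: 0 on and below the diagonal, -inf strictly above it."""
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True for future positions
    return np.where(future, -np.inf, 0.0)

def softmax(x):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # uniform logits, for illustration only
weights = softmax(scores + causal_mask(4))
print(weights)
```

Each row still sums to 1, every entry above the diagonal is exactly 0, and the diagonal stays positive, so token i attends only to positions 0 through i, itself included.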

q-15-03 — Time complexity of self-attention (free)

Expected to contain: n^2.

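Computing the N×N attention matrix is O(N²·d). Flash attention keeps the same O(N²·d) compute but avoids materializing the N×N matrix in memory; linear-attention variants approximate attention to reduce the time complexity itself.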
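To make the quadratic term concrete, here is a rough operation count for naive single-head self-attention. It is a sketch under simplifying assumptions: it counts only the two matrix products, ignores the softmax and the input and output projections, and `attention_mult_adds` is a hypothetical helper name.

```python
def attention_mult_adds(n, d):
    """Multiply-adds for naive single-head self-attention (projections excluded)."""
    scores = n * n * d  # Q @ K.T: n x n entries, each a length-d dot product
    mix = n * n * d     # weights @ V: n x d outputs, each a length-n weighted sum
    return scores + mix

print(attention_mult_adds(1024, 64))  # 134217728
print(attention_mult_adds(2048, 64) / attention_mult_adds(1024, 64))  # 4.0
```

Doubling the sequence length quadruples the count, which is the n^2 behaviour the question asks for.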

q-15-04 — Derive: softmax variance under unscaled QK^T (single)

With Q, K ∼ N(0, 1) i.i.d. and d_k = 64, what is the std of an entry of Q @ K^T?

1
sqrt(64) = 8 ✓
64
log(64) ≈ 4.16

Each entry is a sum of d_k independent products of N(0,1) variables, and each product has mean 0 and variance 1. So Var = d_k = 64 and std = 8. Logits with std 8 saturate softmax. Dividing by sqrt(d_k) restores std ≈ 1.
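The derivation can be written out term by term. This is a sketch that assumes the components q_i and k_i are mutually independent with zero mean and unit variance; the symbol s for a single logit is introduced here for illustration.

```latex
s = \sum_{i=1}^{d_k} q_i k_i, \qquad
\mathbb{E}[q_i k_i] = \mathbb{E}[q_i]\,\mathbb{E}[k_i] = 0, \qquad
\operatorname{Var}(q_i k_i) = \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] = 1

\operatorname{Var}(s) = \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i) = d_k = 64, \qquad
\operatorname{std}(s) = \sqrt{64} = 8, \qquad
\operatorname{Var}\!\left(\frac{s}{\sqrt{d_k}}\right) = \frac{d_k}{d_k} = 1
```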

q-15-05 — Find the bug: attention without sqrt(d_k) at d_k = 64 (free)

Expected to contain: satur.

Softmax saturates (one entry near 1, rest near 0), so gradients through attention vanish. At small d_k the logits stay small and the bug is nearly invisible. See break/00-break-no-sqrt-dk-scaling.md.
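One way to observe the saturation is to compare the largest attention weight per row with and without scaling. The sketch below assumes random N(0, 1) queries and keys; `mean_peak_weight`, the number of tokens `n`, and the seed are hypothetical choices, and it is not the reproduction in the linked break file.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def mean_peak_weight(d_k, scale, n=128, seed=0):
    """Average over rows of the largest attention weight in each row."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n, d_k))
    k = rng.standard_normal((n, d_k))
    logits = q @ k.T
    if scale:
        logits = logits / np.sqrt(d_k)
    return float(softmax(logits).max(axis=-1).mean())

for d_k in (4, 64):
    print(d_k, mean_peak_weight(d_k, scale=False), mean_peak_weight(d_k, scale=True))
```

A peak weight close to 1 means the row is nearly one-hot. The softmax Jacobian entries p_i(δ_ij − p_j) all approach 0 in that regime, which is why the gradients shrink.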