Phase 15 — Quiz (human-readable mirror)

Human-readable mirror of the canonical data/quizzes/phase-15-attention.yaml.

Source: data/quizzes/phase-15-attention.yaml.


q-15-01 — Why divide by sqrt(d_k)? (single)

  • To make the attention rows sum exactly to 1
  • To keep the variance of the logits roughly constant as d_k grows
  • To compensate for floating-point underflow in fp16
  • To match the original transformer's parameter count

Without scaling, the variance of each logit grows linearly with d_k (its std grows as sqrt(d_k)), so softmax saturates. Dividing by sqrt(d_k) keeps the variance O(1) and the gradients healthy.
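
The scaling argument can be checked empirically. The sketch below is not part of the quiz source; it assumes unit-normal queries and keys, and the helper name logit_std is hypothetical. It measures the standard deviation of q·k with and without the sqrt(d_k) divisor.

```python
import numpy as np

def logit_std(d_k, n=5000, scaled=False, seed=0):
    """Empirical std of the dot product q.k for unit-normal q, k of size d_k."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n, d_k))
    k = rng.standard_normal((n, d_k))
    logits = np.sum(q * k, axis=1)      # n independent dot products
    if scaled:
        logits = logits / np.sqrt(d_k)  # the attention scaling factor
    return float(logits.std())

# Unscaled std should track sqrt(d_k); scaled std should stay near 1.
for d_k in (16, 64, 256):
    print(d_k, round(logit_std(d_k), 2), round(logit_std(d_k, scaled=True), 2))
```

Under these assumptions the unscaled column should sit close to sqrt(d_k) while the scaled column stays close to 1, whatever the value of d_k.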


q-15-02 — Properties of the causal mask (multi)

  • It is upper-triangular with -inf above the diagonal
  • It is symmetric so each token sees every other token
  • It permits a token to attend to itself
  • It permits a token to attend to future tokens

Future positions are set to -inf (softmax → 0); the diagonal is kept so each token attends to itself.
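
A minimal sketch of such a mask, assuming an additive mask applied to the logits before softmax (the function names causal_mask and softmax are illustrative, not taken from the quiz source):

```python
import numpy as np

def causal_mask(n):
    """Additive mask: 0 on and below the diagonal, -inf strictly above it."""
    mask = np.zeros((n, n))
    mask[np.triu_indices(n, k=1)] = -np.inf  # k=1 leaves the diagonal untouched
    return mask

def softmax(x):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)                            # exp(-inf) evaluates to 0
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))                    # uniform logits, for illustration only
weights = softmax(scores + causal_mask(4))
print(weights)
```

Row i of weights is nonzero only for columns 0..i, so every token attends to itself and to earlier tokens, never to later ones. Because the diagonal is always finite, no row is entirely -inf and the softmax stays well defined.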


q-15-03 — Time complexity of self-attention (free)

Expected to contain: n^2.

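Computing the N×N attention matrix costs O(N²·d) time for sequence length N and head dimension d. FlashAttention keeps that arithmetic cost but avoids materializing the N×N matrix, cutting memory from O(N²) to O(N); linear-attention variants aim to reduce the time complexity itself.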
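
To see where the quadratic term comes from, here is a rough single-head sketch in NumPy. It is an illustration under simple assumptions (no batching, no masking), and the names naive_attention and matmul_madds are hypothetical.

```python
import numpy as np

def naive_attention(q, k, v):
    """Single-head attention that materializes the full (n, n) score matrix."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)   # (n, n): n*n dot products of length d
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights     # shapes (n, d) and (n, n)

def matmul_madds(n, d):
    """Multiply-adds for q @ k.T (n*n*d) plus weights @ v (n*n*d)."""
    return 2 * n * n * d

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 4)) for _ in range(3))
out, weights = naive_attention(q, k, v)
print(out.shape, weights.shape)
print(matmul_madds(2048, 64) / matmul_madds(1024, 64))
```

Doubling the sequence length quadruples the multiply-add count, while doubling d only doubles it.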


q-15-04 — Derive: softmax variance under unscaled QK^T (single)

With Q, K ∼ N(0, 1) i.i.d. and d_k = 64, what is the std of an entry of Q @ K^T?

  • 1
  • sqrt(64) = 8
  • 64
  • log(64) ≈ 4.16

Each entry is a sum of d_k products of N(0,1) variables. Var = d_k = 64; std = 8. Logits with std 8 saturate softmax. Dividing by sqrt(d_k) restores std ≈ 1.
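
The steps behind that answer can be written out as follows. The symbols s, q_i and k_i are notation introduced here for one entry of Q @ K^T and the components of the corresponding query and key rows; they do not appear in the quiz source.

```latex
\begin{align*}
s &= \sum_{i=1}^{d_k} q_i k_i,
  \qquad q_i, k_i \sim \mathcal{N}(0, 1) \ \text{i.i.d.} \\
\mathbb{E}[q_i k_i] &= \mathbb{E}[q_i]\,\mathbb{E}[k_i] = 0 \\
\operatorname{Var}(q_i k_i) &= \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] - 0^2 = 1 \\
\operatorname{Var}(s) &= \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i) = d_k = 64,
  \qquad \operatorname{std}(s) = \sqrt{64} = 8 \\
\operatorname{Var}\!\left(\frac{s}{\sqrt{d_k}}\right) &= \frac{d_k}{d_k} = 1
\end{align*}
```

The variances add because the d_k products are independent of one another.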


q-15-05 — Find the bug: attention without sqrt(d_k) at d_k = 64 (free)

Expected to contain: satur.

Softmax saturates (one entry near 1, rest near 0). Gradients through attention vanish. At small d_k the bug is invisible. See break/00-break-no-sqrt-dk-scaling.md.
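
One way to observe the saturation is to track the largest attention weight in each row. The sketch below assumes random unit-normal Q and K with 128 positions; mean_max_weight is a hypothetical helper and is unrelated to the referenced break file.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def mean_max_weight(d_k, n=128, scale=True, seed=0):
    """Average of the largest attention weight per row for random Q, K."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n, d_k))
    k = rng.standard_normal((n, d_k))
    logits = q @ k.T
    if scale:
        logits = logits / np.sqrt(d_k)
    return float(softmax(logits).max(axis=-1).mean())

# Compare the buggy (unscaled) and correct (scaled) variants at two sizes.
for d_k in (4, 64):
    print(d_k, mean_max_weight(d_k, scale=False), mean_max_weight(d_k, scale=True))
```

A value near 1 means one key dominates each row, which is the saturated regime; a value near 1/n means the weights are spread out. A regression test could assert that the scaled value at d_k = 64 stays well below the unscaled one.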