Phase 17 — Quiz (human-readable mirror)
Human-readable mirror of the canonical source: data/quizzes/phase-17-mini-gpt.yaml.
q-17-01 — Per-layer parameter count (single)
A Pre-LN transformer block with d_model = d, d_ff = 4d, no biases. Approximate parameters per layer (ignoring norms)?
- 4 d²
- 8 d²
- 12 d² ✓
- 16 d²
Attention: 4·d² (Q, K, V, O). FFN: 2·d·d_ff = 8·d². Total ≈ 12·d². Rule of thumb that matches GPT-3 and LLaMA within 5%.
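The count above can be checked with a few lines of Python. This is a minimal sketch: the function name `block_params` is hypothetical, and it assumes four d×d attention projections and a two-matrix FFN with no biases, as stated in the question.

```python
def block_params(d: int, d_ff: int) -> dict:
    """Weight count of one transformer block, ignoring norms and biases."""
    attention = 4 * d * d   # W_Q, W_K, W_V, W_O, each d x d
    ffn = 2 * d * d_ff      # up-projection (d x d_ff) and down-projection (d_ff x d)
    return {"attention": attention, "ffn": ffn, "total": attention + ffn}


d = 64
counts = block_params(d, 4 * d)
print(counts)
print(counts["total"] == 12 * d * d)  # the 12·d² rule of thumb
```

With d_ff = 4d the total is exactly 12·d²; other FFN widths change only the `ffn` term.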
q-17-02 — Training loss is suspiciously low (single)
Mini-GPT trains to loss 1e-4 in 10 steps, well below H(corpus) ≈ 1.5. Sampling produces "
- Learning rate is too high
- Causal mask is not applied (the model can see the future during training) ✓
- Initialization is wrong
- The optimizer is broken
Loss below H(corpus) is mathematically impossible without leakage. The "peek at the next token" trick fails at inference → BoS collapse.
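To make the leakage concrete, here is a small pure-Python sketch of what a causal mask does to attention weights. The helper `causal_attention_weights` and the toy scores are illustrative assumptions, not necessarily how the course's mini-GPT implements masking.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over scores[t][s] with future positions (s > t) masked out."""
    T = len(scores)
    weights = []
    for t in range(T):
        # Future positions get -inf, so they receive exactly zero weight.
        masked = [scores[t][s] if s <= t else -math.inf for s in range(T)]
        m = max(masked)  # finite, because position s == t is always visible
        exps = [math.exp(x - m) for x in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


scores = [[0.0] * 4 for _ in range(4)]  # uniform toy scores
for row in causal_attention_weights(scores):
    print([round(w, 2) for w in row])
```

Each row t puts zero weight on positions after t. Without the masking step, position t can attend to token t+1, which is exactly the target it is trained to predict.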
q-17-03 — Why tied embeddings save parameters (free)
Expected to contain: V.
Untied: 2·V·d. Tied: V·d. Saves V·d params; for GPT-3 (V=50257, d=12288) that's ~620M parameters.
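The saving can be reproduced directly. This is a minimal sketch: `embedding_params` is a hypothetical helper that counts one V×d matrix when the input embedding and LM head are tied, and two otherwise.

```python
def embedding_params(vocab_size: int, d_model: int, tied: bool) -> int:
    """Parameters in the token embedding plus LM head."""
    matrices = 1 if tied else 2  # tied: the LM head reuses the embedding matrix
    return matrices * vocab_size * d_model


V, d = 50257, 12288  # GPT-3 values quoted above
saved = embedding_params(V, d, tied=False) - embedding_params(V, d, tied=True)
print(f"{saved:,}")  # 617,558,016, i.e. roughly 620M
```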
q-17-04 — Pre-LN gradient (single)
In y = x + f(LN(x)), the gradient ∂y/∂x is:
- Exactly I
- I + ∂(f∘LN)/∂x ✓
- ∂LN/∂x · (I + ∂f/∂x)
- ∂LN/∂x · ∂f/∂x
The residual sum gives I (the identity term) directly; Pre-LN keeps the LN inside f. The identity term guarantees a unit-gain gradient highway.
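Written out with the chain rule, the answer follows in one step. The intermediate symbol u = LN(x) is introduced here only for the derivation.

```latex
y = x + f(\mathrm{LN}(x)), \qquad u = \mathrm{LN}(x)
\;\Longrightarrow\;
\frac{\partial y}{\partial x}
  = I + \frac{\partial f}{\partial u}\,\frac{\partial \mathrm{LN}}{\partial x}
  = I + \frac{\partial (f \circ \mathrm{LN})}{\partial x}
```

The normalization Jacobian multiplies only the branch term, so the identity path is left unscaled.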
q-17-05 — Compute the §A13 mini-GPT parameter count (single)
V=512, d=64, L=2, d_ff=256, RoPE, tied LM head, no biases.
- ~32K
- ~65K
- ~131K ✓
- ~262K
32,768 + 2·49,280 + 64 = 131,392 ≈ 131K. See Phase 17 §04.
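The three terms can be reproduced with a short script. This is a sketch under one assumption not stated in the question: each norm carries a gain vector of d parameters and no bias, with two norms per layer and one final norm. That reading is inferred from 49,280 = 12·64² + 2·64 and from the trailing 64. The function name `mini_gpt_params` is hypothetical.

```python
def mini_gpt_params(V: int, d: int, L: int, d_ff: int) -> int:
    """Parameter count for a tied-head, bias-free, RoPE mini-GPT."""
    embedding = V * d                             # counted once (tied LM head); RoPE adds none
    per_layer = 4 * d * d + 2 * d * d_ff + 2 * d  # attention + FFN + two norm gains (assumed)
    final_norm = d                                # assumed gain-only final norm
    return embedding + L * per_layer + final_norm


total = mini_gpt_params(V=512, d=64, L=2, d_ff=256)
print(f"{total:,}")  # 131,392
```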