Phase 17 — Quiz (human-readable mirror)

Human-readable mirror of the canonical data/quizzes/phase-17-mini-gpt.yaml.

Source: data/quizzes/phase-17-mini-gpt.yaml.


q-17-01 — Per-layer parameter count (single)

A Pre-LN transformer block with d_model = d, d_ff = 4d, no biases. Approximate parameters per layer (ignoring norms)?

  • 4 d²
  • 8 d²
  • 12 d²
  • 16 d²

Attention: 4·d² (the Q, K, V, and O projections, each d×d). FFN: 2·d·d_ff = 2·d·4d = 8·d². Total ≈ 12·d². This rule of thumb matches GPT-3 and LLaMA within 5%.
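The count above can be checked with a few lines of Python. This is a minimal sketch under the question's assumptions (no biases, norms ignored, two FFN matrices); the function name `block_params` is illustrative and not part of the course code.

```python
def block_params(d, d_ff=None):
    """Weight count of one transformer block, ignoring norms and biases."""
    if d_ff is None:
        d_ff = 4 * d          # the d_ff = 4d convention from the question
    attn = 4 * d * d          # Q, K, V, O projections, each d x d
    ffn = 2 * d * d_ff        # up-projection (d x d_ff) and down-projection (d_ff x d)
    return attn + ffn


d = 64
print(block_params(d), 12 * d * d)  # both values are 49152
```

With d_ff = 4d the result equals 12·d² exactly; other d_ff ratios change only the FFN term.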


q-17-02 — Training loss is suspiciously low (single)

Mini-GPT trains to loss 1e-4 in 10 steps, well below H(corpus) ≈ 1.5. Sampling produces " ...". Most likely cause?

  • Learning rate is too high
  • Causal mask is not applied (the model can see the future during training)
  • Initialization is wrong
  • The optimizer is broken

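A loss far below H(corpus) after 10 steps cannot come from honest next-token prediction; it means the targets are leaking into the inputs. Without the causal mask, each position can attend to the very token it is supposed to predict. That "peek at the next token" shortcut is unavailable at inference, where future tokens do not exist yet, so sampling degenerates (BoS collapse).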
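One way to catch this kind of leak is a perturbation test: change only the last token and check whether any earlier position's output moves. The sketch below uses a toy single-head attention with identity Q/K/V projections in pure Python; `attention` and `leaks_future` are hypothetical helpers for illustration, not the course's model code.

```python
import math
import random


def attention(x, causal):
    """Toy single-head self-attention with identity Q/K/V projections."""
    T, d = len(x), len(x[0])
    out = []
    for i in range(T):
        # With the causal mask, position i sees 0..i; without it, every position.
        visible = range(i + 1) if causal else range(T)
        scores = [sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d)
                  for j in visible]
        m = max(scores)                      # subtract max for a stable softmax
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w / z * x[j][k] for w, j in zip(weights, visible))
                    for k in range(d)])
    return out


def leaks_future(causal, T=5, d=4, seed=0):
    """Return True if changing the last token alters any earlier output."""
    rng = random.Random(seed)
    x = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(T)]
    base = attention(x, causal)
    x_perturbed = [row[:] for row in x]
    x_perturbed[-1] = [v + 1.0 for v in x_perturbed[-1]]  # touch only the last token
    changed = attention(x_perturbed, causal)
    return any(abs(a - b) > 1e-9
               for i in range(T - 1)
               for a, b in zip(base[i], changed[i]))


print("causal mask on :", leaks_future(causal=True))
print("causal mask off:", leaks_future(causal=False))
```

The same idea carries over to a full model: if perturbing token t changes the logits at any position before t, information is flowing backward in time.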


q-17-03 — Why tied embeddings save parameters (free)

Expected to contain: V.

Untied: 2·V·d (input embedding plus LM head). Tied: V·d, because both share one matrix. Tying saves V·d parameters; for GPT-3 (V=50257, d=12288) that is 617,558,016, or about 618M.


q-17-04 — Pre-LN gradient (single)

In y = x + f(LN(x)), the gradient ∂y/∂x is:

  • Exactly I
  • I + ∂(f∘LN)/∂x
  • ∂LN/∂x · (I + ∂f/∂x)
  • ∂LN/∂x · ∂f/∂x

The residual sum contributes I (the identity term) directly, and Pre-LN keeps the LN inside the branch f, so ∂y/∂x = I + ∂(f∘LN)/∂x. The identity term gives gradients a unit-gain path through the block, whatever the branch does.
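Written out with the chain rule, the derivation looks as follows. The intermediate symbol u and the per-layer Jacobian J_l are introduced here only for illustration; the stacked form assumes every block has the same Pre-LN structure.

```latex
\begin{aligned}
u &= \mathrm{LN}(x), \qquad y = x + f(u) \\[4pt]
\frac{\partial y}{\partial x}
  &= I + \frac{\partial f}{\partial u}\,\frac{\partial\,\mathrm{LN}}{\partial x} \\[4pt]
\text{Stacked blocks } x_{l+1} &= x_l + f_l(\mathrm{LN}(x_l)),
  \qquad J_l = \frac{\partial (f_l \circ \mathrm{LN})}{\partial x_l} \\[4pt]
\frac{\partial x_L}{\partial x_0}
  &= (I + J_{L-1})\cdots(I + J_1)(I + J_0)
   = I + \sum_{l=0}^{L-1} J_l + \text{(higher-order products)}
\end{aligned}
```

Expanding the product always leaves a bare I, which is why the gradient from the last layer reaches the first one without passing through any LN or f Jacobian.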


q-17-05 — Compute the §A13 mini-GPT parameter count (single)

V=512, d=64, L=2, d_ff=256, RoPE, tied LM head, no biases.

  • ~32K
  • ~65K
  • ~131K
  • ~262K

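32,768 + 2·49,280 + 64 = 131,392 ≈ 131K. The first term is the tied embedding, V·d = 512·64 = 32,768; RoPE adds no learned parameters. Each layer has 4·d² = 16,384 (attention) plus 2·d·d_ff = 32,768 (FFN) plus 128, which is consistent with two gain-only norm vectors of size d. The final 64 is consistent with one more norm of size d. See Phase 17 §04.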
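The arithmetic can be reproduced with a short function. This is a sketch, not the course's reference implementation: `mini_gpt_params` is a hypothetical name, and the norm terms (two gain-only vectors per block, one before the LM head) are an assumption inferred from the 49,280 and 64 figures above.

```python
def mini_gpt_params(V, d, L, d_ff, tied=True):
    """Parameter count for a bias-free, RoPE-based GPT (RoPE has no weights)."""
    embed = V * d                    # token embedding table
    attn = 4 * d * d                 # Q, K, V, O projections
    ffn = 2 * d * d_ff               # up- and down-projection
    norms = 2 * d                    # assumption: two gain-only norms per block
    per_layer = attn + ffn + norms
    final_norm = d                   # assumption: one gain-only norm before the head
    lm_head = 0 if tied else V * d   # a tied head reuses the embedding matrix
    return embed + L * per_layer + final_norm + lm_head


print(mini_gpt_params(V=512, d=64, L=2, d_ff=256))              # 131392
print(mini_gpt_params(V=512, d=64, L=2, d_ff=256, tied=False))  # 164160
```

Passing `tied=False` adds another V·d = 32,768 parameters for a separate LM head, which is the saving discussed in q-17-03.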