
Phase 17 — Quiz (human-readable mirror)

Human-readable mirror of the canonical data/quizzes/phase-17-mini-gpt.yaml.

Source: data/quizzes/phase-17-mini-gpt.yaml.

q-17-01 — Per-layer parameter count (single)

Consider a Pre-LN transformer block with d_model = d, d_ff = 4d, and no biases. Approximately how many parameters does each layer have (ignoring norms)?

Attention: 4·d² (Q, K, V, O). FFN: 2·d·d_ff = 8·d². Total ≈ 12·d² per layer, a rule of thumb that matches GPT-3 and LLaMA within 5%.
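The rule of thumb can be checked with a few lines of arithmetic. This is a minimal sketch: the function name `approx_params_per_layer` is hypothetical, and the GPT-3 figures used in the example (96 layers, d_model = 12288) are taken from its published configuration, not from this quiz.

```python
from typing import Optional


def approx_params_per_layer(d_model: int, d_ff: Optional[int] = None) -> int:
    """Approximate weight count of one transformer block, ignoring norms and biases."""
    if d_ff is None:
        d_ff = 4 * d_model              # the d_ff = 4d convention from the question
    attention = 4 * d_model * d_model   # Q, K, V and O projections, each d x d
    ffn = 2 * d_model * d_ff            # up-projection and down-projection
    return attention + ffn


d = 12288
per_layer = approx_params_per_layer(d)
print(per_layer == 12 * d * d)  # True
print(96 * per_layer)           # 173946175488
```

Stacking 96 such layers gives roughly 174B weights, which is close to GPT-3's nominal 175B once embeddings are added.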

Mini-GPT trains to loss 1e-4 in 10 steps, well below H(corpus) ≈ 1.5. Sampling produces " ...". Most likely cause?

Loss below H(corpus) is mathematically impossible without leakage: the model is peeking at the next token during training. That shortcut is unavailable at inference, so sampling degenerates (BoS collapse).
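One common way for this kind of leakage to arise is a missing or broken causal mask in attention. The answer above does not name the exact bug, so treat that as an assumption. The sketch below uses a hypothetical helper, `causal_softmax`, written in pure Python, to show what a correct mask guarantees: position i puts zero weight on every position j > i.

```python
import math


def causal_softmax(scores):
    """Row-wise softmax over a square score matrix with future positions masked out."""
    n = len(scores)
    weights = []
    for i in range(n):
        visible = scores[i][: i + 1]          # position i may only see positions 0..i
        peak = max(visible)                   # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Future positions receive exactly zero attention weight.
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights


# Large scores above the diagonal would dominate if the mask were missing.
scores = [[0.0, 5.0, 5.0],
          [1.0, 0.0, 5.0],
          [2.0, 1.0, 0.0]]
w = causal_softmax(scores)
print(w[0])  # [1.0, 0.0, 0.0]
```

A quick sanity test along these lines (upper triangle all zero, rows summing to 1) is a cheap guard to run before trusting a suspiciously low training loss.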

Expected to contain: V.

Untied: 2·V·d. Tied: V·d. Tying saves V·d parameters; for GPT-3 (V = 50257, d = 12288) that is 617,558,016, or roughly 620M parameters.
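The saving is a single multiplication, sketched here with a hypothetical helper `lm_head_savings`. It assumes the input embedding and the LM head are both V × d matrices with no bias.

```python
def lm_head_savings(vocab_size: int, d_model: int) -> int:
    """Parameters saved by sharing one V x d matrix between embedding and LM head."""
    untied = 2 * vocab_size * d_model   # separate embedding and output matrices
    tied = vocab_size * d_model         # one matrix used for both
    return untied - tied


print(lm_head_savings(50257, 12288))  # 617558016
```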

In y = x + f(LN(x)), the gradient ∂y/∂x is:

∂y/∂x = I + ∂f(LN(x))/∂x. The residual sum gives I (the identity term) directly; Pre-LN keeps the LN inside f, so it only affects the second term. The identity term guarantees a unit-gain gradient highway.
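Written out with the chain rule, and then stacked over layers, the identity term looks as follows. This is a sketch in which J_f and J_LN denote the Jacobians of f and LN, and the layer index ℓ and depth L are notation introduced here for illustration.

```latex
% Single Pre-LN block: y = x + f(LN(x))
\frac{\partial y}{\partial x}
  = I + J_f\big(\mathrm{LN}(x)\big)\, J_{\mathrm{LN}}(x)

% L stacked blocks: x_{\ell+1} = x_\ell + f_\ell(\mathrm{LN}(x_\ell)),
% with A_\ell = J_{f_\ell}\big(\mathrm{LN}(x_\ell)\big)\, J_{\mathrm{LN}}(x_\ell)
\frac{\partial x_L}{\partial x_0}
  = (I + A_{L-1}) \cdots (I + A_1)(I + A_0)
  = I + \sum_{\ell=0}^{L-1} A_\ell + \text{(higher-order products)}
```

The leading I survives the product regardless of depth, which is what "unit-gain gradient highway" refers to.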

V=512, d=64, L=2, d_ff=256, RoPE, tied LM head, no biases.

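32,768 + 2·49,280 + 64 = 131,392. The terms are: embedding V·d = 32,768 (shared with the tied LM head); per layer 4·d² + 2·d·d_ff + 2·d = 16,384 + 32,768 + 128 = 49,280, where 2·d covers the two norm gains; final norm d = 64. RoPE adds no learned parameters. See Phase 17 §04.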
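The same count can be reproduced programmatically. This is a minimal sketch with a hypothetical function `mini_gpt_params`; it assumes each norm carries only a gain vector of size d (which is what the 49,280 per-layer figure implies) and that RoPE contributes no learned weights.

```python
def mini_gpt_params(vocab: int, d_model: int, n_layers: int, d_ff: int) -> int:
    """Exact weight count for a bias-free Pre-LN GPT with RoPE and a tied LM head."""
    embedding = vocab * d_model            # also serves as the LM head (tied)
    attention = 4 * d_model * d_model      # Q, K, V, O projections
    ffn = 2 * d_model * d_ff               # up-projection and down-projection
    norms = 2 * d_model                    # two gain-only norms per block (assumption)
    per_layer = attention + ffn + norms
    final_norm = d_model                   # gain-only norm before the LM head
    return embedding + n_layers * per_layer + final_norm


print(mini_gpt_params(vocab=512, d_model=64, n_layers=2, d_ff=256))  # 131392
```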