Phase 10 — Quiz (human-readable mirror)

Human-readable mirror of the canonical file data/quizzes/phase-10-init-norm-residuals.yaml. The portal (Phase 41) consumes the YAML; this .md is for quick review.

Source: data/quizzes/phase-10-init-norm-residuals.yaml.


q-10-01 — Why is bf16 preferred over fp16 for training? (single)

Both bf16 and fp16 use 16 total bits. What property of bf16 makes it the default for modern LLM training while fp16 requires loss scaling?

  • bf16 has a longer mantissa, giving more precise gradient values
  • bf16 has fp32's 8-bit exponent, giving fp32-equivalent dynamic range
  • bf16 hardware is faster on every accelerator since 2018
  • bf16 supports subnormals while fp16 does not

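bf16 keeps fp32's 8-bit exponent and pays for it with mantissa precision (7 bits versus fp16's 10), so it covers fp32-equivalent dynamic range. Gradients need range more than precision. fp16's 5-bit exponent lets small gradients underflow, which is what forces loss scaling.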
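The range argument can be checked numerically. The sketch below casts one tiny and one large value to fp16 and to an emulated bf16. NumPy has no native bf16 type, so `to_bf16` is a hypothetical helper that truncates a float32 to its top 16 bits (real hardware usually rounds instead), and the two sample values are arbitrary.

```python
import numpy as np

def to_bf16(x):
    """Emulate bf16 by keeping only the top 16 bits of each float32
    (1 sign bit, 8 exponent bits, 7 mantissa bits). Truncation, not rounding."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

# Illustrative values: a tiny gradient and a large activation.
values = np.array([1e-8, 70000.0], dtype=np.float32)

with np.errstate(over="ignore"):  # silence the expected fp16 overflow warning
    as_fp16 = values.astype(np.float16)
as_bf16 = to_bf16(values)

print("fp16:", as_fp16)  # tiny value flushes to 0, large value overflows to inf
print("bf16:", as_bf16)  # both stay finite and nonzero, at reduced precision
```

Loss scaling addresses the fp16 underflow by multiplying the loss by a large constant before the backward pass; bf16 avoids the need because its exponent range already matches fp32.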


q-10-02 — RMSNorm vs LayerNorm — what is dropped? (multi)

  • The mean subtraction
  • The variance computation
  • The learnable bias β
  • The learnable gain γ

RMSNorm divides by RMS = sqrt(mean(x²)) instead of the standard deviation, so the per-sample mean is neither computed nor subtracted, and β is dropped. γ stays. Memory traffic is roughly 60% of LayerNorm's, with no quality loss.
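A minimal NumPy sketch of the two normalizations side by side makes the dropped pieces visible. The function names, the `eps` default, and the tensor shapes are illustrative assumptions, not taken from the quiz source.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    """LayerNorm over the last axis: subtract the mean, divide by std, apply gamma and beta."""
    mean = x.mean(axis=-1, keepdims=True)
    var = ((x - mean) ** 2).mean(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gamma + beta

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm over the last axis: no mean subtraction and no beta, gamma is kept."""
    rms2 = (x ** 2).mean(axis=-1, keepdims=True)
    return x / np.sqrt(rms2 + eps) * gamma

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
gamma = np.ones(8)

# On inputs that already have zero mean, the two coincide when beta = 0.
x_centered = x - x.mean(axis=-1, keepdims=True)
same = np.allclose(layer_norm(x_centered, gamma, np.zeros(8)),
                   rms_norm(x_centered, gamma))
print(same)
```

The comparison on centered inputs shows that the only functional differences are the mean subtraction and β, which matches the two correct options above.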


q-10-03 — Why does Pre-LN allow training without warmup? (free)

Expected to contain: identity.

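Pre-LN keeps the residual outside the norm: y = x + F(LN(x)), where F is the sublayer (attention or MLP), so ∂y/∂x = I + ∂F(LN(x))/∂x. The identity path has guaranteed unit gain at every layer. Post-LN computes y = LN(x + F(x)), so the residual path is multiplied by the LayerNorm Jacobian at every layer, and that product compounds badly with depth.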
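The Jacobian structure can be checked with finite differences on a toy block. This is a sketch under stated assumptions: the sublayer is a hypothetical linear map F(u) = W @ u, LayerNorm has no learnable parameters, and the vector size and scale of W are arbitrary.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Plain LayerNorm on a 1-D vector, without learnable parameters."""
    mu = x.mean()
    return (x - mu) / np.sqrt(((x - mu) ** 2).mean() + eps)

def jacobian(f, x, h=1e-5):
    """Central finite-difference Jacobian of f at x (column j is df/dx_j)."""
    n = x.size
    J = np.zeros((n, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = h
        J[:, j] = (f(x + step) - f(x - step)) / (2 * h)
    return J

rng = np.random.default_rng(0)
n = 6
x = rng.normal(size=n)
W = 0.1 * rng.normal(size=(n, n))  # hypothetical sublayer weights

def branch(v):
    return W @ layer_norm(v)       # F(LN(v))

def pre_ln(v):
    return v + branch(v)           # residual outside the norm

def post_ln(v):
    return layer_norm(v + W @ v)   # residual inside the norm

J_pre = jacobian(pre_ln, x)
J_post = jacobian(post_ln, x)

# Pre-LN: removing the identity leaves exactly the sublayer's Jacobian.
identity_path_intact = np.allclose(J_pre - np.eye(n), jacobian(branch, x), atol=1e-6)
print(identity_path_intact)

# Post-LN: the whole Jacobian passes through LayerNorm, which is blind to a
# constant shift of its input, so the smallest singular value is near zero.
min_sv_post = np.linalg.svd(J_post, compute_uv=False).min()
print(min_sv_post)
```

In the Pre-LN case the identity term appears additively in every layer's Jacobian, whereas in the Post-LN case every layer contributes a multiplicative LayerNorm factor with no separate identity term.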


q-10-04 — Find the bug: ε placement in RMSNorm (single)

A learner writes x / (np.sqrt(rms2) + eps) for the RMSNorm denominator. Which input exposes the bug?

  • An input with very large magnitude (rms2 → ∞)
  • An input that is all zeros (rms2 → 0)
  • An input that is exactly the identity matrix
  • An input that contains NaN values

When x is all zeros, sqrt(rms2) → 0, so the buggy denominator collapses to eps (tiny). The correct form, x / np.sqrt(rms2 + eps), gives sqrt(eps), the intended variance floor.
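To see the size of the gap, the sketch below evaluates both denominators at the all-zeros input and then normalizes a near-zero input. It assumes the correct form places eps inside the square root, as the sqrt(eps) floor implies; `eps = 1e-6`, the function names, and the noise magnitude are illustrative choices.

```python
import numpy as np

eps = 1e-6  # illustrative value

def rms_norm_buggy(x):
    rms2 = np.mean(x ** 2)
    return x / (np.sqrt(rms2) + eps)  # eps added outside the sqrt

def rms_norm_correct(x):
    rms2 = np.mean(x ** 2)
    return x / np.sqrt(rms2 + eps)    # eps added to the mean square

# Denominators at the all-zeros input.
zeros = np.zeros(4)
buggy_floor = np.sqrt(np.mean(zeros ** 2)) + eps    # equals eps
correct_floor = np.sqrt(np.mean(zeros ** 2) + eps)  # equals sqrt(eps), about 1e-3
print(buggy_floor, correct_floor)

# A near-zero input (for example numerical noise) shows the consequence.
noise = np.full(4, 1e-6)
print(rms_norm_buggy(noise))    # entries near 0.5: noise amplified to O(1)
print(rms_norm_correct(noise))  # entries near 1e-3: noise stays small
```

With these values the buggy floor is a thousand times smaller than the intended one, so inputs at noise level are amplified far more than the eps setting suggests.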


q-10-05 — Kaiming gain for GELU (single)

  • Exactly 1 (linear)
  • Exactly sqrt(2)
  • Approximately sqrt(2) — close enough; GELU and ReLU agree on the right tail
  • Approximately sqrt(2/pi)

GELU and ReLU both behave like x for large positive x and like 0 for large negative x. The variance-preserving gain is dominated by the right tail, so sqrt(2) is a fine first-order choice.
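A rough Monte Carlo check of this claim is sketched below. It assumes the gain is defined as 1 / sqrt(E[act(z)²]) for a standard normal pre-activation z, which is the definition under which ReLU yields sqrt(2). `gelu_tanh` uses the common tanh approximation of GELU to avoid a SciPy dependency; the sample count and seed are arbitrary.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def gelu_tanh(z):
    """Common tanh approximation of GELU."""
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z ** 3)))

def estimate_gain(act, n_samples=1_000_000, seed=0):
    """Monte Carlo estimate of 1 / sqrt(E[act(z)^2]) for z ~ N(0, 1)."""
    z = np.random.default_rng(seed).standard_normal(n_samples)
    return 1.0 / np.sqrt(np.mean(act(z) ** 2))

relu_gain = estimate_gain(relu)
gelu_gain = estimate_gain(gelu_tanh)
print(relu_gain, gelu_gain, np.sqrt(2.0))
```

Under this definition the GELU estimate should land within roughly 10% of sqrt(2), which is consistent with treating sqrt(2) as a first-order choice and not an exact value.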