
Phase 10 — Initialization, Normalization, Residuals

Requires: 09 — Tiny MLP & Module Abstraction (minitorch)

Teaches: initialization · xavier · kaiming · layer-norm · rms-norm · residuals · pre-ln

Chapter map

Pre-written per A12. Theory and lab statements are stable drafts; solutions are written just-in-time at phase open.

Three architectural tricks make a deep network trainable: variance-preserving initialization, layer normalization, and residual connections. Without them, everything that comes later (transformers with 30+ layers) is practically impossible to train.


Goal

Make Borja able to predict, before running an experiment, whether a given (depth × activation × init × norm × residual?) combination will diverge, vanish, or train. The phase's headline plot is three loss curves on the same axes — uniform init diverges, Xavier crawls, Kaiming trains — for the same architecture and same data. After that, the rest of Phase 10 is just adding the next two tricks (norm, residual) on top of correct init.
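The prediction skill described above rests on one quantity: how the second moment of the activations changes per layer. For a ReLU layer with fan-in n and zero-mean weights, it is multiplied by roughly n · Var(w) / 2. The sketch below is a forward-pass-only illustration in NumPy, not the phase's actual experiment. The function name, the width and depth, and the choice of U(-1, 1) as the "uniform" init are all assumptions made for illustration.

```python
import numpy as np

def activation_second_moments(init, depth=20, width=256, batch=1024, seed=0):
    """Return mean(x**2) after each ReLU layer of a random square MLP.

    `init` is one of "uniform", "xavier", "kaiming". The U(-1, 1) range used
    for "uniform" is an assumption; the phase may use a different range.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((batch, width))
    moments = []
    for _ in range(depth):
        if init == "uniform":
            # Var(w) = 1/3, so the per-layer factor is about width / 6.
            w = rng.uniform(-1.0, 1.0, size=(width, width))
        elif init == "xavier":
            # Var(w) = 2 / (fan_in + fan_out); per-layer factor about 1/2 under ReLU.
            w = rng.standard_normal((width, width)) * np.sqrt(2.0 / (width + width))
        elif init == "kaiming":
            # Var(w) = 2 / fan_in; per-layer factor about 1 under ReLU.
            w = rng.standard_normal((width, width)) * np.sqrt(2.0 / width)
        else:
            raise ValueError(f"unknown init: {init}")
        x = np.maximum(x @ w, 0.0)  # linear layer followed by ReLU
        moments.append(float(np.mean(x ** 2)))
    return moments

for name in ("uniform", "xavier", "kaiming"):
    print(name, activation_second_moments(name)[-1])
```

Under these assumptions the uniform run should grow by many orders of magnitude, the Xavier run should shrink by about half per layer, and the Kaiming run should stay near its starting scale.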

Read order

  1. theory/00-motivation.md — why depth needs these three tricks at all. The forward-pass variance argument.
  2. theory/01-initialization.md — derive Xavier (linear/tanh) and Kaiming (ReLU) from Var(y) = Var(x). Closed-form, with example.
  3. theory/02-normalization.md — BatchNorm vs LayerNorm vs RMSNorm. Why RMSNorm wins for LLMs in 2026.
  4. theory/03-residuals.md — the gradient highway. Pre-LN vs Post-LN. Why Pre-LN is the modern default.
  5. theory/04-putting-it-together.md — combining init + norm + residual. Common interaction failures.
  6. lab/00-variance-walk.md — see activations explode under bad init; trace Var(activations) across layers.
  7. lab/01-init-ablation.md — run the headline three-curve experiment.
  8. lab/02-norm-ablation.md — same architecture, three norm variants.
  9. lab/03-residual-depth.md — 50-layer MLP, with and without residuals, gradient norms at layer 1.
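As a preview of the derivation that theory/01-initialization.md is described as covering, the Var(y) = Var(x) condition can be sketched as follows. This is the standard textbook argument, not a copy of the chapter. The symbols n_in and n_out (fan-in and fan-out) and the independence assumptions in the comments are introduced here.

```latex
% Assumptions: weights w_{ji} are i.i.d., zero-mean, and independent of the inputs x_i.
y_j = \sum_{i=1}^{n_{\text{in}}} w_{ji}\, x_i
\quad\Longrightarrow\quad
\operatorname{Var}(y_j) = n_{\text{in}} \operatorname{Var}(w)\, \mathbb{E}[x_i^2]

% Linear layers, or tanh near the origin: inputs are zero-mean, so E[x^2] = Var(x).
\operatorname{Var}(y) = \operatorname{Var}(x)
\iff \operatorname{Var}(w) = \frac{1}{n_{\text{in}}}

% Xavier balances the forward (fan-in) and backward (fan-out) conditions:
\operatorname{Var}(w) = \frac{2}{n_{\text{in}} + n_{\text{out}}}

% ReLU: x = \max(0, z) with z symmetric about 0, so E[x^2] = Var(z) / 2.
\operatorname{Var}(w) = \frac{2}{n_{\text{in}}}
```

For example, with n_in = n_out = 256 the Xavier standard deviation is sqrt(1/256) = 0.0625, while the Kaiming standard deviation is sqrt(2/256) ≈ 0.0884.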

solutions/ is empty during pre-write — populated at phase open.

Definition of Done

See PHASE_10_PLAN.md §6. Briefly:

  • Three init experiments demonstrate divergence vs convergence on identical (data, architecture, optimizer).
  • LayerNorm and RMSNorm both train; RMSNorm is faster (committed timing comparison).
  • A 50-layer MLP trains only with residuals (committed gradient-norm-at-layer-1 plot).
  • src/minigrad/nn/{init,norm,residual}.py exist with passing tests.
  • You can predict the loss trajectory of a new (depth, init, norm, residual) config without running it.
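For orientation on the LayerNorm and RMSNorm item above, here is a minimal NumPy sketch of the two forward passes. It makes no claim about what src/minigrad/nn/norm.py looks like. The function names and the eps default are assumptions, and the learnable gain and bias are left out to keep the comparison bare.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learnable gain or bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def rms_norm(x, eps=1e-5):
    """Rescale each row to unit root-mean-square; the mean is not subtracted."""
    mean_square = np.mean(x ** 2, axis=-1, keepdims=True)
    return x / np.sqrt(mean_square + eps)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16)) * 3.0 + 2.0  # rows with non-zero mean
print(layer_norm(x).mean(axis=-1))            # close to 0 for every row
print(np.mean(rms_norm(x) ** 2, axis=-1))     # close to 1 for every row
```

RMSNorm needs one reduction per row instead of two, which is the usual argument for it being cheaper. Whether that shows up in practice is what the committed timing comparison is meant to settle.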

What this phase intentionally does NOT cover

  • Attention. Phase 15. Self-attention has its own normalization story — covered there.
  • Transformer blocks. Phase 17. We build the components here, assemble there.
  • Optimizers beyond SGD/Adam. Phase 9 introduced both; Phase 10 does not extend them.
  • BatchNorm for vision / CNNs. Mentioned in theory for vocabulary; not implemented. Modern LLMs don't use it.
  • Mixed-precision training and its interaction with norms. Phase 26 (quantization).

Phase 10's scope is the three structural tricks that turn a 2-layer MLP into a deep MLP that still trains. Nothing more.
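To make "a deep MLP that still trains" concrete, the sketch below compares the gradient reaching the first layer of a 50-layer tanh MLP with and without skip connections, using a hand-written backward pass in NumPy. It is an illustration under stated assumptions, not the lab's reference solution. The function name, the toy loss, and the deliberately small weight scale (0.5 / sqrt(width), chosen so the vanishing effect is easy to see) are all hypothetical choices.

```python
import numpy as np

def first_layer_grad_norm(residual, depth=50, width=64, batch=32,
                          weight_scale=0.5, seed=0):
    """Frobenius norm of dLoss/dW for the first layer of a tanh MLP.

    Toy loss (an assumption): mean over the batch of the sum of final activations.
    """
    rng = np.random.default_rng(seed)
    weights = [rng.standard_normal((width, width)) * weight_scale / np.sqrt(width)
               for _ in range(depth)]
    h = rng.standard_normal((batch, width))

    # Forward pass, caching each layer's input and tanh output.
    inputs, acts = [], []
    for w in weights:
        a = np.tanh(h @ w)
        inputs.append(h)
        acts.append(a)
        h = h + a if residual else a

    # Backward pass; the toy loss gives dLoss/dh = 1 / batch at the output.
    grad_h = np.full_like(h, 1.0 / batch)
    grad_w = None
    for w, h_in, a in zip(reversed(weights), reversed(inputs), reversed(acts)):
        grad_z = grad_h * (1.0 - a ** 2)   # derivative of tanh
        grad_w = h_in.T @ grad_z           # gradient for this layer's weights
        grad_branch = grad_z @ w.T         # gradient through the linear map
        # The skip connection adds an identity path for the gradient.
        grad_h = grad_h + grad_branch if residual else grad_branch
    return float(np.linalg.norm(grad_w))   # the last one computed is layer 1

print("plain:   ", first_layer_grad_norm(residual=False))
print("residual:", first_layer_grad_norm(residual=True))
```

With these settings the plain network's first-layer gradient should be many orders of magnitude smaller than the residual network's, because every plain layer multiplies the gradient by a factor below one while the skip path passes it through unchanged.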

Further reading

Optional — enrichment, not required to pass the phase.