Phase 10 — Initialization, Normalization, Residuals
Requires: 09 — Tiny MLP & Module Abstraction (minitorch)
Teaches: initialization · xavier · kaiming · layer-norm · rms-norm · residuals · pre-ln
Jump to any chapter from the phase reference index.
Chapter map
Pre-written per A12. Theory and lab statements are stable drafts; solutions are written just-in-time at phase open.
Three architectural tricks make a deep network trainable: variance-preserving initialization, layer normalization, and residual connections. Without them, everything that comes later (transformers with 30+ layers) is practically impossible to train.
Goal
Make Borja able to predict, before running an experiment, whether a given (depth × activation × init × norm × residual?) combination will diverge, vanish, or train. The phase's headline plot is three loss curves on the same axes (uniform init diverges, Xavier crawls, Kaiming trains) for the same architecture and the same data. The rest of Phase 10 adds the other two tricks, normalization and residual connections, on top of correct initialization.
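The forward-pass half of that prediction can be checked numerically before any training run. The sketch below is a minimal, standard-library-only illustration and not the lab's code: it assumes ReLU activations, square layers, and U(-1, 1) as the naive "uniform" baseline (the range is an assumption, since the phase does not state it here). The names `init_weights`, `variance`, and `variance_walk` are hypothetical.

```python
import math
import random


def init_weights(fan_in, fan_out, scheme, rng):
    """Return a fan_out x fan_in weight matrix as a list of rows."""
    if scheme == "uniform":
        # Assumed naive baseline: U(-1, 1), variance 1/3 regardless of fan-in.
        return [[rng.uniform(-1.0, 1.0) for _ in range(fan_in)] for _ in range(fan_out)]
    if scheme == "xavier":
        std = math.sqrt(2.0 / (fan_in + fan_out))
    elif scheme == "kaiming":
        std = math.sqrt(2.0 / fan_in)
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


def variance(values):
    """Population variance of a list of floats."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def variance_walk(scheme, depth=10, width=128, seed=0):
    """Push one random input through `depth` ReLU layers.

    Returns the pre-activation variance measured at every layer.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    history = []
    for _ in range(depth):
        w = init_weights(width, width, scheme, rng)
        pre = [sum(wi * xi for wi, xi in zip(row, x)) for row in w]
        history.append(variance(pre))
        x = [max(0.0, p) for p in pre]  # ReLU
    return history


if __name__ == "__main__":
    for scheme in ("uniform", "xavier", "kaiming"):
        walk = variance_walk(scheme)
        print(f"{scheme:8s} layer 1: {walk[0]:.3g}  layer 10: {walk[-1]:.3g}")
```

Under these assumptions the pre-activation variance grows by roughly width/6 per layer for the uniform baseline, roughly halves per layer for Xavier, and stays roughly constant for Kaiming. This only covers the forward signal; the loss curves themselves still have to come from the training experiment.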
Read order
1. theory/00-motivation.md — why depth needs these three tricks at all. The forward-pass variance argument.
2. theory/01-initialization.md — derive Xavier (linear/tanh) and Kaiming (ReLU) from Var(y) = Var(x). Closed-form, with example.
3. theory/02-normalization.md — BatchNorm vs LayerNorm vs RMSNorm. Why RMSNorm wins for LLMs in 2026.
4. theory/03-residuals.md — the gradient highway. Pre-LN vs Post-LN. Why Pre-LN is the modern default.
5. theory/04-putting-it-together.md — combining init + norm + residual. Common interaction failures.
6. lab/00-variance-walk.md — see activations explode under bad init; trace Var(activations) across layers.
7. lab/01-init-ablation.md — run the headline three-curve experiment.
8. lab/02-norm-ablation.md — same architecture, three norm variants.
9. lab/03-residual-depth.md — 50-layer MLP, with and without residuals, gradient norms at layer 1.
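As a preview of the Var(y) = Var(x) argument named in the initialization chapter, the standard derivation can be sketched as below. This is the textbook version, not a copy of the chapter: it assumes i.i.d. zero-mean weights that are independent of the inputs, zero-mean inputs in the linear/tanh case, and symmetric zero-mean pre-activations in the ReLU case. The symbols n_in, n_out, and y_prev are notation chosen for this sketch.

```latex
% One unit of a linear layer: y_i = \sum_{j=1}^{n_{\text{in}}} w_{ij} x_j
\begin{align}
\operatorname{Var}(y_i)
  &= n_{\text{in}} \, \operatorname{Var}(w) \, \mathbb{E}[x^2] \\[4pt]
% Linear, or tanh near zero: x is zero-mean, so E[x^2] = Var(x)
\operatorname{Var}(y) = \operatorname{Var}(x)
  &\;\Longrightarrow\; \operatorname{Var}(w) = \frac{1}{n_{\text{in}}} \\[4pt]
% Xavier: compromise between the forward (n_in) and backward (n_out) conditions
\operatorname{Var}(w)_{\text{Xavier}}
  &= \frac{2}{n_{\text{in}} + n_{\text{out}}} \\[4pt]
% ReLU: x = \max(0, y_{\text{prev}}) keeps half of the second moment
\mathbb{E}[x^2] = \tfrac{1}{2} \operatorname{Var}(y_{\text{prev}})
  &\;\Longrightarrow\; \operatorname{Var}(w)_{\text{Kaiming}} = \frac{2}{n_{\text{in}}}
\end{align}
```

For example, with n_in = n_out = 512 the Xavier standard deviation is sqrt(2/1024) ≈ 0.0442 and the Kaiming standard deviation is sqrt(2/512) = 0.0625.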
solutions/ is empty during pre-write — populated at phase open.
Definition of Done
See PHASE_10_PLAN.md §6. Briefly:
- Three init experiments demonstrate divergence vs convergence on identical (data, architecture, optimizer).
- LayerNorm and RMSNorm both train; RMSNorm is faster (committed timing comparison).
- A 50-layer MLP trains only with residuals (committed gradient-norm-at-layer-1 plot).
- src/minigrad/nn/{init,norm,residual}.py exist with passing tests.
- You can predict the loss trajectory of a new (depth, init, norm, residual) config without running it.
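For orientation, the two norms compared in the checklist above differ by one step. The sketch below is a minimal pure-Python illustration on a single feature vector, not the minigrad implementation: the function names, the list-based signature, and the default `eps` are assumptions made for this example.

```python
import math


def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """LayerNorm over one feature vector: subtract the mean, divide by the std."""
    n = len(x)
    gamma = gamma if gamma is not None else [1.0] * n
    beta = beta if beta is not None else [0.0] * n
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]


def rms_norm(x, gamma=None, eps=1e-5):
    """RMSNorm: divide by the root mean square; no mean subtraction, no bias."""
    n = len(x)
    gamma = gamma if gamma is not None else [1.0] * n
    mean_sq = sum(v * v for v in x) / n
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [g * v * inv_rms for v, g in zip(x, gamma)]


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    print("layer_norm:", layer_norm(x))
    print("rms_norm:  ", rms_norm(x))
```

RMSNorm skips the mean reduction and the bias term, so it does less work per call. Whether that turns into a measurable speedup in this codebase is what the committed timing comparison is meant to establish.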
What this phase intentionally does NOT cover
- Attention. Phase 15. Self-attention has its own normalization story — covered there.
- Transformer blocks. Phase 17. We build the components here, assemble there.
- Optimizers beyond SGD/Adam. Phase 9 introduced both; Phase 10 does not extend them.
- BatchNorm for vision / CNNs. Mentioned in theory for vocabulary; not implemented. Modern LLMs don't use it.
- Mixed-precision training and its interaction with norms. Phase 26 (quantization).
Phase 10's scope is the three structural tricks that turn a 2-layer MLP into a deep MLP that still trains. Nothing more.
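How the three tricks compose inside one layer can be sketched as follows. This is an illustrative, standard-library-only sketch and not the minigrad code: the branch is assumed to be a single Kaiming-initialized linear layer followed by ReLU, the norm is RMSNorm without a learnable gain, and all function names are hypothetical.

```python
import math
import random


def rms_norm(x, eps=1e-6):
    """RMSNorm without a learnable gain (simplification for this sketch)."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms for v in x]


def linear_relu(x, w):
    """Branch function f(x) = ReLU(W x), with W given as a list of rows."""
    return [max(0.0, sum(wi * xi for wi, xi in zip(row, x))) for row in w]


def pre_ln_block(x, w):
    """Pre-LN: x + f(norm(x)). The identity path is left untouched."""
    branch = linear_relu(rms_norm(x), w)
    return [xi + bi for xi, bi in zip(x, branch)]


def post_ln_block(x, w):
    """Post-LN: norm(x + f(x)). The identity path is renormalized every layer."""
    branch = linear_relu(x, w)
    return rms_norm([xi + bi for xi, bi in zip(x, branch)])


def kaiming(width, rng):
    """Square weight matrix with Var(w) = 2 / fan_in."""
    std = math.sqrt(2.0 / width)
    return [[rng.gauss(0.0, std) for _ in range(width)] for _ in range(width)]


def run_stack(block, depth=50, width=32, seed=0):
    """Apply `depth` blocks of the given kind to one random input vector."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        x = block(x, kaiming(width, rng))
    return x


if __name__ == "__main__":
    for name, block in (("pre-LN", pre_ln_block), ("post-LN", post_ln_block)):
        out = run_stack(block)
        rms = math.sqrt(sum(v * v for v in out) / len(out))
        print(f"{name:8s} output RMS after 50 blocks: {rms:.3g}")
```

In the Pre-LN form the input reaches the output through a pure addition, which is the "gradient highway" described in the residuals chapter. The residual stream is never renormalized inside the block, so its magnitude can grow with depth, which is why Pre-LN stacks are usually followed by one final norm.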
Further reading
Optional — enrichment, not required to pass the phase.
- 📄 Deep Residual Learning for Image Recognition — He et al. · 2015. Residual connections, the idea that unlocked depth.
- 📄 Layer Normalization — Ba, Kiros, Hinton · 2016. The norm the transformer actually uses.
- 📄 Root Mean Square Layer Normalization — Zhang & Sennrich · 2019. The modern, cheaper default (RMSNorm).