00: Why depth is hard, and the three tricks that fix it
Stacking 20 layers in a naively designed network does not work: either the activations explode, or the gradient vanishes, or the internal distributions shift so much that each layer trains against a different input. Three tricks (init, norm, residual) neutralized those three failure modes and are the reason modern deep learning works.
The thing the textbook plot doesn't show
Universal approximation says: a 2-layer MLP with enough hidden units can approximate any continuous function on a compact domain to arbitrary accuracy. From that theorem, "deep learning" sounds redundant: why not just use very wide 2-layer nets?
The answer is mostly empirical rather than theoretical. At a fixed parameter count, wide, shallow nets are less sample-efficient and harder to optimize than deep, narrow nets. The deep nets generalize better, train with smaller batches, and produce more compressed, reusable representations. But, and this is the catch, naive deep nets don't train at all.
If you take the 2-layer MLP from Phase 9 and stack it 20 layers deep with random Gaussian init and no normalization, three things go wrong, often simultaneously:
- Forward-pass blow-up. Each layer multiplies activations by a matrix; with bad init, the activation norm doubles every layer. Layer 10 saturates `tanh` everywhere or overflows for `ReLU`. Layer 20 is meaningless noise.
- Backward-pass vanishing. Gradients are products of Jacobians, layer by layer. If each Jacobian has norm < 1, the layer-1 gradient is tiny after 20 multiplications. The first layers never update.
- Internal covariate shift. Even when forward + backward are healthy at initialization, training one layer changes the distribution that the next layer sees. The next layer was tuned to the previous distribution; now it's tuning against a moving target. Slow.
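The first failure mode is easy to see numerically. The sketch below is a minimal illustration, not the lab code: the helper name `variance_walk`, the width of 256, and the two weight scales are assumptions chosen for the demo. It pushes a random batch through 20 randomly initialized ReLU layers and records the activation variance after each one. It covers the forward pass only; the backward pass is a product of the same kind of matrices and shrinks or grows for the same reason.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def variance_walk(depth=20, width=256, weight_std=1.0, activation=relu, seed=0):
    """Return the activation variance after each layer of a random MLP (no training)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((512, width))  # batch of 512 unit-variance inputs
    variances = []
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * weight_std  # plain Gaussian init
        x = activation(x @ W)
        variances.append(float(x.var()))
    return variances

too_big = variance_walk(weight_std=1.0)     # weights too large: variance explodes
too_small = variance_walk(weight_std=0.01)  # weights too small: signal collapses

for layer, (big, small) in enumerate(zip(too_big, too_small), start=1):
    print(f"layer {layer:2d}  var(std=1.0)={big:.3e}  var(std=0.01)={small:.3e}")
```

Both columns change by a roughly constant factor per layer, which is the signature of a product of matrices whose scale is not matched to the layer width.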
The three structural tricks of Phase 10 are the three fixes:
| Failure mode | Fix |
|---|---|
| Forward-pass blow-up | Variance-preserving initialization — choose Var(W) = 1/n_in so Var(y) = Var(x). |
| Backward-pass vanishing | Residual connections: y = x + f(x) gives the block Jacobian I + ∂f/∂x, so the identity path always passes the gradient through unattenuated. |
| Internal covariate shift | Normalization — re-center / re-scale activations to a known distribution at each layer. |
Each trick is, in isolation, a small idea. The combination — Pre-LN residual blocks with Kaiming init — is what makes a 30-layer transformer trainable on a fixed budget. We build them in that order.
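To make the combination concrete, here is a minimal NumPy sketch of a Pre-LN residual block, stacked 30 deep at initialization. It is an illustration under assumptions, not the phase's reference implementation: the sub-network is a single linear layer followed by ReLU, the normalization is RMSNorm without a learnable gain, and the names `rms_norm` and `pre_ln_block` are hypothetical.

```python
import numpy as np

def rms_norm(x, eps=1e-5):
    """Divide each example by the RMS of its features (no learnable gain here)."""
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)

def relu(z):
    return np.maximum(z, 0.0)

def pre_ln_block(x, W):
    """Pre-LN residual block: normalize, transform, then add back to the input."""
    return x + relu(rms_norm(x) @ W)

rng = np.random.default_rng(0)
width, depth = 256, 30
x = rng.standard_normal((128, width))

for layer in range(1, depth + 1):
    # Kaiming-style scale for ReLU: Var(W) = 2 / n_in
    W = rng.standard_normal((width, width)) * np.sqrt(2.0 / width)
    x = pre_ln_block(x, W)

final_var = float(x.var())
print(f"activation variance after {depth} blocks: {final_var:.2f}")
```

Because each branch sees a normalized input, every block adds a bounded contribution to the residual stream, so the activation scale grows additively with depth rather than multiplicatively.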
What "variance-preserving" actually says
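Take a layer y = W x with W ∈ R^{n_out × n_in}. Assume the W_ij are i.i.d. with mean zero and variance σ², and x is a mean-zero vector with Var(x_j) = v (same for all j), independent of W. Then by linearity of variance (the terms W_ij x_j are uncorrelated) and independence:

$$
\mathrm{Var}(y_i) = \sum_{j=1}^{n_{\text{in}}} \mathrm{Var}(W_{ij}\, x_j) = \sum_{j=1}^{n_{\text{in}}} \sigma^2 \, \mathbb{E}[x_j^2] = n_{\text{in}} \, \sigma^2 \, v
$$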
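If we want Var(y_i) = v (preserve variance across the layer), we need σ² = 1 / n_in. That's the fan-in form of Xavier init (also known as LeCun init; the original Glorot formula, 2 / (n_in + n_out), averages the forward and backward conditions), suited to linear or tanh activations. For ReLU, half the pre-activations are zeroed on average → the second moment passed to the next layer halves → we need σ² = 2 / n_in. That's Kaiming init. The whole story of "initialization theory" for MLPs is variations on this 3-line argument.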
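The argument above can be checked numerically in a few lines. This is a sketch under the same assumptions as the derivation (mean-zero, unit-variance Gaussian inputs and i.i.d. Gaussian weights); the layer sizes and variable names are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, batch = 512, 512, 2048

def relu(z):
    return np.maximum(z, 0.0)

x = rng.standard_normal((batch, n_in))  # mean zero, v = 1

# Linear layer with sigma^2 = 1 / n_in: output variance should stay near v = 1.
W_lin = rng.standard_normal((n_in, n_out)) * np.sqrt(1.0 / n_in)
var_linear = float((x @ W_lin).var())

# ReLU zeroes about half the pre-activations, halving the second moment...
h = relu(x @ W_lin)
second_moment_h = float((h ** 2).mean())

# ...so the layer after a ReLU needs sigma^2 = 2 / n_in to get back to 1.
W_relu = rng.standard_normal((n_out, n_out)) * np.sqrt(2.0 / n_out)
var_after_relu = float((h @ W_relu).var())

print(f"Var(y), linear, sigma^2 = 1/n_in : {var_linear:.3f}")
print(f"E[relu(y)^2]                     : {second_moment_h:.3f}")
print(f"Var(next y), sigma^2 = 2/n_in    : {var_after_relu:.3f}")
```

The first and third numbers should sit near 1 and the second near 0.5, up to sampling noise.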
Theory page 01-initialization.md walks the full derivation with the assumptions made explicit. Borja should be able to redo it on a whiteboard.
What "normalization" actually says
A normalization layer takes the activations at some point in the network and rescales them so the next layer always sees a consistent distribution. There are choices:
- BatchNorm. Compute mean and variance over the batch dimension. Each unit is normalized using the batch's statistics. Train-time stats ≠ inference-time stats; running averages bridge them. Standard in vision; modern LLMs don't use it.
- LayerNorm. Compute mean and variance over the feature dimension, per example. No train/inference divergence. Standard in transformers, but comparatively expensive (one mean subtraction + one variance computation + one scale).
- RMSNorm. Drop the mean subtraction. Just divide by RMS (`sqrt(mean(x²))`). Half the FLOPs, half the memory traffic, comparable training stability. Standard in modern LLMs (Llama, Mistral, … 2026).
The deeper question — why does removing the mean not hurt — is empirical. RMSNorm papers argue it's the re-scaling that mattered all along; centering was incidental. We discuss the argument in 02-normalization.md without claiming a proof.
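The three options differ only in which axis the statistics are taken over and whether the mean is removed. The sketch below shows the train-time forward computation in NumPy. It deliberately omits the learnable gain and bias and BatchNorm's running averages, and the `EPS` value is an assumed, typical stability constant.

```python
import numpy as np

EPS = 1e-5  # assumed stability constant, added inside the square root

def batch_norm_train(x):
    """Normalize each feature with statistics taken over the batch axis (axis 0)."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + EPS)

def layer_norm(x):
    """Normalize each example with statistics taken over its own features."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + EPS)

def rms_norm(x):
    """Rescale each example by the RMS of its features; the mean is left alone."""
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + EPS)
    return x / rms

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 64)) * 3.0 + 5.0  # batch of 32, 64 features, shifted and scaled

bn, ln, rn = batch_norm_train(x), layer_norm(x), rms_norm(x)
print("BatchNorm per-feature mean ~ 0:", np.allclose(bn.mean(axis=0), 0.0, atol=1e-6))
print("LayerNorm per-example mean ~ 0:", np.allclose(ln.mean(axis=-1), 0.0, atol=1e-6))
print("RMSNorm per-example mean square ~ 1:", np.allclose((rn ** 2).mean(axis=-1), 1.0, atol=1e-3))
print("RMSNorm keeps a nonzero mean:", float(rn.mean()))
```

Note that `rms_norm` fixes the scale but not the center: the output still carries the input's offset, which is exactly the property the RMSNorm argument says is harmless.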
What "residual" actually says
A residual block is y = x + f(x), where f is some sub-network (e.g., a linear layer + activation). Two things follow:
- Forward identity in the limit. If `f` is initialized to output zero (or near zero; see "Pre-LN init `γ = 0` trick" in `03-residuals.md`), the block computes `y ≈ x`. Stacking 100 of them produces a network that, at initialization, is just an identity function. Then training improves the identity by adding small corrections. The optimization landscape is far easier than a network that starts as random noise.
- Backward gradient highway. `∂y/∂x = I + ∂f/∂x`. The identity term means the gradient always has a route back to earlier layers, regardless of what `f` is doing. Vanishing gradients become improbable. This is the single most important architectural insight in modern deep learning, and it took until 2015 (ResNet) to land.
03-residuals.md proves the gradient highway claim and walks through Pre-LN vs Post-LN.
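A quick numerical illustration of the gradient highway, assuming random matrices as stand-ins for the per-block Jacobians ∂f/∂x: this is a simplification, since real Jacobians depend on the weights and activations. The depth of 50, the width of 64, and the spectral norm of 0.5 are arbitrary choices for the demo, and `random_block_jacobian` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth = 64, 50

def random_block_jacobian(scale=0.5):
    """Stand-in for df/dx of one block: a random matrix rescaled to spectral norm `scale`."""
    J = rng.standard_normal((width, width))
    return J * (scale / np.linalg.norm(J, 2))

identity = np.eye(width)
plain = np.eye(width)     # accumulated Jacobian of a stack of y = f(x) layers
residual = np.eye(width)  # accumulated Jacobian of a stack of y = x + f(x) blocks

for _ in range(depth):
    J = random_block_jacobian()
    plain = J @ plain
    residual = (identity + J) @ residual

plain_norm = float(np.linalg.norm(plain, 2))
residual_norm = float(np.linalg.norm(residual, 2))
print(f"plain stack    : spectral norm of end-to-end Jacobian = {plain_norm:.3e}")
print(f"residual stack : spectral norm of end-to-end Jacobian = {residual_norm:.3e}")
```

The plain product is bounded above by 0.5^50 and is numerically zero. The residual product keeps a usable magnitude because every factor contains the identity.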
How the phase is sequenced
Theory pages → labs → experiments, in this dependency order:
- Theory 01 (init) is read first because labs 00 and 01 need it.
- Lab 00 (variance walk) is a pure observation lab: `for layer in net: print(Var(activations))`. See the explosion.
- Lab 01 (init ablation): the headline three-curve experiment. Same architecture, three inits, three loss trajectories.
- Theory 02 (norm) then unlocks lab 02 (norm ablation, same architecture, three norms).
- Theory 03 (residual) unlocks lab 03 (50-layer with-and-without residuals).
- Theory 04 ties them together; no new code, just a synthesis that previews the transformer block.
What "predict before running" looks like
By the end of Phase 10, given this config:
"30-layer MLP, hidden dim 256, GeLU activation, RMSNorm Pre-LN, residual on every block, Kaiming init scaled for GeLU. Will it train?"
Borja should be able to answer: yes, and predict the loss trajectory shape (smooth descent, no spike). Given this config:
"30-layer MLP, hidden dim 256, ReLU, no norm, no residual, uniform init `[-1, 1]`."
He should answer: no, and predict the failure mode (forward-pass blow-up → NaN by step 50).
If those predictions land — before running — the phase has succeeded.
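The second prediction can be backed by arithmetic before any training run. The sketch below applies the fan-in variance rule for ReLU layers to the two weight distributions; the helper names are hypothetical, and the calculation covers the forward pass at initialization only, not the step at which a real run diverges.

```python
import math

def uniform_variance(low, high):
    """Variance of a uniform distribution on [low, high]."""
    return (high - low) ** 2 / 12.0

def relu_variance_gain(n_in, weight_var):
    """Per-layer variance multiplier for a ReLU layer: n_in * Var(W) / 2."""
    return n_in * weight_var / 2.0

n_in, depth = 256, 30

bad_gain = relu_variance_gain(n_in, uniform_variance(-1.0, 1.0))  # uniform init on [-1, 1]
good_gain = relu_variance_gain(n_in, 2.0 / n_in)                  # Kaiming init for ReLU

log10_growth_bad = depth * math.log10(bad_gain)

print(f"uniform[-1, 1]: gain per layer = {bad_gain:.2f}, "
      f"variance growth over {depth} layers ~ 10^{log10_growth_bad:.1f}")
print(f"Kaiming       : gain per layer = {good_gain:.2f}")
```

A per-layer gain far above 1 compounds geometrically with depth, which is the quantitative form of "forward-pass blow-up"; a gain of exactly 1 is what variance-preserving init is designed to deliver.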
Next: theory/01-initialization.md.