
00: Why depth is hard, and the three tricks that fix it

Stacking 20 layers in a naively designed network does not work: either the activations explode, or the gradient vanishes, or the internal distributions shift so much that each layer trains against a different input. Three tricks (init, norm, residual) neutralized those three failure modes and are the reason modern deep learning works.

The thing the textbook plot doesn't show

Universal approximation says: a 2-layer MLP (one hidden layer) with enough hidden units can approximate any continuous function on a compact input domain to any desired accuracy. From that theorem, "deep learning" sounds redundant: why not just use very wide 2-layer nets?

The answer is empirical, not theoretical. At a fixed parameter count, wide, shallow nets tend to be less sample-efficient and harder to optimize than deep, narrow nets. The deep nets tend to generalize better, train with smaller batches, and produce more compressed, reusable representations. The catch is that naive deep nets don't train at all.

If you take the 2-layer MLP from Phase 9 and stack it 20 layers deep with random Gaussian init and no normalization, three things go wrong, often simultaneously:

Forward-pass blow-up. Each layer multiplies activations by a matrix; with bad init, the activation norm grows by a roughly constant factor every layer (say, doubling). By layer 10, tanh is saturated everywhere, or ReLU activations are heading toward overflow. Layer 20 is meaningless noise.
Backward-pass vanishing. Gradients are products of Jacobians, layer by layer. If each Jacobian has norm < 1, the layer-1 gradient is tiny after 20 multiplications. The first layers never update.
Internal covariate shift. Even when forward + backward are healthy at initialization, training one layer changes the distribution that the next layer sees. The next layer was tuned to the previous distribution; now it's tuning against a moving target. Slow.
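The first failure mode is easy to observe numerically. The following is a minimal sketch, not the lab code: it assumes a plain ReLU MLP with square Gaussian weight matrices, and the names `variance_walk`, `weight_std`, the width of 256, and the batch size of 1024 are illustrative choices rather than anything fixed by this phase.

```python
import numpy as np

def variance_walk(depth=20, width=256, weight_std=1.0, seed=0):
    """Return the activation variance after each layer of a plain ReLU MLP."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((1024, width))  # batch of unit-variance inputs
    variances = []
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * weight_std
        x = np.maximum(x @ W, 0.0)  # linear layer followed by ReLU
        variances.append(float(x.var()))
    return variances

naive = variance_walk(weight_std=1.0)                   # Var(W) = 1
scaled = variance_walk(weight_std=np.sqrt(2.0 / 256))   # Var(W) = 2 / n_in
for layer in (0, 9, 19):
    print(f"layer {layer + 1:2d}: naive={naive[layer]:.3e}  scaled={scaled[layer]:.3e}")
```

Under these assumptions the naive column should grow by roughly two orders of magnitude per layer, while the scaled column stays of order one. The sketch only covers the forward pass; the backward and covariate-shift failures need a training loop to observe.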

The three structural tricks of Phase 10 are the three fixes:

Failure mode	Fix
Forward-pass blow-up	Variance-preserving initialization: choose `Var(W) = 1/n_in` (or `2/n_in` for ReLU) so `Var(y) = Var(x)`.
Backward-pass vanishing	Residual connections: `y = x + f(x)` has Jacobian `I + ∂f/∂x`, so the identity path carries the gradient back unattenuated whatever `f` does.
Internal covariate shift	Normalization: re-center / re-scale activations to a known distribution at each layer.

Each trick is, in isolation, a small idea. The combination (Pre-LN residual blocks with Kaiming init) is what makes a 30-layer transformer trainable on a fixed budget. We build them in the order init, normalization, residual.

What "variance-preserving" actually says

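Take a layer y = W x with W ∈ R^{n_out × n_in}. Assume the W_ij are i.i.d. with mean zero and variance σ², and that x is independent of W with E[x_j] = 0 and Var(x_j) = v (the same for all j). Each product W_ij x_j then has mean zero and variance σ² v, and the products are uncorrelated across j because the W_ij are independent and mean-zero, so their variances add: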

Var(y_i) = Var(Σ_j W_ij x_j) = Σ_j Var(W_ij x_j) = n_in × σ² × v

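If we want Var(y_i) = v (preserve variance across the layer), we need σ² = 1 / n_in. That is the fan-in form of Xavier init (also called LeCun init; Glorot's original proposal averages fan-in and fan-out, σ² = 2 / (n_in + n_out)), and it suits linear or tanh activations, since tanh is close to linear near zero. For ReLU, half the pre-activations are zeroed on average, so the second moment E[x_j²] that the next layer sees is halved, and we need σ² = 2 / n_in to compensate. That's Kaiming init. The whole story of "initialization theory" for MLPs is variations on this 3-line argument.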
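The argument can be checked on a single layer. This is a rough numerical sketch under the same assumptions as the derivation (mean-zero Gaussian weights, mean-zero unit-variance inputs); the helper `output_second_moment` and the sizes 512 and 4096 are hypothetical choices made for the illustration. It reports the second moment E[y²] because that is the quantity the next layer's variance depends on.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, batch = 512, 512, 4096
x = rng.standard_normal((batch, n_in))  # mean zero, variance v = 1

def output_second_moment(weight_var, relu=False):
    """Mean of y**2 after one layer whose weights have the given variance."""
    W = rng.standard_normal((n_in, n_out)) * np.sqrt(weight_var)
    y = x @ W
    if relu:
        y = np.maximum(y, 0.0)
    return float(np.mean(y ** 2))

linear_fan_in = output_second_moment(1.0 / n_in)            # should be close to 1
relu_fan_in = output_second_moment(1.0 / n_in, relu=True)   # should be close to 0.5
relu_kaiming = output_second_moment(2.0 / n_in, relu=True)  # should be close to 1
print(linear_fan_in, relu_fan_in, relu_kaiming)
```

The middle number is the reason for the factor of 2: with σ² = 1 / n_in, a ReLU layer hands the next layer about half the signal it received, and that loss compounds with depth.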

Theory page 01-initialization.md walks the full derivation with the assumptions made explicit. Borja should be able to redo it on a whiteboard.

What "normalization" actually says

A normalization layer takes the activations at some point in the network and rescales them so the next layer always sees a consistent distribution. There are choices:

BatchNorm. Compute mean and variance over the batch dimension. Each unit is normalized using the batch's statistics. Train-time stats ≠ inference-time stats; running averages bridge them. Standard in vision; modern LLMs don't use it.
LayerNorm. Compute mean and variance over the feature dimension, per example. No train/inference divergence. Standard in transformers, but it needs two statistics per example (a mean and a variance) plus a mean subtraction and a scale.
RMSNorm. Drop the mean subtraction. Just divide by RMS (sqrt(mean(x²))). One statistic instead of two, so less arithmetic and fewer reduction passes, with comparable training stability. Standard in modern LLMs (Llama, Mistral, … 2026).

The deeper question — why does removing the mean not hurt — is empirical. RMSNorm papers argue it's the re-scaling that mattered all along; centering was incidental. We discuss the argument in 02-normalization.md without claiming a proof.
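To make the difference between the last two options concrete, here is a minimal NumPy sketch of both. It deliberately omits the learned gain and bias that real layers carry, and the function names `layer_norm` and `rms_norm` and the `eps` value are assumptions of this example, not the lab's API.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned gain or bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def rms_norm(x, eps=1e-5):
    """Divide each row by its root mean square; the mean is left untouched."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms

rng = np.random.default_rng(0)
x = 3.0 + 5.0 * rng.standard_normal((4, 256))  # rows with nonzero mean, large scale

ln, rn = layer_norm(x), rms_norm(x)
print("LayerNorm row means:", ln.mean(axis=-1).round(3))
print("RMSNorm   row means:", rn.mean(axis=-1).round(3))
print("RMSNorm   row RMS:  ", np.sqrt((rn ** 2).mean(axis=-1)).round(3))
```

Both outputs have a controlled scale; only LayerNorm removes the offset. Whether that offset matters for training is the empirical question discussed above.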

What "residual" actually says

A residual block is y = x + f(x), where f is some sub-network (e.g., a linear layer + activation). Two things follow:

Forward identity in the limit. If f is initialized to output zero (or near zero — see "Pre-LN init γ = 0 trick" in 03-residuals.md), the block computes y ≈ x. Stacking 100 of them produces a network that, at initialization, is just an identity function. Then training improves the identity by adding small corrections. The optimization landscape is far easier than a network that starts as random noise.
Backward gradient highway. ∂y/∂x = I + ∂f/∂x. The identity term means the gradient always has a route back to earlier layers, regardless of what f is doing. Vanishing gradients become improbable. This is the single most important architectural insight in modern deep learning, and it took until 2015 (ResNet) to land.

03-residuals.md proves the gradient highway claim and walks through Pre-LN vs Post-LN.
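A quick numerical illustration of the highway, under a strong simplification: each block is taken to be purely linear, f(x) = W x, so the end-to-end Jacobian is just a matrix product. The function `backprop_gain`, the depth of 50, the width of 64, and the weight scale 0.02 are all hypothetical choices for this sketch.

```python
import numpy as np

def backprop_gain(depth=50, width=64, weight_std=0.02, residual=False, seed=0):
    """Largest singular value of the end-to-end Jacobian for linear blocks f(x) = W x."""
    rng = np.random.default_rng(seed)
    jac = np.eye(width)
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * weight_std
        block_jac = np.eye(width) + W if residual else W  # I + df/dx versus df/dx
        jac = block_jac @ jac
    return float(np.linalg.norm(jac, ord=2))

plain = backprop_gain(residual=False)
skip = backprop_gain(residual=True)
print(f"plain stack:    {plain:.3e}")
print(f"residual stack: {skip:.3e}")
```

With small weights the plain product should collapse toward zero while the residual product stays of order one. Nonlinear blocks change the numbers but not the structure: the identity term is present in every factor.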

How the phase is sequenced

Theory pages → labs → experiments, in this dependency order:

Theory 01 (init) is read first because labs 00 and 01 need it.
Lab 00 (variance walk) is a pure observation lab: `for layer in net: print(Var(activations))`. See the explosion.
Lab 01 (init ablation) — the headline three-curve experiment. Same architecture, three inits, three loss trajectories.
Theory 02 (norm) then unlocks lab 02 (norm ablation, same architecture, three norms).
Theory 03 (residual) unlocks lab 03 (50-layer with-and-without residuals).
Theory 04 ties them together; no new code, just a synthesis that previews the transformer block.

What "predict before running" looks like

By the end of Phase 10, given this config:

"30-layer MLP, hidden dim 256, GeLU activation, RMSNorm Pre-LN, residual on every block, Kaiming init scaled for GeLU. Will it train?"

Borja should be able to answer: yes, and predict the loss trajectory shape (smooth descent, no spike). Given this config:

"30-layer MLP, hidden dim 256, ReLU, no norm, no residual, uniform init [-1, 1]."

He should answer: no, and predict the failure mode (forward-pass blow-up → NaN by step 50).
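The "no" can be backed by arithmetic before any training run. The sketch below applies the variance argument from the initialization section to this config, assuming roughly symmetric pre-activations so that ReLU halves the second moment; it estimates growth at initialization only and says nothing about the exact step at which NaNs appear.

```python
import math

n_in = 256
weight_var = (1 - (-1)) ** 2 / 12     # variance of Uniform[-1, 1] is 1/3
gain = n_in * weight_var / 2          # per-layer second-moment gain with ReLU
depth = 30

log10_growth = depth * math.log10(gain)
print(f"per-layer gain: {gain:.1f}")
print(f"second moment grows by about 10^{log10_growth:.0f} over {depth} layers")
```

A per-layer gain far above 1 is the signature of forward-pass blow-up; a variance-preserving init would put it at 1.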

If those predictions land — before running — the phase has succeeded.

Next: theory/01-initialization.md.