04 — Putting init, norm, and residual together

The three tricks are not additive; they interact. A good initialization inside a residual Pre-LN block with RMSNorm is the canonical 2026 recipe. Here we show how the pieces fit together, which combinations are redundant, and what the loss curve looks like when everything is right.

The canonical 2026 block

                        Pre-LN block
                       ──────────────
   x  ──┬──────────────────────────►  +  ──► y
        │                              ▲
        │                              │
        └──► RMSNorm ──► f ─────────────┘

   where f is e.g.   Linear ──► GeLU ──► Linear
                     │                   │
                     │                   └── output projection: init weights small (or 0)
                     └── Kaiming init scaled for GeLU's effective gain (~1.7)

That's the assembly. Every modern LLM block looks roughly like this — sometimes with attention as the inner f, sometimes a feedforward sub-block. Phase 17 will instantiate this with attention; Phase 10 builds the components.
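The diagram can be read as a few lines of NumPy. This is a minimal sketch, not the curriculum's implementation: the function names, the tanh approximation of GeLU, the omitted biases, and the sizes `d` and `hidden` are all assumptions made for illustration.

```python
import numpy as np

def rmsnorm(x, gamma, eps=1e-6):
    # Divide by the root-mean-square over the feature axis (eps inside the sqrt).
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def pre_ln_block(x, gamma, w_in, w_out):
    # y = x + f(RMSNorm(x)), with f = Linear -> GeLU -> Linear (biases omitted)
    normed = rmsnorm(x, gamma)
    return x + gelu(normed @ w_in) @ w_out

rng = np.random.default_rng(0)
d, hidden = 16, 64                       # illustrative sizes, not from the text
x = rng.normal(size=(4, d))
gamma = np.ones(d)
w_in = rng.normal(scale=np.sqrt(2.0 / d), size=(d, hidden))
w_out = rng.normal(scale=0.01, size=(hidden, d))   # small output projection

y = pre_ln_block(x, gamma, w_in, w_out)
branch = y - x                           # contribution of the learned branch
print(y.shape, float(np.abs(branch).max()))
```

Note that the norm touches only the branch input; the skip path carries `x` to the sum unchanged.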

Six knobs:

Activation — GeLU (default), SiLU/Swish (Llama), or ReLU (toy). Each has its own gain for the init.
Init — Kaiming with the right gain for the activation.
Norm placement — Pre-LN, always.
Norm type — RMSNorm in modern LLMs; LayerNorm for compatibility / classical research; BatchNorm only for vision.
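Residual scale — typically unit (no scaling). For very deep nets, some recipes shrink the residual branch instead: GPT-2 scales the output-projection init by 1/sqrt(N) for N residual layers, ReZero (Bachlechner et al.) multiplies f's output by a learnable scalar initialized to 0, and Fixup (Zhang et al.) rescales the init inside each branch as a function of depth.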
Output projection init — small or zero, so the block starts as identity.
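One way to keep the six knobs in view is to write them down as a config object. The sketch below is hypothetical: the class name `BlockConfig`, its field names, and the default values are illustrative choices that mirror the list above, not an API from the curriculum.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BlockConfig:
    activation: str = "gelu"           # "gelu", "silu", or "relu"
    init: str = "kaiming"              # gain is matched to `activation`
    norm_placement: str = "pre"        # Pre-LN
    norm_type: str = "rmsnorm"         # "rmsnorm", "layernorm", or "batchnorm"
    residual_scale: float = 1.0        # unit: f's output is not rescaled
    out_proj_init_scale: float = 0.0   # 0.0 makes the block start as identity

    def __post_init__(self):
        # Reject values outside the options named in the list above.
        if self.activation not in {"gelu", "silu", "relu"}:
            raise ValueError(f"unknown activation: {self.activation}")
        if self.norm_placement not in {"pre", "post"}:
            raise ValueError(f"unknown norm placement: {self.norm_placement}")
        if self.norm_type not in {"rmsnorm", "layernorm", "batchnorm"}:
            raise ValueError(f"unknown norm type: {self.norm_type}")

canonical = BlockConfig()
silu_variant = BlockConfig(activation="silu")
print(canonical)
print(silu_variant.activation)
```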

Interactions between the three tricks

These don't compose trivially. Some combinations are redundant; some are necessary together.

Init + Norm

A good norm partially compensates for a bad init: the first norm rescales the exploding activations before the explosion propagates. But the gradients still degenerate because the weights themselves are badly scaled, even though the activations have been renormalized. So:

Good init alone: forward fine for ~10 layers; degrades by ~30 layers.
Good norm alone: forward fine forever (norm forces it); backward weight gradients can still misbehave if init was bad enough to cause early non-linearity saturation.
Good init + good norm: clean for any depth tested.
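The forward half of this claim is easy to check numerically. The sketch below pushes a random batch through a plain ReLU stack whose weights are deliberately 4x too large, with and without an RMSNorm on each layer's input. The widths, depth, seed, and helper names are assumptions for illustration, and it says nothing about the backward pass, which is where a norm cannot fully cover for a bad init.

```python
import numpy as np

def rms(x):
    return float(np.sqrt(np.mean(x * x)))

def forward_rms(depth, width, weight_std, use_norm, seed=0):
    """Record the RMS of each layer's input in a plain ReLU stack."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(32, width))
    trace = []
    for _ in range(depth):
        if use_norm:
            # RMSNorm with gamma = 1, applied to the layer input
            x = x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + 1e-6)
        trace.append(rms(x))
        w = rng.normal(scale=weight_std, size=(width, width))
        x = np.maximum(x @ w, 0.0)
    return trace

width, depth = 64, 30
bad_std = 4.0 * np.sqrt(2.0 / width)     # 4x the Kaiming std for ReLU

no_norm = forward_rms(depth, width, bad_std, use_norm=False)
with_norm = forward_rms(depth, width, bad_std, use_norm=True)
print(f"no norm  : first {no_norm[0]:.2f}, last {no_norm[-1]:.3e}")
print(f"with norm: first {with_norm[0]:.2f}, last {with_norm[-1]:.2f}")
```

Without the norm the scale grows by roughly the same factor at every layer; with it, every layer sees a unit-RMS input regardless of how badly the weights are scaled.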

This is why the headline experiment in lab 01 keeps norm constant (off, or LayerNorm) and varies init. The variation must be visible without norm covering for it.

Init + Residual

Residual on a badly initialized stack: the residual path preserves the gradient signal, but if f is unstable the forward signal is still dominated by the learned branch. The gradient is healthy, yet every block adds noise to the residual stream, so the network is effectively learning against a noisy target. Training is slow.

Residual + identity init for f (output projection ≈ 0): clean. Every block starts as the identity, its Jacobian is close to the identity too, and training proceeds smoothly.
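Both cases can be sketched with a stack of untrained Pre-LN blocks. The code below assumes an RMSNorm with gain 1, the tanh approximation of GeLU, no biases, and hypothetical names (`run_stack`, `out_std`); an output-projection std of 1.0 stands in for "badly initialized".

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def run_stack(x, depth, out_std, seed=0):
    """Apply `depth` untrained Pre-LN blocks; out_std sets the output-projection scale."""
    rng = np.random.default_rng(seed)
    h = x.shape[-1]
    for _ in range(depth):
        w_in = rng.normal(scale=np.sqrt(2.0 / h), size=(h, 4 * h))
        w_out = out_std * rng.normal(size=(4 * h, h))
        x = x + gelu(rmsnorm(x) @ w_in) @ w_out
    return x

def rms(x):
    return float(np.sqrt(np.mean(x * x)))

x0 = np.random.default_rng(1).normal(size=(8, 32))
identity_start = run_stack(x0, depth=24, out_std=0.0)   # output projection = 0
bad_start = run_stack(x0, depth=24, out_std=1.0)        # oversized output projection

print(np.array_equal(identity_start, x0))
print(rms(x0), rms(bad_start))
```

With a zero output projection the 24-block stack returns its input unchanged; with the oversized one the residual stream is swamped by what the untrained branches add.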

Norm + Residual

This is the most subtle interaction — and it's where Pre-LN vs Post-LN lives. Re-read theory 03 if needed.
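For reference, the two placements in their standard form (assumed here to match the notation of theory 03, with \(f_l\) the learned branch of block \(l\)):

\[
\text{Pre-LN:}\quad x_{l+1} = x_l + f_l\big(\mathrm{Norm}(x_l)\big),
\qquad
\text{Post-LN:}\quad x_{l+1} = \mathrm{Norm}\big(x_l + f_l(x_l)\big).
\]

In Pre-LN the identity term reaches the output without passing through a norm, so the block Jacobian is \(I + \partial f_l(\mathrm{Norm}(x_l)) / \partial x_l\). In Post-LN the norm sits on the sum, so its Jacobian multiplies the identity path as well.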

All three

Their combination is the canonical block. Anything missing is a regression. Anything extra (e.g., two norms in a block) is bloat.

A worked example — 24-layer MLP

Let's build the 24-layer MLP we'd want to train.

Architecture:

Input → Linear(d, h) → Block × 24 → Linear(h, num_classes)

Block: x ──► RMSNorm ──► Linear(h, 4h) ──► GeLU ──► Linear(4h, h) ──► (add) ──► y
                                                                         ▲
       └─────────────────────────────────────────────────────────────────┘

With h = 256, each block has 256 × 1024 + 1024 × 256 ≈ 524k weights, so 24 blocks ≈ 13M params. Reasonable for CPU training on the Phase 12 corpus.
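The count can be reproduced in a few lines. The bias and RMSNorm terms are assumptions (the text counts only the two weight matrices), and the input and output Linear layers are left out because d and num_classes are not specified.

```python
h = 256
layers = 24

weights_per_block = h * (4 * h) + (4 * h) * h   # the two Linear weight matrices
biases_per_block = 4 * h + h                    # assumes both Linears carry a bias
norm_per_block = h                              # RMSNorm gain vector

per_block = weights_per_block + biases_per_block + norm_per_block
total_blocks = layers * per_block

print(weights_per_block)   # 524288
print(per_block)           # 525824
print(total_blocks)        # 12619776
```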

Init:

Inner Linear(h, 4h): Kaiming with GeLU gain ≈ 1.7 → \(\sigma_W^2 = 1.7 \cdot 2 / 256 \approx 0.013\).
Output Linear(4h, h): same rule, optionally scaled down for an identity start.
RMSNorm \(\gamma\): init to 1 (standard).
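A sketch of this init in NumPy, taking the variance formula above at face value. The function name `init_block` and its parameters are hypothetical, and applying the "same" rule to the output Linear with its own fan-in of 4h is an assumption of this sketch.

```python
import numpy as np

def init_block(h, rng, act_factor=1.7, out_scale=1.0):
    """Sample one block's parameters; out_scale=0.0 gives the identity start."""
    var_in = act_factor * 2.0 / h            # inner Linear, fan_in = h
    var_out = act_factor * 2.0 / (4 * h)     # output Linear, fan_in = 4h (assumption)
    w_in = rng.normal(scale=np.sqrt(var_in), size=(h, 4 * h))
    w_out = out_scale * rng.normal(scale=np.sqrt(var_out), size=(4 * h, h))
    gamma = np.ones(h)                       # RMSNorm gain starts at 1
    return w_in, w_out, gamma

rng = np.random.default_rng(0)
w_in, w_out, gamma = init_block(256, rng)
print(1.7 * 2 / 256, float(w_in.var()))      # target vs. empirical variance

_, w_out_zero, _ = init_block(256, rng, out_scale=0.0)
print(float(np.abs(w_out_zero).max()))       # 0.0: the block starts as identity
```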

Pre-LN check: norm goes first inside the block; residual sums the original input with the inner output. ✓

Expected training: if Phase 18's optimizer is Adam with LR ~3e-4, loss should decrease smoothly from step 0 with no learning-rate warmup needed. Predict this; verify in Phase 17 when we actually train it.

Common failure interactions

Eight things that go subtly wrong:

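Kaiming init + tanh activation. The factor of 2 compensates for ReLU zeroing half its inputs; tanh does not do that, so the pre-activations grow layer by layer until tanh saturates. Use Xavier with tanh.
Xavier init + ReLU activation. For square layers Xavier gives half of Kaiming's weight variance, so the activation scale shrinks at every layer and the signal collapses. Use Kaiming with ReLU.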
Pre-LN + Post-LN mixed in the same model. Bafflingly common in legacy code. The mix produces inconsistent gradient scaling per block. Pick one.
Initialized γ = 0 in RMSNorm. The output is zero everywhere. The network has no signal. Init γ = 1.
ε outside the sqrt (covered in theory 02). Silent failure on degenerate inputs.
BatchNorm in a transformer. Don't. Train/inference mismatch + tiny effective batches (in autoregressive inference).
No residual on the embedding-to-first-block hop. Most papers do have a residual or projection there. Skipping it makes the first block's gradient leak.
Double normalization — applying LayerNorm at both the block input and the block output (i.e., Pre-LN and Post-LN). Bloat; ill-conditioned.

Each of these is a quiz target.
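The Xavier-with-ReLU item from the list above is quick to reproduce. The sketch below compares the two inits on a square ReLU stack with no norm and no residual; the depth, width, seed, and function name are assumptions for illustration.

```python
import numpy as np

def final_rms_ratio(weight_var, depth=30, width=256, seed=0):
    """RMS of the last layer's output divided by the RMS of the input."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(64, width))
    start = np.sqrt(np.mean(x * x))
    for _ in range(depth):
        w = rng.normal(scale=np.sqrt(weight_var), size=(width, width))
        x = np.maximum(x @ w, 0.0)
    return float(np.sqrt(np.mean(x * x)) / start)

width = 256
kaiming = final_rms_ratio(2.0 / width)            # Kaiming: 2 / fan_in
xavier = final_rms_ratio(2.0 / (width + width))   # Xavier: 2 / (fan_in + fan_out)

print(f"Kaiming + ReLU: {kaiming:.3f}")
print(f"Xavier  + ReLU: {xavier:.3e}")
```

With Xavier the mean square halves at each ReLU layer in expectation, so after 30 layers the RMS is down by a factor on the order of 2^15.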

The "predict loss curve" exercise

After Phase 10, given a config, you should be able to sketch the loss curve before training.

Config A: "12-layer MLP, ReLU, Kaiming init, no norm, no residual, batch 64, LR 1e-3, 1000 steps."

Prediction: loss drops fast for ~200 steps, then plateaus (gradient at layer 1 vanishes). The final loss sits between the random-guess loss and the loss of a well-trained model: not divergent, not optimal.

Config B: "12-layer MLP, ReLU, uniform[-0.5, 0.5] init, no norm, no residual, batch 64, LR 1e-3, 1000 steps."

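Prediction: the forward pass blows up, because uniform[-0.5, 0.5] has variance 1/12, far above Kaiming's 2/fan_in for any realistic width. Loss goes to NaN within 5–20 steps, or, if something saturates or clips first, plateaus near the random-guess loss.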

Config C: "30-layer MLP, GeLU, Kaiming(gain=1.7) init, RMSNorm Pre-LN, residual, batch 64, LR 3e-4, 1000 steps."

Prediction: smooth monotonic decrease, no spike, no warmup needed. Final loss close to a 2-layer baseline if the task is easy, lower if the task is hard enough to need depth.
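The gap between Config A and Config B comes down to a one-line estimate: for a ReLU layer with zero-mean weights, the mean square of the activations is multiplied by roughly fan_in × Var(W) / 2 per layer. The sketch below evaluates that factor for both inits. The width of 256 is an assumption (neither config states one), and the function name is hypothetical.

```python
def relu_layer_gain(weight_var, fan_in):
    # Expected per-layer multiplier on the mean square of the activations:
    # the pre-activation variance is fan_in * weight_var * (input mean square),
    # and ReLU keeps half of it.
    return fan_in * weight_var / 2.0

fan_in = 256                                  # assumed width
depth = 12

kaiming_var = 2.0 / fan_in                    # Config A
uniform_var = (0.5 - (-0.5)) ** 2 / 12.0      # Config B: Var of U[-0.5, 0.5] = 1/12

gain_a = relu_layer_gain(kaiming_var, fan_in)
gain_b = relu_layer_gain(uniform_var, fan_in)

print(gain_a, gain_a ** depth)   # per-layer and 12-layer factor for Config A
print(gain_b, gain_b ** depth)   # per-layer and 12-layer factor for Config B
```

A per-layer factor of 1 keeps the forward scale flat; a factor of about 10 compounds to an enormous scale over 12 layers, which is what makes the blow-up prediction for Config B safe.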

If Borja can issue these three predictions confidently, Phase 10 has succeeded.

A note on what we deliberately don't cover

Optimizer interactions (AdamW vs SGD-momentum) — Phase 9 lays the groundwork, Phase 19 goes deeper.
Learning rate schedules (warmup, cosine decay) — Phase 19. Pre-LN reduces the need for warmup but doesn't replace schedule entirely.
Weight decay interactions with normalization. Subtle. Phase 19.
Gradient clipping. Phase 19.
Mixed precision. Phase 26.

Each of those interacts with init/norm/residual in non-trivial ways. We name them now so they don't surprise later.

One-paragraph recap

The three tricks aren't independent. Good init keeps the forward signal sane before the norm has to clean it; the norm keeps it sane over training as weights drift; the residual keeps the gradient flowing back regardless of what the learned branch does. The canonical 2026 block is Pre-LN RMSNorm + Kaiming-scaled init + residual + GeLU/SiLU. Variations on this recipe are all minor; the recipe itself is what makes deep LLMs trainable. Going forward, every Phase 10+ architecture in the curriculum uses some version of it.

Next: lab/00-variance-walk.md.