Break 00 — Remove the residual connection from a deep MLP¶
We break the "gradient highway". Expect several symptoms: a loss stuck at NaN, a per-layer ranking in which the deep layers show grad ≈ 0, and a training curve that does not go down. The exercise should convince you of why residuals are the ingredient that makes deep networks possible, not a trick.
Anchors:
LYNX_CORTEX.md §4 / PHASE 10; this phase's theory §03 (residuals); .claude/commands/break.md.
The break¶
In src/minimodel/nn/blocks.py (the file where Lab 03's 50-layer MLP block lives):
class ResidualBlock(Module):
    def __init__(self, d: int) -> None:
        super().__init__()
        self.norm = RMSNorm(d)
        self.fc1 = Linear(d, 4 * d)
        self.fc2 = Linear(4 * d, d)

    def forward(self, x: Tensor) -> Tensor:
        # Pre-norm MLP branch: norm -> fc1 -> GELU -> fc2.
        h = self.fc2(self.fc1(self.norm(x)).gelu())
        # BUG: removed the residual.
        # return x + h   <- restore this
        return h
Single-line removal. Compose 50 of these into a chain.
Predict, then run¶
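The forward-pass variance through a residual stack, with x_ℓ = x_{ℓ-1} + f_ℓ(x_{ℓ-1}) and the branch output roughly uncorrelated with its input, is:

$$\operatorname{Var}(x_L) \approx \operatorname{Var}(x_0) + \sum_{\ell=1}^{L} \operatorname{Var}\big(f_\ell(x_{\ell-1})\big)$$

It grows linearly with depth, and stays bounded if each f_ℓ is variance-preserving. Without the residual, x_ℓ = f_ℓ(x_{ℓ-1}) and:

$$\operatorname{Var}(x_L) = \operatorname{Var}(x_0) \prod_{\ell=1}^{L} \mathrm{gain}_\ell, \qquad \mathrm{gain}_\ell = \frac{\operatorname{Var}\big(f_\ell(x_{\ell-1})\big)}{\operatorname{Var}(x_{\ell-1})}$$

If gain_ℓ ≈ 1, OK. If gain_ℓ is even slightly off (0.95 or 1.05), the variance decays to about 8% of its input value (0.95^50 ≈ 0.08) or grows about 11-fold (1.05^50 ≈ 11.5) by layer 50.
Backward-pass: the gradient w.r.t. x_0 is

$$\frac{\partial \mathcal{L}}{\partial x_0} = \frac{\partial \mathcal{L}}{\partial x_L} \prod_{\ell=1}^{L} \frac{\partial x_\ell}{\partial x_{\ell-1}}$$

Without the residual, each factor is ∂f_ℓ/∂x, typically near unit gain but never exactly 1. Compound 50 of those and you get vanishing or exploding gradients: Phase 14's vanishing-gradient problem revived in MLP form.
With the residual, each factor is I + ∂f_ℓ/∂x, and the identity guarantees a unit-gain path.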
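To make the compounding concrete, here is a minimal scalar sketch in plain Python. It is a toy model, not the lab's code: the helper names `compounded_gain` and `residual_path_scale` are hypothetical, each layer's Jacobian is collapsed to a single number, and drawing the branch derivative from a zero-mean Gaussian with standard deviation 0.05 is an assumption made only for illustration.

```python
import random

DEPTH = 50  # number of stacked blocks, as in this break


def compounded_gain(per_layer_gain: float, depth: int) -> float:
    """Scale after `depth` layers that each multiply the signal by `per_layer_gain`."""
    return per_layer_gain ** depth


def residual_path_scale(depth: int, branch_std: float, seed: int = 0) -> float:
    """Scalar stand-in for prod(I + df/dx): each factor is 1 + eps, eps ~ N(0, branch_std^2)."""
    rng = random.Random(seed)
    scale = 1.0
    for _ in range(depth):
        scale *= 1.0 + rng.gauss(0.0, branch_std)
    return scale


# Without the residual: every factor is slightly off from 1.
shrink = compounded_gain(0.95, DEPTH)
explode = compounded_gain(1.05, DEPTH)

# With the residual: every factor is 1 plus a small perturbation (assumed distribution).
residual = residual_path_scale(DEPTH, branch_std=0.05)

print(f"no residual, gain 0.95: {shrink:.4f}")
print(f"no residual, gain 1.05: {explode:.4f}")
print(f"residual, 1 + eps:      {residual:.4f}")
```

Under these assumptions the two non-residual products land near 0.08 and 11.5, while the residual product typically stays within an order of magnitude of 1, because the perturbations around the identity do not all push in the same direction.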
Predictions¶
- Loss at step 100 with residuals: decreasing, ~80% of initial.
- Loss at step 100 without residuals: flat near initial loss, or NaN (if init magnitude is mistuned).
- ∇_{W1} (deepest layer's weight gradient) magnitude with residuals: ~1e-3.
- ∇_{W1} magnitude without residuals: <1e-7 (vanishing) or >1e+5 (exploding).
- Train curves: with-residual reaches >85% val acc in ~500 steps; without-residual either doesn't converge or takes 10× longer.
Write your predictions in learners/borja/phase-10/notes/breaks.md before running.
Observe¶
Diagnostics:
- Per-layer gradient norm: log on every step. Should be ~1 for residual, drift to 0 or inf without.
- Training-loss curve overlay.
- Histogram of weights after 100 steps — without residuals, most have barely moved (vanishing gradient signature).
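The per-layer gradient-norm diagnostic could be sketched as below. How gradients are read out of minimodel's tensors is not shown in this break, so the sketch assumes you can already flatten each layer's gradient into a plain list of floats; `l2_norm`, `classify`, `grad_report`, and the layer names in the example are hypothetical. The thresholds are borrowed from the predictions in this break and are only a starting point.

```python
import math
from typing import Dict, Iterable, List, Tuple

# Thresholds taken from this break's predictions; tune them for your own run.
VANISH_BELOW = 1e-7
EXPLODE_ABOVE = 1e5


def l2_norm(values: Iterable[float]) -> float:
    """Euclidean norm of a flat sequence of gradient entries."""
    return math.sqrt(sum(v * v for v in values))


def classify(norm: float) -> str:
    """Label a gradient norm as 'vanishing', 'exploding', or 'ok'."""
    # NaN is treated as exploding; inf already compares greater than the threshold.
    if math.isnan(norm) or norm > EXPLODE_ABOVE:
        return "exploding"
    if norm < VANISH_BELOW:
        return "vanishing"
    return "ok"


def grad_report(grads_by_layer: Dict[str, List[float]]) -> Dict[str, Tuple[float, str]]:
    """Map each layer name to (gradient norm, label)."""
    report = {}
    for name, grads in grads_by_layer.items():
        norm = l2_norm(grads)
        report[name] = (norm, classify(norm))
    return report


# Made-up gradients, standing in for one training step of the broken stack.
example = {
    "block00.fc1": [3e-4, 4e-4],
    "block25.fc1": [0.0, 0.0],
    "block49.fc1": [float("inf"), 1.0],
}
report = grad_report(example)
for name, (norm, label) in report.items():
    print(f"{name:12s} norm={norm:.3e} {label}")
```

Logging this report on every step gives the depth profile directly: a healthy residual stack should show roughly constant norms, while the broken one should show labels flipping to vanishing or exploding as depth increases.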
Symptom Borja will see¶
- nan in training loss within 50 steps, or a loss that flat-lines near its initial value.
- The deepest layer's gradient norm reads 0.0 or inf from step 1.
- Training never beats a 1-layer MLP baseline.
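The loss symptoms above can be checked mechanically on a recorded loss curve. The following is a minimal sketch under stated assumptions: the function name `diagnose_loss` and the 0.95 flatness ratio are hypothetical choices (the predictions only say a healthy run reaches about 80% of the initial loss by step 100), and the loss is assumed to be positive.

```python
import math
from typing import Sequence


def diagnose_loss(losses: Sequence[float], flat_ratio: float = 0.95) -> str:
    """Return 'nan', 'flat', or 'decreasing' for a recorded (positive) loss curve."""
    if not losses:
        raise ValueError("need at least one loss value")
    # Any non-finite value means the run has blown up.
    if any(math.isnan(v) or math.isinf(v) for v in losses):
        return "nan"
    # Flat: the last loss is still above flat_ratio times the first one.
    if losses[-1] > flat_ratio * losses[0]:
        return "flat"
    return "decreasing"


# Synthetic curves for illustration only, not measured results.
healthy = [2.30 * (0.8 ** (step / 100)) for step in range(101)]  # ends at 80% of initial
flat = [2.30] * 101
blown_up = [2.30, 2.41, float("nan")]

print(diagnose_loss(healthy))
print(diagnose_loss(flat))
print(diagnose_loss(blown_up))
```

Running the same check on the with-residual and without-residual curves gives a one-word summary to compare against the predictions written down before the run.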
Hidden cause (one sentence)¶
The residual return x + h was replaced with return h, killing the identity gradient highway and re-enabling vanishing/exploding gradients in a 50-layer stack.
Hint cascade¶
- Print per-layer gradient norms during the first training step. Are they constant across layers, or do they shrink/explode with depth?
- Phase 10 §03 derives ∂y/∂x = I + ∂f/∂x for y = x + f(x). What does the I term guarantee? What happens if I is missing?
- Compare ResidualBlock.forward's return statement to the theory derivation in §03.
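For the second hint, a scalar finite-difference check can make the role of the identity term tangible. This is a sketch under stated assumptions, not the derivation from §03: the branch `0.5 * tanh(x)` and the helper names are hypothetical, chosen because tanh saturates and its derivative collapses toward zero for large inputs.

```python
import math
from typing import Callable


def branch(x: float) -> float:
    """Toy branch f(x); tanh saturates, so f'(x) approaches 0 for large |x|."""
    return 0.5 * math.tanh(x)


def with_residual(x: float) -> float:
    return x + branch(x)  # y = x + f(x)


def without_residual(x: float) -> float:
    return branch(x)  # y = f(x)


def central_diff(fn: Callable[[float], float], x: float, h: float = 1e-5) -> float:
    """Numerical derivative of fn at x by central differences."""
    return (fn(x + h) - fn(x - h)) / (2.0 * h)


for x in (0.0, 5.0):
    d_res = central_diff(with_residual, x)
    d_plain = central_diff(without_residual, x)
    print(f"x={x}: dy/dx with residual={d_res:.6f}, without={d_plain:.6f}")
```

At x = 5 the branch derivative is close to zero, so the non-residual path passes almost no gradient, while the residual path still passes a derivative of about 1. The two derivatives differ by exactly the identity term.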
Fix diff¶
def forward(self, x: Tensor) -> Tensor:
    h = self.fc2(self.fc1(self.norm(x)).gelu())
    return x + h  # restored
Why this teaches the concept¶
He et al. (2015) showed that adding identity skip connections lets you train 152-layer networks; without them, plain networks degrade as depth grows (their 56-layer plain network reached higher training error than the 20-layer one). The §A13 task only needs a 2-layer MLP, but Phase 10 forces you to break a 50-layer stack precisely so that you feel why depth needs residuals before Phase 17 stacks 6 transformer blocks (where each block has its own residual highway). The lesson here is identical to the one the Phase 17 lab needs.
Next: Phase 11's /break on monolingual BPE training.