Break 00 — Remove the residual connection from a deep MLP

We break the "gradient highway". We expect several symptoms: a loss that gets stuck at NaN, a per-layer ranking in which the deep layers show grad ≈ 0, and a training curve that does not go down. This exercise should convince you that residuals are the ingredient that makes deep networks possible, not a trick.

Anchors: LYNX_CORTEX.md §4 / PHASE 10; this phase theory §03 residuals; .claude/commands/break.md.


The break

In src/minimodel/nn/blocks.py (the file where Lab 03's 50-layer MLP block lives):

class ResidualBlock(Module):
    def __init__(self, d: int) -> None:
        super().__init__()
        self.norm = RMSNorm(d)
        self.fc1 = Linear(d, 4 * d)
        self.fc2 = Linear(4 * d, d)

    def forward(self, x: Tensor) -> Tensor:
        h = self.fc2(self.fc1(self.norm(x)).gelu())
        # BUG: removed the residual.
        # return x + h   <- restore this
        return h

Single-line removal. Compose 50 of these into a chain.

Predict, then run

The forward-pass variance through a residual stack is:

\[ \mathrm{Var}(x_{L}) \approx \mathrm{Var}(x_0) + \sum_{\ell=1}^{L} \mathrm{Var}(f_{\ell}(x_{\ell-1})) \]

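The approximation treats each branch output as uncorrelated with its input. If each f_ℓ is variance-preserving on its normalized input, every term in the sum has about the same size, so the total grows linearly with depth rather than geometrically. Without the residual: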

\[ \mathrm{Var}(x_{L}) = \prod_{\ell=1}^{L} (\mathrm{gain}_{\ell})^2 \cdot \mathrm{Var}(x_0) \]

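If every gain_ℓ ≈ 1, the variance is preserved. If gain_ℓ is even slightly off (0.95 or 1.05), the squared gains compound: by layer 50 the variance has shrunk to about 0.006× (0.95^100) or grown to about 130× (1.05^100) its initial value.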
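The compounding can be checked with a few lines of arithmetic. This is a minimal sketch that assumes every block has the same gain; `variance_ratio` is a hypothetical helper name, not part of the lab code.

```python
# Without a residual, Var(x_L) = (gain**2)**L * Var(x_0) when all gains are equal.

def variance_ratio(gain: float, depth: int = 50) -> float:
    """Return Var(x_L) / Var(x_0) for a chain of `depth` blocks with equal gain."""
    return (gain ** 2) ** depth


for gain in (0.95, 1.00, 1.05):
    ratio = variance_ratio(gain)
    print(f"gain={gain:.2f}  Var(x_50)/Var(x_0) = {ratio:.3g}")
```

A 5% error per block is invisible in any single layer, which is why the failure only shows up once the stack is deep.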

Backward pass: by the chain rule, the gradient of the loss ℒ w.r.t. x_0 is

\[ \frac{\partial \mathcal{L}}{\partial x_0} = \frac{\partial \mathcal{L}}{\partial x_{L}} \prod_{\ell=1}^{L} \frac{\partial x_{\ell}}{\partial x_{\ell-1}} \]

Without the residual, each factor is ∂f_ℓ/∂x, typically near unit gain but never exactly 1. Compound 50 of those and the gradient vanishes or explodes: Phase 14's vanishing-gradient problem, revived in MLP form.

With the residual, each factor is I + ∂f_ℓ/∂x — and the identity guarantees a unit-gain path.
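One way to see what that guarantee means is to expand the product. The shorthand J_ℓ = ∂f_ℓ/∂x is introduced here for brevity and is not part of the original notation; factors are ordered from the last block down to the first:

\[ (I + J_{L})(I + J_{L-1}) \cdots (I + J_{1}) = I + \sum_{\ell=1}^{L} J_{\ell} + \sum_{\ell > m} J_{\ell} J_{m} + \cdots + J_{L} J_{L-1} \cdots J_{1} \]

The leading I passes the gradient arriving at x_L straight down to x_0, however small the J_ℓ are. Without the residual only the last term survives, and that term is the full product of all 50 Jacobians.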

Predictions

  • Loss at step 100 with residuals: decreasing, ~80% of initial.
  • Loss at step 100 without residuals: flat near initial loss, or NaN (if init magnitude is mistuned).
  • ∇_{W1} (weight gradient of the first layer, the one farthest from the loss) magnitude with residuals: ~1e-3.
  • ∇_{W1} magnitude without residuals: <1e-7 (vanishing) or >1e+5 (exploding).
  • Train curves: with-residual reaches >85% val acc in ~500 steps; without-residual either doesn't converge or takes 10× longer.

Write your predictions in learners/borja/phase-10/notes/breaks.md before running.

Observe

just exp 10-residual-depth --tag broken-no-residual

Diagnostics:

  1. Per-layer gradient norm, logged on every step. It should stay ~1 with residuals and drift to 0 or inf with depth without them.
  2. Training-loss curve overlay.
  3. Histogram of weights after 100 steps — without residuals, most have barely moved (vanishing gradient signature).
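Once the per-layer norms are logged, a small triage helper makes the first diagnostic easier to read. The sketch below is framework-agnostic and makes several assumptions: the thresholds are borrowed from the predictions above (<1e-7 vanishing, >1e+5 exploding), the function names `classify_grad_norm` and `summarize` are hypothetical, and the layer names and values in `example` are made up for illustration, not measured output. How the norms are collected depends on the minimodel API and is not shown.

```python
import math

# Thresholds taken from the Predictions list; adjust them to your own run.
VANISH_BELOW = 1e-7
EXPLODE_ABOVE = 1e5


def classify_grad_norm(norm: float) -> str:
    """Label one gradient norm as 'vanishing', 'exploding', or 'ok'."""
    if math.isnan(norm) or math.isinf(norm) or norm > EXPLODE_ABOVE:
        return "exploding"
    if norm < VANISH_BELOW:
        return "vanishing"
    return "ok"


def summarize(norms_by_layer: dict[str, float]) -> dict[str, list[str]]:
    """Group layer names by the label of their gradient norm."""
    summary: dict[str, list[str]] = {"ok": [], "vanishing": [], "exploding": []}
    for name, norm in norms_by_layer.items():
        summary[classify_grad_norm(norm)].append(name)
    return summary


# Hypothetical values, for illustration only.
example = {"block_00.fc1": 3e-9, "block_25.fc1": 2e-5, "block_49.fc1": 1e-3}
print(summarize(example))
```

NaN is grouped with "exploding" here because in this break a NaN loss usually follows an overflow; that grouping is a design choice, not a rule.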

Symptom Borja will see

  • nan in training loss within 50 steps OR loss flat-lines near initial value.
  • The gradient norm of the layer farthest from the loss (W1) reads 0.0 or inf from step 1.
  • Training never beats a 1-layer MLP baseline.

Hidden cause (one sentence)

The residual return x + h was replaced with return h, killing the identity gradient highway and re-enabling vanishing/exploding gradients in a 50-layer stack.

Hint cascade

  1. Print per-layer gradient norms during the first training step. Are they constant across layers, or do they shrink/explode with depth?
  2. Phase 10 §03 derives ∂y/∂x = I + ∂f/∂x for y = x + f(x). What does the I term guarantee? What happens if I is missing?
  3. Compare ResidualBlock.forward's return statement to the theory derivation in §03.
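The first two hints can be rehearsed on a toy before touching the real model. The sketch below is a scalar stand-in, not the lab's ResidualBlock: each block computes f(x) = w · tanh(x), there is no norm layer, and the weights are deliberately small (|w| < 0.2) so that every branch derivative is far from 1. `run_chain` is a hypothetical name. It backpropagates by hand and returns |∂x_L/∂x_ℓ| for every block input.

```python
import math
import random


def run_chain(weights: list[float], x0: float, residual: bool) -> list[float]:
    """Return |d x_L / d x_l| for each block input x_l in a scalar toy chain.

    Each block computes f(x) = w * tanh(x). With `residual` the block
    returns x + f(x); otherwise it returns f(x) alone.
    """
    # Forward pass: remember every block input for the backward pass.
    inputs = []
    x = x0
    for w in weights:
        inputs.append(x)
        fx = w * math.tanh(x)
        x = x + fx if residual else fx

    # Backward pass: multiply local derivatives from the last block down.
    grads = [0.0] * len(weights)
    upstream = 1.0  # d x_L / d x_L
    for idx in reversed(range(len(weights))):
        df = weights[idx] * (1.0 - math.tanh(inputs[idx]) ** 2)  # f'(x_l)
        upstream *= (1.0 + df) if residual else df
        grads[idx] = abs(upstream)
    return grads


rng = random.Random(0)
weights = [rng.uniform(-0.2, 0.2) for _ in range(50)]

for residual in (True, False):
    grads = run_chain(weights, x0=0.3, residual=residual)
    label = "with residual" if residual else "no residual"
    print(f"{label:>14}: block 0 = {grads[0]:.3e}, "
          f"block 25 = {grads[25]:.3e}, block 49 = {grads[49]:.3e}")
```

With the same weights, the residual chain multiplies factors between 0.8 and 1.2, while the plain chain multiplies factors no larger than 0.2, so the gradient at block 0 differs by many orders of magnitude between the two runs.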

Fix diff

def forward(self, x: Tensor) -> Tensor:
    h = self.fc2(self.fc1(self.norm(x)).gelu())
    return x + h                       # restored

Why this teaches the concept

He et al. (2015) showed that identity skip connections make 152-layer networks trainable; without them, a plain 56-layer network ended up with higher training error than a 20-layer one. The §A13 task only needs a 2-layer MLP, but Phase 10 forces you to break a 50-layer stack precisely so that you feel why depth needs residuals before Phase 17 stacks 6 transformer blocks (where each block has its own residual highway). The lesson here is the same one the Phase 17 lab relies on.


Next: Phase 11's /break on monolingual BPE training.