Lab 03 — Residual depth: 50-layer MLP, with and without residuals

Goal: see the gradient highway in action. A 50-layer MLP without residuals fails to train; the same architecture with residuals trains fine.

Estimated time: 90–120 minutes.

Prereq: labs 01 and 02 committed; theory/03-residuals.md read.

What you produce

A directory experiments/10-residual-depth/ containing:

train.py — training script with --use-residual flag.
losses.json — two trajectories.
grad_norms.json — gradient norm at layer 1 over training, for both variants.
loss_curves.png — two curves.
grad_norm.png — log-y plot of gradient norm at layer 1 over training.
manifest.json.
README.md.

The setup

A 50-layer MLP with hidden width 256, GeLU activation, Kaiming init (with GeLU gain), and Pre-LN RMSNorm, trained on the same toy data as labs 01–02.

Two runs:

No residual. Each block: x → RMSNorm → Linear → GeLU → Linear → next.
With residual. Each block: x → RMSNorm → Linear → GeLU → Linear → (add x) → next.

Otherwise identical: same init, same optimizer, same data, same seed.
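The two block variants can be sketched in plain NumPy as follows. This is a forward-only illustration, not the minigrad implementation: the names `rms_norm`, `gelu`, and `block` are hypothetical, the RMSNorm here has no learnable gain, GeLU uses the tanh approximation, and the 0.02 weight scale is a placeholder rather than the lab's init.

```python
import numpy as np


def rms_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm over the last axis, without a learnable gain (assumption)."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)


def gelu(x: np.ndarray) -> np.ndarray:
    """GeLU, tanh approximation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def block(x, W1, b1, W2, b2, use_residual: bool) -> np.ndarray:
    """x -> RMSNorm -> Linear -> GeLU -> Linear, optionally adding x back."""
    out = gelu(rms_norm(x) @ W1 + b1) @ W2 + b2
    return x + out if use_residual else out


rng = np.random.default_rng(0)
h = 256
x = rng.normal(size=(4, h))                      # batch of 4 toy inputs
W1 = rng.normal(scale=0.02, size=(h, 4 * h))     # placeholder scale
b1 = np.zeros(4 * h)
W2 = rng.normal(scale=0.02, size=(4 * h, h))
b2 = np.zeros(h)

y_plain = block(x, W1, b1, W2, b2, use_residual=False)
y_res = block(x, W1, b1, W2, b2, use_residual=True)
print(y_plain.shape, y_res.shape)
```

The only difference between the two outputs is the added `x`, which is the single change the lab is meant to isolate.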

Expected:

No residual: loss decreases for ~50 steps, then plateaus. Gradient norm at layer 1 drops below 1e-7 within ~100 steps (vanished).
With residual: smooth loss decrease throughout. Gradient norm at layer 1 stays in [1e-3, 1e-1].

This is the empirical check of theory 03's gradient-highway argument, reproduced on Borja's machine.
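The expected contrast can be previewed with a toy calculation before any training code exists. The sketch below is an assumption-laden simplification: it backpropagates a unit gradient through 50 random linear maps (no normalization, no GeLU), so it illustrates the mechanism only and does not predict the lab's actual numbers. The function name `backprop_norm` and the 0.3 weight scale are made up for the illustration.

```python
import numpy as np


def backprop_norm(depth: int, width: int, scale: float, residual: bool, seed: int = 0) -> float:
    """Norm of a unit gradient after backprop through `depth` random linear layers."""
    rng = np.random.default_rng(seed)
    g = np.ones(width) / np.sqrt(width)  # unit-norm upstream gradient
    for _ in range(depth):
        W = rng.normal(0.0, scale / np.sqrt(width), size=(width, width))
        # Without residual: g <- W^T g.
        # With residual:    g <- (I + W)^T g = g + W^T g  (the identity term is the "highway").
        g = W.T @ g + (g if residual else 0.0)
    return float(np.linalg.norm(g))


plain = backprop_norm(depth=50, width=256, scale=0.3, residual=False)
skip = backprop_norm(depth=50, width=256, scale=0.3, residual=True)
print(f"no residual:   {plain:.3e}")
print(f"with residual: {skip:.3e}")
```

With a contractive per-layer map, the plain product shrinks geometrically with depth, while the residual version always carries the identity term through.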

TODOs

Block A — implement the residual wrapper

Write src/minigrad/nn/residual.py with a Residual(f) module that computes y = x + f(x).
Gradcheck the residual.
Verify that Residual(lambda x: 0 * x) is exactly the identity.
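As a shape for the two checks above, here is a minimal forward-only stand-in written against NumPy rather than minigrad. The `Residual` class below is an assumption about the interface (a callable wrapping a function), and `numeric_jacobian` is a hypothetical helper; the real gradcheck should go through minigrad's own autograd.

```python
from typing import Callable

import numpy as np

Array = np.ndarray


class Residual:
    """Wraps f and computes y = x + f(x). Forward-only NumPy stand-in."""

    def __init__(self, f: Callable[[Array], Array]) -> None:
        self.f = f

    def forward(self, x: Array) -> Array:
        return x + self.f(x)

    __call__ = forward


def numeric_jacobian(fn: Callable[[Array], Array], x: Array, eps: float = 1e-6) -> Array:
    """Central-difference Jacobian of fn at a 1-D point x."""
    n = x.size
    J = np.zeros((fn(x).size, n))
    for i in range(n):
        d = np.zeros(n)
        d[i] = eps
        J[:, i] = (fn(x + d) - fn(x - d)) / (2 * eps)
    return J


rng = np.random.default_rng(0)
x = rng.normal(size=8)
A = rng.normal(size=(8, 8))

# Identity check: a zero inner function must return x bit-for-bit.
identity = Residual(lambda v: 0 * v)
assert np.array_equal(identity(x), x)

# Gradcheck: for f(x) = A x, the Jacobian of the wrapper should be I + A.
res = Residual(lambda v: A @ v)
J = numeric_jacobian(res, x)
assert np.allclose(J, np.eye(8) + A, atol=1e-6)
```

The `I + A` Jacobian is the property worth checking: the identity term must appear in the backward pass as well as the forward pass.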

Block B — build the 50-layer MLP

50 blocks. Each block: Pre-LN RMSNorm → Linear(h, 4h) → GeLU → Linear(4h, h).
Wrap each block in Residual for the with-residual variant; leave un-wrapped for the without-residual variant.
Kaiming init scaled for GeLU's effective gain (~1.7).
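One way to read the init bullet is std = gain / sqrt(fan_in). The sketch below assumes that fan-in convention, a normal (not uniform) distribution, and a `(fan_in, fan_out)` weight layout; `kaiming_normal` is a hypothetical helper, and which of the two Linears should receive the gain is left to the lab.

```python
import numpy as np

GELU_GAIN = 1.7  # approximate value quoted in this lab


def kaiming_normal(fan_in: int, fan_out: int, gain: float, rng: np.random.Generator) -> np.ndarray:
    """Weights of shape (fan_in, fan_out) drawn with std = gain / sqrt(fan_in)."""
    std = gain / np.sqrt(fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))


rng = np.random.default_rng(0)
h = 256
W = kaiming_normal(h, 4 * h, GELU_GAIN, rng)

target_std = GELU_GAIN / np.sqrt(h)  # 1.7 / 16
print(f"target std {target_std:.5f}, empirical std {W.std():.5f}")
```

Printing the empirical standard deviation next to the target is a cheap guard against a missing square root or a swapped fan-in and fan-out.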

Block C — gradient norm tracking

After loss.backward(), capture np.linalg.norm(model.layers[0].weight.grad) and log to grad_norms.json.
Do this for every step (or every 10 steps if too slow).
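A possible logging pattern is sketched below. The JSON layout (one list of `{"step", "grad_norm"}` records per variant) and the variant key `"no_residual"` are assumptions, not a format this lab prescribes, and the gradient is a synthetic placeholder standing in for the array read after `loss.backward()`.

```python
import json
import tempfile
from pathlib import Path

import numpy as np


def record_grad_norm(log: dict, variant: str, step: int, grad: np.ndarray, every: int = 10) -> None:
    """Append the L2 (Frobenius) norm of `grad` for `variant` every `every` steps."""
    if step % every == 0:
        # float(...) converts the NumPy scalar so json can serialize it.
        log.setdefault(variant, []).append(
            {"step": step, "grad_norm": float(np.linalg.norm(grad))}
        )


log: dict = {}
rng = np.random.default_rng(0)
for step in range(100):
    fake_grad = rng.normal(size=(256, 1024)) * 0.9**step  # placeholder gradient
    record_grad_norm(log, "no_residual", step, fake_grad)

# Round-trip through a temporary file to confirm the log is valid JSON.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "grad_norms.json"
    path.write_text(json.dumps(log, indent=2))
    reloaded = json.loads(path.read_text())

print(len(reloaded["no_residual"]), "records")
```

Storing the step alongside each norm keeps the plot's x-axis correct when logging every 10 steps instead of every step.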

Block D — two runs

Run without residual. Save losses + grad_norms.
Run with residual. Save losses + grad_norms.
Same seed across runs.

Block E — plot

Loss curves (2 lines).
Grad-norm-at-layer-1 plot, log-y axis, 2 lines.
Annotate the step at which without-residual's grad norm drops below 1e-7.

Block F — interpret

In README.md:

Did the without-residual run train at all? If yes, what is the slope of its loss over the last 200 steps?
At what step did its layer-1 gradient effectively vanish? Quote a number.
The with-residual run's gradient norm stays roughly in what range? Match against theory 03's prediction.
What would happen at 100 layers without residuals? (Predict; don't necessarily run.)
Suppose you initialized the last Linear in each block with weight = 0. Predict the initial output of the with-residual model. Why might this initialization be useful in very deep nets?
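The two numbers the first questions ask for (the vanishing step and the tail slope) can be extracted from the logs with small helpers like these. Both function names are hypothetical, the inputs are synthetic, and the slope is measured per logged entry, so it needs rescaling if losses are not logged every step.

```python
import numpy as np


def first_step_below(steps, norms, threshold=1e-7):
    """First logged step whose gradient norm is below `threshold`, or None."""
    for step, norm in zip(steps, norms):
        if norm < threshold:
            return step
    return None


def tail_slope(losses, window=200):
    """Least-squares slope (loss per logged entry) over the last `window` entries."""
    tail = np.asarray(losses[-window:], dtype=float)
    slope, _intercept = np.polyfit(np.arange(tail.size), tail, deg=1)
    return float(slope)


# Synthetic data: a geometrically decaying norm logged every 10 steps,
# one flat loss curve, and one linearly decreasing loss curve.
steps = list(range(0, 300, 10))
norms = [1e-2 * 0.8**s for s in steps]
flat_losses = [1.0] * 500
falling_losses = [3.0 - 0.01 * i for i in range(300)]

vanish_step = first_step_below(steps, norms)
flat = tail_slope(flat_losses)
falling = tail_slope(falling_losses)
print(vanish_step, flat, falling)
```

A tail slope indistinguishable from zero is a more defensible statement of "plateaued" than reading it off the plot.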

Constraints

Same data seed, same optimizer config, same architecture except for the Residual wrap.
Track grad norm for layer 1 only. Tracking all layers explodes the JSON.
Run single-threaded, with the CPU frequency governor set to performance.
mypy --strict on src/minigrad/nn/residual.py.

Stop conditions

Done when:

All seven files committed.
The grad-norm plot shows the without-residual run's gradient at layer 1 dropping below 1e-7 within ~100 steps.
The with-residual run trains smoothly to a low loss.
README answers all five Block F questions.

Pitfalls

Grad norm doesn't vanish without residuals. Could be that 50 layers isn't enough on your particular activation choice. Try 100. Or your norm placement is wrong — Pre-LN before the entire chain (not per-block) effectively keeps gradients alive.
Residual makes loss worse. Almost certainly an init bug — the residual block's inner output is dominating the identity. Reduce the inner-block init scale.
The two runs produce identical losses. Most likely the --use-residual flag isn't doing what you think. Add a print("Using residual:", use_residual) at the start.
The grad-norm JSON is enormous. Log every 10 steps instead of every step. The plot's still informative.
Memory blow-up at 50 layers. Hidden 256 × 4× expansion × 50 blocks is ~26M params (each block has two Linears of 256 × 1024 weights), so the float32 weights should take ~100 MiB. If you're OOM, you have a leak (e.g., not clearing intermediate computation graphs).
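The memory estimate can be sanity-checked with a few lines of arithmetic. The sketch assumes biases on both Linears, one learnable gain vector per RMSNorm, and float32 storage; it counts weights only, so gradients, optimizer state, and any input or output layers come on top. `mlp_param_count` is a hypothetical helper.

```python
def mlp_param_count(hidden: int, expansion: int, blocks: int,
                    bias: bool = True, norm_gain: bool = True) -> int:
    """Parameters in `blocks` blocks of Linear(h, e*h) -> Linear(e*h, h)."""
    inner = hidden * expansion
    per_block = hidden * inner + inner * hidden  # two weight matrices
    if bias:
        per_block += inner + hidden              # one bias vector per Linear
    if norm_gain:
        per_block += hidden                      # RMSNorm gain (assumption)
    return per_block * blocks


n_params = mlp_param_count(hidden=256, expansion=4, blocks=50)
mib_fp32 = n_params * 4 / 2**20  # 4 bytes per float32 parameter
print(f"{n_params:,} params, {mib_fp32:.1f} MiB in float32")
```

If measured memory is far above a small multiple of this figure, the leak explanation becomes the likelier one.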

Hint of last resort

If you are 90 minutes in and the residual isn't helping: write a 5-line test that constructs r = Residual(lambda x: -x) and verifies that r.forward(x) is exactly 0 in every element. If that fails, your residual wrapper is broken. If it passes, your training setup is the issue (e.g., the residual is added outside the block instead of inside).

When to consult `solutions/`

Only after all seven files are committed. Solution: solutions/03-residual-depth-ref.md (phase open).

Phase 10 lab sequence complete. Next phase: docs/phase-11-tokenization-bpe/.