03 - Residual Connections and the Gradient Highway

A residual connection is y = x + f(x): the input skips over the transformation. This buys two things: a deep network that actually trains (because the gradient has a highway that does not fade away) and a much smoother optimization landscape.

The 2015 idea that changed everything

In 2015, He et al. published ResNet. The headline number was a 152-layer convolutional network on ImageNet, at a time when networks of roughly 20 to 30 layers had been the practical ceiling. The one change: every block was wrapped in a residual connection:

\[ y = x + f(x) \]

where $f$ is the block's "weight path": typically two convs with a normalization and an activation. The "skip" (the bare $x$ term on the right) has no learnable parameters and costs nothing beyond one elementwise add. Yet it was the difference between trainable and untrainable at 100+ layers.

The reason has two halves:

1. Forward initial condition is identity. If $f$ is initialized small, $f(x) \approx 0$, so $y \approx x$. A stack of $L$ such blocks initially computes (approximately) the identity function. Optimization starts from a well-behaved function (not random noise) and improves it.
2. Backward gradient highway. The Jacobian decomposes:

\[ \frac{\partial y}{\partial x} = I + \frac{\partial f}{\partial x} \]

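The identity term always carries the gradient back. Even if $\partial f / \partial x$ has tiny eigenvalues, the gradient with respect to $x$ still contains the full gradient with respect to $y$ as an additive term. It can only shrink if $\partial f / \partial x$ actively cancels the identity (eigenvalues near $-1$), which is atypical for a small-initialized $f$. Vanishing gradients are mostly cured.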

Modern deep learning runs on these two halves.
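Both halves can be checked numerically on a toy block. The sketch below is illustrative only: the weight path `f(x) = tanh(W x)`, the 2x2 matrix `W`, and the finite-difference `jacobian` helper are assumptions made for the example, not part of ResNet.

```python
import math

def f(x, W):
    """Toy weight path f(x) = tanh(W x); W is a list of rows."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]

def residual_block(x, W):
    """y = x + f(x): the skip is a plain elementwise add."""
    return [xi + fi for xi, fi in zip(x, f(x, W))]

def jacobian(fn, x, eps=1e-6):
    """Central finite-difference Jacobian, J[i][j] = d fn(x)[i] / d x[j]."""
    n = len(x)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        y_plus, y_minus = fn(x_plus), fn(x_minus)
        for i in range(n):
            J[i][j] = (y_plus[i] - y_minus[i]) / (2 * eps)
    return J

W = [[0.01, -0.02], [0.03, 0.01]]  # small init, so f(x) is close to zero
x = [0.5, -1.0]

y = residual_block(x, W)
J_f = jacobian(lambda v: f(v, W), x)
J_y = jacobian(lambda v: residual_block(v, W), x)

# Largest entrywise gap between J_y and I + J_f (finite-difference noise only).
max_gap = max(
    abs(J_y[i][j] - ((1.0 if i == j else 0.0) + J_f[i][j]))
    for i in range(2) for j in range(2)
)
print("y - x   =", [yi - xi for yi, xi in zip(y, x)])
print("max gap =", max_gap)
```

With a small `W`, `y - x` stays close to zero (forward half), and the Jacobian of the block matches `I + J_f` up to finite-difference noise (backward half).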

Why the identity-init matters (Borja-friendly form)

Imagine you stack 100 layers without residuals. At init, $f_L(f_{L-1}(\dots f_1(x)))$ is a 100-fold composition of nearly-random maps. The output is essentially noise, unrelated to the input. The gradient either shrinks toward zero or blows up (and gets clipped or turns into NaN) on its way back through 100 layers, and training stalls or collapses.

Stack 100 residual blocks instead. At init with $f_l \approx 0$, the network output is approximately $x$ itself. The loss is whatever it would be for a 1-layer identity network — usually fine, definitely non-degenerate. The gradient is healthy. Training starts.

Then, slowly, each $f_l$ learns to add a small correction. The function $y(x)$ becomes $x + \sum_l (\text{small contributions})$. The network is parameterized as a residual to the identity, which is a much better starting parameterization than "an arbitrary 100-layer function."

This insight is sometimes phrased as "deep networks learn perturbations to the identity, not arbitrary functions." It's a useful mental model.
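The contrast at init can be simulated in a few lines. This is a minimal sketch under assumptions of my own choosing: width 16, `tanh` branches, unit-gain random weights for the plain stack, and a 0.01-gain init for the residual branches (standing in for "$f_l \approx 0$"). The two stacks deliberately use different init scales, matching the two scenarios described above.

```python
import math
import random

def random_layer(dim, std, rng):
    """A dim x dim weight matrix with i.i.d. Gaussian entries."""
    return [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(dim)]

def branch(W, h):
    """f_l(h) = tanh(W h)."""
    return [math.tanh(sum(w * hi for w, hi in zip(row, h))) for row in W]

def relative_distance(a, b):
    """||a - b|| / ||b||."""
    num = math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    return num / math.sqrt(sum(bi ** 2 for bi in b))

rng = random.Random(0)
dim, depth = 16, 100
x = [rng.gauss(0.0, 1.0) for _ in range(dim)]

# Plain stack: unit-gain random init, h <- f_l(h).
h_plain = x
for _ in range(depth):
    h_plain = branch(random_layer(dim, 1.0 / math.sqrt(dim), rng), h_plain)

# Residual stack: small init so each f_l is close to zero, h <- h + f_l(h).
h_res = x
for _ in range(depth):
    W = random_layer(dim, 0.01 / math.sqrt(dim), rng)
    h_res = [hi + fi for hi, fi in zip(h_res, branch(W, h_res))]

print("plain    relative distance to x:", relative_distance(h_plain, x))
print("residual relative distance to x:", relative_distance(h_res, x))
```

The residual output should stay close to `x`, while the plain output bears little relation to it.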

The Jacobian / gradient-highway argument formally

For a chain $y_L = y_{L-1} + f_L(y_{L-1})$, the total Jacobian is:

\[ \frac{\partial y_L}{\partial y_0} = \prod_{l=1}^{L} \left(I + \frac{\partial f_l}{\partial y_{l-1}}\right) \]

Expanding the product (formally, with later blocks multiplying on the left, as the chain rule dictates):

\[ = I + \sum_l \frac{\partial f_l}{\partial y_{l-1}} + \sum_{l_1 < l_2} \frac{\partial f_{l_2}}{\partial y_{l_2 - 1}} \cdot \frac{\partial f_{l_1}}{\partial y_{l_1 - 1}} + \dots \]

The first term is the identity — guaranteed nonzero, guaranteed full-rank. The remaining terms add corrections from the learned layers. Crucially: the gradient back to $y_0$ has at minimum a guaranteed unit-Jacobian path (the identity term), regardless of what the learned layers are doing.
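A quick numerical check of the expansion, as a sketch: the per-block Jacobians here are hypothetical 2x2 matrices with small random entries (the names `jacobians`, `scale`, and the helper functions are illustrative). When the branch Jacobians are small, the exact product should sit very close to the first-order part $I + \sum_l \partial f_l / \partial y_{l-1}$.

```python
import random

def matmul(A, B):
    """Plain square matrix product."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

rng = random.Random(0)
n, depth, scale = 2, 10, 0.01

# Hypothetical per-block Jacobians df_l/dy_{l-1} with small random entries.
jacobians = [[[rng.uniform(-scale, scale) for _ in range(n)] for _ in range(n)]
             for _ in range(depth)]

eye = identity(n)
total = eye                            # exact product of (I + J_l)
first_order = [row[:] for row in eye]  # I + sum_l J_l
for J in jacobians:
    block = [[eye[i][j] + J[i][j] for j in range(n)] for i in range(n)]
    total = matmul(block, total)       # later blocks multiply on the left
    for i in range(n):
        for j in range(n):
            first_order[i][j] += J[i][j]

# Size of everything beyond the first-order terms, entrywise.
higher_order = max(abs(total[i][j] - first_order[i][j])
                   for i in range(n) for j in range(n))
deviation = max(abs(total[i][j] - eye[i][j])
                for i in range(n) for j in range(n))
print("max |total - I|             =", deviation)
print("max |total - (I + sum J_l)| =", higher_order)
```

The second number is the contribution of all second- and higher-order products; for small branch Jacobians it is much smaller than the first-order correction.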

Contrast with a non-residual chain $y_L = f_L(f_{L-1}(\dots f_1(y_0)))$:

\[ \frac{\partial y_L}{\partial y_0} = \prod_l \frac{\partial f_l}{\partial y_{l-1}} \]

A product of $L$ Jacobians. If each has spectral norm at most $\rho < 1$ (common with saturating activations or small-scale initialization), the product's norm is at most $\rho^L$: it shrinks geometrically, so the gradient at $y_0$ is exponentially small in depth. This is the vanishing gradient problem, and it's why pre-ResNet deep nets needed careful tricks (orthogonal init, LSTM-style gating, etc.) to train.
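To put rough numbers on the contrast, here is a worked bound under two assumptions that are mine, not the text's: every plain-layer Jacobian has spectral norm at most $\rho$, and every residual-branch Jacobian has spectral norm at most $\varepsilon < 1$. The values $\rho = 0.9$, $\varepsilon = 0.01$, and $L = 50$ are illustrative only.

For the plain chain, submultiplicativity of the spectral norm gives:

\[ \left\| \prod_{l=1}^{L} \frac{\partial f_l}{\partial y_{l-1}} \right\|_2 \le \rho^L, \qquad \rho = 0.9,\ L = 50 \ \Rightarrow\ \rho^L \approx 5.2 \times 10^{-3} \]

For the residual chain, every singular value of $I + \partial f_l / \partial y_{l-1}$ lies in $[1 - \varepsilon,\ 1 + \varepsilon]$, so every singular value $\sigma_i$ of the product satisfies:

\[ (1 - \varepsilon)^L \le \sigma_i\!\left( \prod_{l=1}^{L} \left(I + \frac{\partial f_l}{\partial y_{l-1}}\right) \right) \le (1 + \varepsilon)^L, \qquad \varepsilon = 0.01,\ L = 50 \ \Rightarrow\ \sigma_i \in [0.605,\ 1.645] \]

Under these assumptions the plain chain attenuates the gradient by at least two orders of magnitude, while the residual chain keeps the gain in every direction within a factor of about 1.65 of unity. The residual bound does depend on the branch Jacobians being small: with $\varepsilon$ close to 1 the lower bound degrades.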

Pre-LN vs Post-LN: the residual location question

The transformer puts a LayerNorm in the residual block. Where?

Post-LN:

\[ y = \text{LN}(x + f(x)) \]

The norm wraps the whole residual output. Gradient:

\[ \frac{\partial y}{\partial x} = \frac{\partial \text{LN}}{\partial (x + f(x))} \cdot \left(I + \frac{\partial f}{\partial x}\right) \]

The LayerNorm Jacobian scales the whole residual contribution. The Jacobian of LayerNorm depends on the input's variance and is generally not close to identity — it can amplify or attenuate. Compounded over $L$ blocks, gradient norms drift.

Pre-LN:

\[ y = x + f(\text{LN}(x)) \]

The norm is inside the $f$ branch. Gradient:

\[ \frac{\partial y}{\partial x} = I + \frac{\partial (f \circ \text{LN})}{\partial x} \]

Identity is preserved unconditionally. The LayerNorm only affects the learned branch.
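The two placements can be written side by side. This is a minimal sketch, not a transformer block: `layer_norm` omits the learnable affine parameters, and `zero_branch` is a stand-in for a branch $f$ that currently outputs zero, used to expose what happens on the skip path alone.

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm without the learnable affine: zero mean, unit variance."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def post_ln_block(x, f):
    """y = LN(x + f(x)): the norm sits on the residual path."""
    return layer_norm([xi + fi for xi, fi in zip(x, f(x))])

def pre_ln_block(x, f):
    """y = x + f(LN(x)): the norm sits inside the branch only."""
    return [xi + fi for xi, fi in zip(x, f(layer_norm(x)))]

def zero_branch(h):
    """Stand-in for a branch that outputs zero."""
    return [0.0] * len(h)

x = [3.0, -1.0, 4.0, 10.0]
pre_out = pre_ln_block(x, zero_branch)    # x passes through untouched
post_out = post_ln_block(x, zero_branch)  # x is re-centered and rescaled
print("pre-LN :", pre_out)
print("post-LN:", post_out)
```

Even with a silent branch, the Post-LN block transforms its input, because the norm acts on the skip path. The Pre-LN block returns `x` unchanged.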

Empirical consequence. Original (Post-LN) transformers tend to diverge without learning-rate warmup. Pre-LN transformers train stably with little or no warmup. Most modern LLMs use Pre-LN. Use Pre-LN as the default in everything we build from Phase 17 onward.

(Exception: a few more exotic architectures from 2021 onward revisit the norm placement with extra tricks, such as Sandwich LN and DeepNorm. Out of scope.)

The γ-zero init trick

A subtle Pre-LN trick: if the block's output projection (weights and bias) is initialized to zero, or the affine scale $\gamma$ of a norm placed at the end of the branch is initialized to zero, the entire $f$ branch outputs zero at step 0. The network is exactly identity at init.

Some recipes (e.g., Fixup, ReZero) lean on a zero-initialized residual branch for extra-deep stability. Phase 17 may or may not; it depends on whether the deep models we train need it.
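A sketch of the zero-initialized output projection, under assumptions: the branch is a hypothetical two-layer MLP `W_out @ relu(W_in @ LN(x))` with biases omitted, and the names `W_in`, `W_out`, and `pre_ln_mlp_block` are illustrative. With `W_out` set to zero, a stack of such blocks should return its input exactly.

```python
import math
import random

def layer_norm(x, eps=1e-5):
    """LayerNorm without the learnable affine."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def matvec(W, v):
    return [sum(w * vi for w, vi in zip(row, v)) for row in W]

def pre_ln_mlp_block(x, W_in, W_out):
    """y = x + W_out relu(W_in LN(x)); biases omitted for brevity."""
    hidden = [max(0.0, z) for z in matvec(W_in, layer_norm(x))]
    return [xi + fi for xi, fi in zip(x, matvec(W_out, hidden))]

rng = random.Random(0)
dim, hidden_dim, depth = 4, 8, 12

# Input projection gets an ordinary random init.
W_in = [[rng.gauss(0.0, 0.5) for _ in range(dim)] for _ in range(hidden_dim)]
# Output projection is zero-initialized, so the branch contributes nothing.
W_out = [[0.0] * hidden_dim for _ in range(dim)]

x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
y = x
for _ in range(depth):
    y = pre_ln_mlp_block(y, W_in, W_out)

print("stack is exactly identity at init:", y == x)
```

One consequence of this design: at step 0 the gradient with respect to `W_in` is zero (it flows back through `W_out`), while `W_out` itself receives a nonzero gradient, so the branch starts contributing after the first update.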

Diagram

           Pre-LN block                            Post-LN block
           ────────────                            ─────────────
              x                                       x
              │                                       │
              ├──────► LN ──► f ──► +  ─► y           │
              │                     ▲                 ├──────►  f  ─► +  ─►  LN  ─► y
              │                     │                 │              ▲
              └─────────────────────┘                 └──────────────┘
                  (residual)                              (residual)

The two diagrams look almost identical, and the algebra differs only in where LN sits. The training behavior differs sharply: Post-LN typically needs warmup to avoid divergence, while Pre-LN usually trains without it.

What happens if you remove residuals?

Lab 03 will run this:

1. Train a 50-layer MLP without residuals. Plot the gradient norm at layer 1 over training.
2. Train the same 50-layer MLP with residuals. Same plot.

Expected: without-residuals shows gradient norm at layer 1 dropping below $10^{-7}$ in the first 100 steps (vanishing). With-residuals stays in $[10^{-3}, 10^{-1}]$ throughout.

This is the empirical check of the gradient-highway argument on Borja's machine.
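Before running the lab, the effect can be previewed at initialization with hand-written backprop. This is not Lab 03 itself: it is a toy sketch with no training, and the width (16), the `tanh` activation, the 0.5 init gain, and the loss (sum of the outputs) are all assumptions picked for illustration. The same weights are used for both variants, so the only difference is the skip.

```python
import math
import random

def matvec(W, v):
    return [sum(w * vi for w, vi in zip(row, v)) for row in W]

def matvec_transposed(W, v):
    """Compute W^T v."""
    return [sum(W[i][j] * v[i] for i in range(len(W))) for j in range(len(W[0]))]

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

def layer1_weight_grad_norm(weights, x, residual):
    """Frobenius norm of dL/dW_1 for L = sum(final activations)."""
    # Forward pass, caching each layer's input and activation a = tanh(W h).
    h, cache = x, []
    for W in weights:
        a = [math.tanh(z) for z in matvec(W, h)]
        cache.append((h, a))
        h = [hi + ai for hi, ai in zip(h, a)] if residual else a
    # Backward pass: g holds dL/d(output) of the layer being processed.
    g = [1.0] * len(x)
    for l in range(len(weights) - 1, 0, -1):
        _, a = cache[l]
        delta = [gi * (1.0 - ai * ai) for gi, ai in zip(g, a)]  # dL/dz_l
        back = matvec_transposed(weights[l], delta)             # through W_l
        g = [gi + bi for gi, bi in zip(g, back)] if residual else back
    h_in, a = cache[0]
    delta = [gi * (1.0 - ai * ai) for gi, ai in zip(g, a)]
    # dL/dW_1 is the outer product delta h_in^T; its Frobenius norm factorizes.
    return norm(delta) * norm(h_in)

rng = random.Random(0)
dim, depth, gain = 16, 50, 0.5  # illustrative values, not Lab 03's settings
std = gain / math.sqrt(dim)
weights = [[[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(dim)]
           for _ in range(depth)]
x = [rng.gauss(0.0, 1.0) for _ in range(dim)]

grad_plain = layer1_weight_grad_norm(weights, x, residual=False)
grad_res = layer1_weight_grad_norm(weights, x, residual=True)
print("layer-1 grad norm, plain   :", grad_plain)
print("layer-1 grad norm, residual:", grad_res)
```

The plain stack's layer-1 gradient should come out many orders of magnitude smaller than the residual stack's. The absolute values depend on the assumed gain and loss and are not comparable to the ranges quoted for the lab.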

Non-residual deep nets: what tricks compensated?

Before ResNet, people built deep nets via:

- Auxiliary losses at intermediate layers (Inception-v1 added two side-classifiers). These force gradient signal into early layers without residuals.
- Highway networks (Srivastava et al. 2015): explicit gating, $y = T(x) \cdot f(x) + (1 - T(x)) \cdot x$, a learnable interpolation between identity and transform. A residual block is the same idea with the gates removed: the transform and carry coefficients are both fixed to 1, giving $y = f(x) + x$. Note that setting $T = 1$ in the coupled form above would instead give $y = f(x)$ with no skip at all. Simpler, same benefit.
- Careful orthogonal init to keep the Jacobian spectrum near unity. Works to ~30 layers; doesn't scale.

After ResNet, residuals largely displaced all of these. The residual form is simpler and, in practice, at least as effective.
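The relationship between the highway layer and the residual block can be made concrete. This sketch assumes the general highway form with separate transform and carry gates, $y = T(x) \cdot f(x) + C(x) \cdot x$, of which $C = 1 - T$ is the coupled special case. The function names, the constant gate `ones`, and the stand-in transform `f` are illustrative.

```python
def highway(x, f, transform_gate, carry_gate=None):
    """y = T(x) * f(x) + C(x) * x, elementwise. C defaults to 1 - T (coupled)."""
    fx, t = f(x), transform_gate(x)
    c = [1.0 - ti for ti in t] if carry_gate is None else carry_gate(x)
    return [ti * fi + ci * xi for ti, fi, ci, xi in zip(t, fx, c, x)]

def residual(x, f):
    """y = x + f(x): no gates at all."""
    return [xi + fi for xi, fi in zip(x, f(x))]

def f(x):
    """Stand-in transform; any elementwise-compatible function would do."""
    return [0.1 * xi * xi for xi in x]

def ones(x):
    """A gate that is constantly 1 (real highway gates are learned sigmoids)."""
    return [1.0] * len(x)

x = [1.0, -2.0, 3.0]
coupled_t1 = highway(x, f, ones)        # T = 1, C = 0: the skip is switched off
uncoupled = highway(x, f, ones, ones)   # T = 1, C = 1: the skip is always on

print("coupled, T = 1     :", coupled_t1)
print("uncoupled, T = C = 1:", uncoupled)
print("residual           :", residual(x, f))
```

With the coupled gate, $T = 1$ reduces the layer to the bare transform $f(x)$. The residual block corresponds to fixing both gates to 1, which is the same as deleting them.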

A subtle point: residual is not a "shortcut"

A common misreading: "residual is a shortcut around the layer; the network learns to use the shortcut when it wants to." Wrong. The residual is always on, not gated. The block is parameterized as $y = x + f(x)$ unconditionally. The network can learn $f(x) = 0$ (effectively bypass the block) or $f(x) =$ something nontrivial; either is fine.

The Highway network is gated; the Residual block is not. The simplification (drop the gate) cost nothing in practice and made very deep stacks easier to train.

Where this matters in the curriculum

- Phase 15 (attention): attention is wrapped in a residual block.
- Phase 17 (transformer block): Pre-LN residual is the canonical block.
- Phase 21 (training dynamics): the loss landscape under residuals is visibly smoother and better conditioned. The evidence in Li et al. 2018 is empirical (loss-surface visualizations), not a proof.
- Phase 24 (Triton / framework): PyTorch's residual blocks are 3 lines of code, but you'll understand the math behind them.

Drill problems

Solutions in solutions/03-residuals-ref.md (phase open).

1. For a 30-block residual stack with each $f_l$'s Jacobian having spectral norm at most $\rho < 1$, derive an upper bound on the deviation of the total Jacobian from the identity. (Hint: expand the product, bound each term using submultiplicativity of the norm, and sum with the binomial theorem.)
2. Sketch why a residual block with $f(x) = -x$ would be pathological. (Hint: what does $y = x + (-x)$ compute? What does the gradient look like?)
3. In Pre-LN with $f$ initialized so $f \approx 0$, the loss landscape near init is approximately that of a 1-layer network applied to the input. Argue why this is better than the loss landscape of a 30-layer non-residual network at init.

One-paragraph recap

A residual connection $y = x + f(x)$ does two things: it makes the network's initial condition (approximately) the identity, which is easy to optimize from, and it gives the gradient an always-present identity path back to early layers, which largely prevents vanishing. Pre-LN ($y = x + f(\text{LN}(x))$) is the modern default because the norm doesn't sit on the residual path; in Post-LN ($y = \text{LN}(x + f(x))$) it does, and the LayerNorm Jacobian compounds over depth. Without residuals, a plain 50-layer MLP is very hard to train; with residuals, it trains readily.

Next: theory/04-putting-it-together.md.