03 — Residual Connections and the Gradient Highway

A residual connection is y = x + f(x): the input skips over the transformation. This buys two things: a deep network that actually trains (because the gradient has a highway that does not evaporate) and a much smoother optimization landscape.


The 2015 idea that changed everything

In 2015, He et al. published ResNet. The headline number was a 152-layer convolutional network on ImageNet, at a time when roughly 20 to 30 layers had been the practical ceiling. The one change: every block was wrapped in a residual connection:

\[ y = x + f(x) \]

where \(f\) is the block's "weight path", typically two convs with a normalization and an activation. The "skip" (the bare \(x\) term on the right) does no work of its own: it has no learnable parameters and costs nothing beyond one elementwise add. Yet it was the difference between trainable and untrainable at 100+ layers.

The reason has two halves:

  1. Forward initial condition is identity. If \(f\) is initialized small, \(f(x) \approx 0\), so \(y \approx x\). Stacking \(L\) blocks initially computes the identity function. Optimization starts from a good function (not random noise) and improves it.
  2. Backward gradient highway. The Jacobian decomposes:
\[ \frac{\partial y}{\partial x} = I + \frac{\partial f}{\partial x} \]

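The identity term always carries the gradient back. Even if \(\partial f / \partial x\) has tiny singular values, the gradient with respect to \(x\) still contains the gradient with respect to \(y\) as an additive term, so it cannot vanish merely because the learned branch is weak. Vanishing gradients are mostly cured.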

Modern deep learning runs on these two halves.
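Both halves can be checked numerically on a toy block. The following is a minimal sketch, not the course's reference code: the branch `f` (two small linear maps around a tanh), the `1e-3` init scale, and the helper `jacobian_fd` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# Hypothetical weight path f: two linear maps with a tanh in between.
# The 1e-3 scale stands in for "f is initialized small".
W1 = 1e-3 * rng.standard_normal((d, d))
W2 = 1e-3 * rng.standard_normal((d, d))

def f(x):
    return W2 @ np.tanh(W1 @ x)

def residual_block(x):
    return x + f(x)

def jacobian_fd(fn, x, eps=1e-6):
    """Central finite-difference Jacobian; column i is d fn / d x_i."""
    cols = []
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        cols.append((fn(x + e) - fn(x - e)) / (2 * eps))
    return np.stack(cols, axis=1)

x = rng.standard_normal(d)
y = residual_block(x)

# Half 1: with a small branch, the block is close to the identity map.
forward_gap = np.linalg.norm(y - x) / np.linalg.norm(x)

# Half 2: the block's Jacobian is I plus the branch's Jacobian.
J_block = jacobian_fd(residual_block, x)
J_f = jacobian_fd(f, x)
jacobian_gap = np.abs(J_block - (np.eye(d) + J_f)).max()

print(f"relative forward gap: {forward_gap:.2e}")
print(f"max |J_block - (I + J_f)|: {jacobian_gap:.2e}")
```

Both printed numbers should be tiny: the first because the branch is small, the second because the decomposition holds for any branch, small or not.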

Why the identity-init matters (Borja-friendly form)

Imagine you stack 100 layers without residuals. At init, \(f_L(f_{L-1}(\dots f_1(x)))\) is a 100-fold composition of nearly-random maps. The output is essentially noise, unrelated to the input. The gradient reaching the early layers is either exponentially small or exponentially large (and then gets clipped or turns into NaN). Either way, training stalls or collapses.

Stack 100 residual blocks instead. At init with \(f_l \approx 0\), the network output is approximately \(x\) itself. The loss is whatever it would be for a 1-layer identity network — usually fine, definitely non-degenerate. The gradient is healthy. Training starts.

Then, slowly, each \(f_l\) learns to add a small correction. The function \(y(x)\) becomes \(x + \sum_l (\text{small contributions})\). The network is parameterized as a residual to the identity, which is a much better starting parameterization than "an arbitrary 100-layer function."

This insight is sometimes phrased as "deep networks learn perturbations to the identity, not arbitrary functions." It's a useful mental model.
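The contrast at init can be sketched in a few lines of NumPy. This is an illustrative toy, assuming tanh layers, a width of 64, and a hypothetical `branch_scale` of `1e-2` to represent "small init"; none of these choices come from the course labs.

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 64, 100
branch_scale = 1e-2  # assumed "small init" for the residual branches

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

x = rng.standard_normal(d)
h_plain = x.copy()
h_res = x.copy()

for _ in range(depth):
    W = rng.standard_normal((d, d)) / np.sqrt(d)
    # Plain stack: each layer replaces the signal with a random map of it.
    h_plain = np.tanh(W @ h_plain)
    # Residual stack: each layer adds a small correction to the signal.
    h_res = h_res + np.tanh(branch_scale * W @ h_res)

cos_plain = cosine(h_plain, x)
cos_res = cosine(h_res, x)
rel_change_res = np.linalg.norm(h_res - x) / np.linalg.norm(x)

print(f"plain stack:    cosine(output, input) = {cos_plain:+.3f}")
print(f"residual stack: cosine(output, input) = {cos_res:+.3f}")
print(f"residual stack: relative change       = {rel_change_res:.3f}")
```

The plain stack's output should carry almost no trace of the input direction, while the residual stack's output should still be nearly the input, which is the "perturbation to the identity" picture in miniature.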

The Jacobian / gradient-highway argument formally

For a chain \(y_L = y_{L-1} + f_L(y_{L-1})\), the total Jacobian is:

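\[ \frac{\partial y_L}{\partial y_0} = \left(I + \frac{\partial f_L}{\partial y_{L-1}}\right) \left(I + \frac{\partial f_{L-1}}{\partial y_{L-2}}\right) \cdots \left(I + \frac{\partial f_1}{\partial y_0}\right) \]

The factors are matrices, so order matters: later layers sit on the left. Expanding the product (formally):

\[ = I + \sum_l \frac{\partial f_l}{\partial y_{l-1}} + \sum_{l_1 < l_2} \frac{\partial f_{l_2}}{\partial y_{l_2 - 1}} \cdot \frac{\partial f_{l_1}}{\partial y_{l_1 - 1}} + \dots \]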

The first term is the identity — guaranteed nonzero, guaranteed full-rank. The remaining terms add corrections from the learned layers. Crucially: the gradient back to \(y_0\) has at minimum a guaranteed unit-Jacobian path (the identity term), regardless of what the learned layers are doing.
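To make the expansion concrete, here is the two-block case written out. This is a small worked illustration, abbreviating \(J_l = \partial f_l / \partial y_{l-1}\) (a shorthand introduced only for this example):

\[ \frac{\partial y_2}{\partial y_0} = (I + J_2)(I + J_1) = I + J_1 + J_2 + J_2 J_1 \]

The chain rule puts the later block on the left. With \(L\) blocks the expansion has \(2^L\) terms, one per subset of blocks, and exactly one of them (the empty subset) is the bare identity.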

Contrast with a non-residual chain \(y_L = f_L(f_{L-1}(\dots f_1(y_0)))\):

\[ \frac{\partial y_L}{\partial y_0} = \prod_l \frac{\partial f_l}{\partial y_{l-1}} \]

A product of \(L\) Jacobians. If each has spectral norm at most \(\rho < 1\) (common with saturating activations or small weights), the product's norm is at most \(\rho^L\): the gradient at \(y_0\) is exponentially small in depth. This is the vanishing gradient problem, and it's why pre-ResNet deep nets needed careful tricks (orthogonal init, gated architectures such as LSTMs, etc.) to train.
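The two products can be compared directly with random stand-in Jacobians. The sketch below assumes each per-layer Jacobian is a random Gaussian matrix rescaled to spectral norm exactly `rho`; real networks have structured Jacobians, so treat the numbers as qualitative only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth, rho = 16, 50, 0.5

def random_jacobian(rho):
    """Random d x d matrix rescaled so its spectral norm is exactly rho."""
    J = rng.standard_normal((d, d))
    return rho * J / np.linalg.norm(J, 2)

jacobians = [random_jacobian(rho) for _ in range(depth)]

plain = np.eye(d)
residual = np.eye(d)
for J in jacobians:
    # Later layers multiply on the left, as in the chain rule.
    plain = J @ plain
    residual = (np.eye(d) + J) @ residual

plain_norm = np.linalg.norm(plain, 2)
residual_norm = np.linalg.norm(residual, 2)

print(f"plain chain:    ||prod J_l||       = {plain_norm:.3e}")
print(f"residual chain: ||prod (I + J_l)|| = {residual_norm:.3e}")
print(f"bound rho**L for the plain chain   = {rho**depth:.3e}")
```

The plain product is bounded above by `rho**depth`, while the residual product keeps a norm many orders of magnitude larger because every factor contains the identity.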

Pre-LN vs Post-LN — the residual location question

The transformer puts a LayerNorm in the residual block. Where?

Post-LN:

\[ y = \text{LN}(x + f(x)) \]

The norm wraps the whole residual output. Gradient:

\[ \frac{\partial y}{\partial x} = \frac{\partial \text{LN}}{\partial (x + f(x))} \cdot \left(I + \frac{\partial f}{\partial x}\right) \]

The LayerNorm Jacobian multiplies the whole expression, identity term included, so no path bypasses it. The Jacobian of LayerNorm depends on the input's variance and is generally not close to identity: it can amplify or attenuate. Compounded over \(L\) blocks, gradient norms drift.

Pre-LN:

\[ y = x + f(\text{LN}(x)) \]

The norm is inside the \(f\) branch. Gradient:

\[ \frac{\partial y}{\partial x} = I + \frac{\partial (f \circ \text{LN})}{\partial x} \]

Identity is preserved unconditionally. The LayerNorm only affects the learned branch.
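One way to see the difference is to switch the learned branch off entirely and look at what is left of each block's Jacobian. The sketch below assumes a LayerNorm without affine parameters and uses a hypothetical zero branch `f_off`; the helper `jacobian_fd` is a finite-difference approximation, not an autograd call.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

def layer_norm(x, eps=1e-5):
    """LayerNorm over the feature axis, without the affine gamma/beta."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def pre_ln(x, f):
    return x + f(layer_norm(x))

def post_ln(x, f):
    return layer_norm(x + f(x))

def jacobian_fd(fn, x, eps=1e-6):
    """Central finite-difference Jacobian; column i is d fn / d x_i."""
    cols = []
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        cols.append((fn(x + e) - fn(x - e)) / (2 * eps))
    return np.stack(cols, axis=1)

# Hypothetical branch that outputs zero, to isolate the skip path.
f_off = lambda v: np.zeros_like(v)

x = 3.0 * rng.standard_normal(d)
J_pre = jacobian_fd(lambda v: pre_ln(v, f_off), x)
J_post = jacobian_fd(lambda v: post_ln(v, f_off), x)

pre_gap = np.abs(J_pre - np.eye(d)).max()
post_gap = np.abs(J_post - np.eye(d)).max()
post_rank = np.linalg.matrix_rank(J_post, tol=1e-4)

print(f"Pre-LN:  max |J - I| = {pre_gap:.2e}")
print(f"Post-LN: max |J - I| = {post_gap:.2e}, rank = {post_rank} of {d}")
```

With the branch off, the Pre-LN block is the identity, while the Post-LN block reduces to LayerNorm itself, whose Jacobian rescales by the inverse standard deviation and discards two directions (the mean shift and the overall scale of the input).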

Empirical consequence. Original (Post-LN) transformers tend to diverge without learning-rate warmup. Pre-LN transformers train stably with little or no warmup. Most modern LLMs are Pre-LN. Use Pre-LN as the default in everything we build from Phase 17 onward.

(Exception: a few exotic architectures revisit the norm placement with extra tricks, such as Sandwich LN and DeepNorm. Out of scope.)

The γ-zero init trick

A subtle Pre-LN trick: if the block's output projection is initialized to zero (or, where the branch ends in a normalization, that norm's affine scale \(\gamma\) is initialized to zero), the entire \(f\) branch outputs zero at step 0. The network is exactly identity at init.

Some recipes (e.g., Fixup, ReZero) lean on this kind of zero init for extra-deep stability. Phase 17 may or may not use it, depending on whether the deep models we train need it.
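A minimal sketch of the zero-initialized output projection, assuming a Pre-LN block whose branch is a two-layer ReLU MLP. The parameter names (`W_in`, `W_out`, and so on) and the `params` dictionary are hypothetical, chosen only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, hidden = 8, 32

def layer_norm(x, gamma, beta, eps=1e-5):
    return gamma * (x - x.mean()) / np.sqrt(x.var() + eps) + beta

# Hypothetical Pre-LN MLP block parameters.
params = {
    "gamma": np.ones(d),
    "beta": np.zeros(d),
    "W_in": rng.standard_normal((hidden, d)) / np.sqrt(d),
    "b_in": np.zeros(hidden),
    "W_out": np.zeros((d, hidden)),  # zero-initialized output projection
    "b_out": np.zeros(d),
}

def pre_ln_block(x, p):
    h = layer_norm(x, p["gamma"], p["beta"])
    h = np.maximum(0.0, p["W_in"] @ h + p["b_in"])  # ReLU
    return x + p["W_out"] @ h + p["b_out"]

x = rng.standard_normal(d)
y = pre_ln_block(x, params)

print("block is exactly identity at init:", np.array_equal(y, x))
```

The branch is silent in the forward pass but not frozen: the gradient with respect to `W_out` is the outer product of the upstream gradient and the hidden activations, which is generally nonzero, so the branch can move away from zero on the first update.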

Diagram

           Pre-LN block                            Post-LN block
           ────────────                            ─────────────
              x                                       x
              │                                       │
              ├──────► LN ──► f ──► +  ─► y           │
              │                     ▲                 ├──────►  f  ─► +  ─►  LN  ─► y
              │                     │                 │              ▲
              └─────────────────────┘                 └──────────────┘
                  (residual)                              (residual)

The two diagrams look almost identical, and the algebra differs only in where LN sits. The training dynamics differ substantially: deep Post-LN stacks are far more sensitive to warmup and learning rate.

What happens if you remove residuals?

Lab 03 will run this:

  1. Train a 50-layer MLP without residuals. Plot the gradient norm at layer 1 over training.
  2. Train the same 50-layer MLP with residuals. Same plot.

Expected: without residuals, the gradient norm at layer 1 drops below \(10^{-7}\) within the first 100 steps (vanishing). With residuals, it stays in \([10^{-3}, 10^{-1}]\) throughout.

This is the empirical check of the gradient-highway argument on Borja's machine.
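Before the lab, the effect can be previewed at initialization with a hand-written backward pass. This is a rough sketch, not Lab 03 itself: the tanh layers, the width of 32, the weight scale `w_scale=0.5`, and the squared-error loss against a random target are all assumptions, and it measures a single gradient at init rather than a training curve, so the values will not match the ranges quoted above.

```python
import numpy as np

def first_layer_grad_norm(residual, depth=50, d=32, w_scale=0.5, seed=0):
    """Norm of dLoss/dW_1 at init for a toy tanh MLP.

    Each layer is h <- tanh(W h) or, with residual=True, h <- h + tanh(W h).
    The loss 0.5 * ||h_final - target||^2 is a stand-in for illustration.
    """
    rng = np.random.default_rng(seed)
    Ws = [w_scale * rng.standard_normal((d, d)) / np.sqrt(d)
          for _ in range(depth)]
    h = rng.standard_normal(d)
    target = rng.standard_normal(d)

    # Forward pass, caching each layer's input and tanh output.
    cache = []
    for W in Ws:
        a = np.tanh(W @ h)
        cache.append((h, a))
        h = h + a if residual else a

    # Backward pass: g holds dLoss/d(output of the current layer).
    g = h - target
    grad_W = None
    for W, (h_in, a) in zip(reversed(Ws), reversed(cache)):
        delta = g * (1.0 - a**2)        # back through tanh
        grad_W = np.outer(delta, h_in)  # ends as the layer-1 gradient
        g_branch = W.T @ delta
        g = g + g_branch if residual else g_branch

    return float(np.linalg.norm(grad_W))

plain = first_layer_grad_norm(residual=False)
with_skip = first_layer_grad_norm(residual=True)
print(f"layer-1 grad norm without residuals: {plain:.3e}")
print(f"layer-1 grad norm with residuals:    {with_skip:.3e}")
```

The only difference between the two runs is the `+ h` in the forward pass and the matching `+ g` in the backward pass, which is the gradient highway in code form.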

Non-residual deep nets — what tricks compensated?

Before ResNet, people built deep nets via:

  • Auxiliary losses at intermediate layers (Inception-v1 added two side-classifiers). Forces gradient signal into early layers without residuals.
  • Highway networks (Srivastava et al. 2015): explicit gating, \(y = T(x) \cdot f(x) + (1 - T(x)) \cdot x\). A learnable interpolation between identity and transform. A residual block is the ungated version: drop \(T\) and fix both coefficients to 1, giving \(y = f(x) + x\). Simpler, same benefit.
  • Careful orthogonal init to keep the Jacobian spectrum near unity. Works to ~30 layers; doesn't scale.

After ResNet, all of these were largely displaced by residuals. Residuals are simpler and, in practice, stronger.

A subtle point: residual is not a "shortcut"

A common misreading: "residual is a shortcut around the layer; the network learns to use the shortcut when it wants to." Wrong. The residual is always on, not gated. The block is parameterized as \(y = x + f(x)\) unconditionally. The network can learn \(f(x) = 0\) (effectively bypass the block) or \(f(x) =\) something nontrivial; either is fine.

The Highway network is gated; the Residual block is not. The simplification (drop the gate) was a strict improvement.
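The gated and ungated forms can be put side by side in code. The sketch below assumes a tanh transform `f`, a sigmoid gate with weights `W_t`, and a negative gate bias `b_t` so the layer starts close to carrying its input; these specifics are illustrative choices rather than a reproduction of either paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_f = rng.standard_normal((d, d)) / np.sqrt(d)
W_t = rng.standard_normal((d, d)) / np.sqrt(d)
b_t = -2.0 * np.ones(d)  # assumed negative bias: gate starts mostly closed

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def f(x):
    return np.tanh(W_f @ x)

def highway_layer(x):
    # Input-dependent gate T in (0, 1) interpolates between f(x) and x.
    T = sigmoid(W_t @ x + b_t)
    return T * f(x) + (1.0 - T) * x

def residual_layer(x):
    # No gate: the input always passes through with coefficient 1.
    return x + f(x)

x = rng.standard_normal(d)
T = sigmoid(W_t @ x + b_t)
fx = f(x)
y_highway = highway_layer(x)
y_residual = residual_layer(x)

print("gate values T:          ", np.round(T, 3))
print("highway carry weights:  ", np.round(1.0 - T, 3))
print("residual carry weights: ", np.ones(d))
```

In the highway layer the weight on `x` depends on the input and is always below 1; in the residual layer it is the constant 1, which is why the skip path cannot be "turned off" by anything the block learns.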

Where this matters in the curriculum

  • Phase 15 (attention): attention is wrapped in a residual block.
  • Phase 17 (transformer block): Pre-LN residual is the canonical block.
  • Phase 21 (training dynamics): the loss landscape under residuals is empirically much smoother (see the visualizations in Li et al. 2018).
  • Phase 24 (Triton / framework): PyTorch's residual blocks are 3 lines of code — but you'll understand the math behind them.

Drill problems

Solutions in solutions/03-residuals-ref.md (phase open).

  1. For a 30-block residual stack with each \(f_l\)'s Jacobian having spectral norm at most \(\rho < 1\), derive an upper bound on the deviation of the total Jacobian from the identity. (Hint: expand the product, bound each term by a power of \(\rho\), and sum with the binomial theorem.)
  2. Sketch why a residual block with \(f(x) = -x\) would be pathological. (Hint: what does \(y = x + (-x)\) compute? What does the gradient look like?)
  3. In Pre-LN with \(f\) initialized so \(f \approx 0\), the loss landscape near init is approximately that of a 1-layer network applied to the input. Argue why this is better than the loss landscape of a 30-layer non-residual network at init.

One-paragraph recap

A residual connection \(y = x + f(x)\) does two things: it makes the network's initial condition the identity (which is easy to optimize from), and it gives the gradient an unconditional unit-gain path back to early layers (no vanishing). Pre-LN (\(y = x + f(\text{LN}(x))\)) is the modern default because the norm doesn't sit on the residual path; Post-LN (\(y = \text{LN}(x + f(x))\)) does, and compounds Jacobian drift over depth. Without residuals, a plain 50-layer MLP is very hard to train; with residuals, it trains readily.


Next: theory/04-putting-it-together.md.