English · Español

02 — The chain rule and the gradient of softmax + cross-entropy¶

🇪🇸 La regla de la cadena para funciones vectoriales es producto de Jacobianos. Aplicada a softmax + cross-entropy, colapsa en una línea: ∂CE/∂x = softmax(x) - one_hot(y). Esta página deriva ambos.

This is the theory page of Phase 4. Re-derive everything below from a blank page until it's mechanical.

Single-variable chain rule¶

For h(x) = f(g(x)):

\[ h'(x) = f'(g(x)) \cdot g'(x) \]

The derivative of the outer function evaluated at the inner, times the derivative of the inner. Memorize.

Worked: h(x) = sin(x²). With f(u) = sin(u), g(x) = x²:

\[ h'(x) = \cos(x^2) \cdot 2x \]

Multivariate chain rule (the version backprop uses)¶

For f: R^k → R^m and g: R^n → R^k, let h = f ∘ g: R^n → R^m. Then:

\[ J_h(x) = J_f(g(x)) \cdot J_g(x) \]

Matrix product of Jacobians. Shapes: (m × n) = (m × k) · (k × n). The inner dimension k matches by construction (output of g = input of f).

For a scalar loss L = f(g(x)) with f: R^k → R and g: R^n → R^k:

J_f = ∇_y L^T is a 1 × k row vector (transpose of the gradient).
J_g is k × n.
J_h = ∇_x L^T = ∇_y L^T · J_g is 1 × n.

Transposing to get the gradient as a column:

\[ \nabla_x L = J_g^T \cdot \nabla_y L \]

This is the formula backprop uses. Read it as: "to compute the gradient w.r.t. the input of g, multiply the Jacobian of g (transposed) by the gradient w.r.t. the output of g."

Apply layer by layer:

\[ \nabla_x L = J_{L_1}^T \cdot J_{L_2}^T \cdots J_{L_n}^T \cdot \nabla_{y_n} L \]

reading right to left. Start with the gradient at the output (∇_y L, usually trivial — e.g., for L = scalar, ∇_y L = 1). At each layer, multiply by the layer's transposed Jacobian. Continue back to the input.

This walk is what every autograd framework implements. Phase 7 will code it directly.

Why right-to-left?¶

Two equivalent ways to compute the gradient:

Forward-mode (left-to-right): propagate Jacobians forward, multiplying (m × k) × (k × n) at each step. For deep networks with m = 1 (scalar loss) but n huge (millions of parameters), each step is a big matrix-matrix product.
Reverse-mode (right-to-left): start with the vector ∇_y L: (m,) = (1,) and propagate it backward. At each step, you only multiply matrix × vector, never matrix × matrix.

For deep networks where m = 1 (one loss) and n huge (model parameters), reverse mode is vastly cheaper: O(parameters) per backward pass vs O(parameters²) for forward mode.

Conversely, if m is huge (many outputs) and n small (few inputs), forward mode wins.

For ML: always reverse-mode. This is the reason "backpropagation" is the name — you propagate gradients backward through the graph.

Worked: `y = Wx + b` then `L = ||y||²`¶

Layers: L = f(y), y = g(x) = Wx + b.

f(y) = ||y||² = y^T y. ∇_y L = 2y. (Column.)
g(x) = Wx + b. J_g(x) = W.

So:

\[ \nabla_x L = J_g^T \cdot \nabla_y L = W^T \cdot 2y = 2 W^T (W x + b) \]

Sanity check: direct expansion. L(x) = ||Wx + b||² = (Wx + b)^T (Wx + b). Direct gradient: ∇_x L = 2 W^T (Wx + b). Same answer.

The point isn't this particular formula; it's that the chain rule via Jacobians produces the right answer mechanically, without re-deriving from scratch each time.

Worked: gradient of cross-entropy through softmax¶

This is the most important gradient derivation in Phase 4. We'll do it three ways: by hand, via the chain rule, and as a sanity check.

Setup¶

Logits x ∈ R^V (V = vocab size).
Probabilities p = softmax(x), so p_i = e^{x_i} / Σ_j e^{x_j}.
True label y ∈ {0, ..., V-1}.
Loss L = -log p_y = -x_y + log Σ_j e^{x_j} (the stable CE form from Phase 2).

We want ∂L/∂x_i for each i.

By hand¶

\[ \frac{\partial L}{\partial x_i} = \frac{\partial}{\partial x_i} \left( -x_y + \log \sum_j e^{x_j} \right) \]

Two terms.

First term: ∂(-x_y)/∂x_i = -δ_{iy} (Kronecker delta: 1 if i = y, else 0).

Second term: ∂(log Σ_j e^{x_j}) / ∂x_i. Let S = Σ_j e^{x_j}. Then log S differentiates as (1/S) · ∂S/∂x_i = (1/S) · e^{x_i} = p_i.

Combine:

\[ \frac{\partial L}{\partial x_i} = -\delta_{iy} + p_i = (p - \text{one\_hot}(y))_i \]

Or as a vector:

\[ \boxed{\nabla_x L = p - \text{one\_hot}(y)} \]

This collapses to a single line. No log, no exp, no inverse — the messy intermediate calculus cancels.

Why it's beautiful¶

Three reasons:

Numerical: computing softmax(x) - one_hot(y) is one subtraction. No risk of underflow / overflow beyond what stable_softmax already handles. Phase 7's autograd implements this directly as the backward of cross-entropy.
Computational: if the model's forward pass computed p = softmax(x), then the backward pass just subtracts one_hot. Zero new computation needed. This is why every framework fuses softmax + cross-entropy as one op.
Pedagogical: the most-used loss function in deep learning has the cleanest possible gradient. There's a reason this is the loss everyone uses for classification — its calculus is friendly.

Via the chain rule¶

For completeness, let's redo it as composition of two Jacobians (so we can see the explicit matmul cancellation).

Inner layer: p = softmax(x). Jacobian J_softmax(x) = diag(p) - p p^T (you'll derive this in lab 00).
Outer layer: L = -log(p_y). Gradient ∇_p L = -(1/p_y) · e_y where e_y is the standard basis vector. So (∇_p L)_i = -δ_{iy} / p_y.

Apply chain rule:

\[ \nabla_x L = J_{softmax}^T \cdot \nabla_p L = (\text{diag}(p) - p p^T) \cdot \left( -\frac{1}{p_y} e_y \right) \]

Compute the matrix-vector product:

diag(p) · (-e_y / p_y) = -e_y (the e_y-th entry is p_y · (-1/p_y) = -1; others are 0).
p p^T · (-e_y / p_y) = -p_y · p / p_y = -p.

So:

\[ \nabla_x L = -e_y - (-p) = p - e_y = p - \text{one\_hot}(y) \]

Same answer. The intermediate Jacobian's diagonal-plus-rank-one structure cancels exactly into the one-hot subtraction.

This is what Phase 7's autograd code will implement implicitly — it's why cross_entropy_with_logits exists as a fused op in every framework.

Sanity check via finite differences¶

In lab 00 you'll verify the analytical result with numerical differentiation. Use centred differences:

\[ \frac{\partial L}{\partial x_i} \approx \frac{L(x + h e_i) - L(x - h e_i)}{2h} \]

with h = 1e-4. The numerical and analytical answers should agree to about 1e-4 — limited by h² truncation error.

If you see a disagreement larger than 1e-3, your analytical derivation is wrong. (Don't blame numerical noise without re-deriving.)

Worked: gradient of `(W_2 relu(W_1 x))`¶

A two-layer linear-relu-linear chain. We won't fully derive this here (lab 01 does), but the chain rule walk is:

Forward:
z_1 = W_1 x
a_1 = relu(z_1)
z_2 = W_2 a_1
L = scalar function of z_2
Backward (start with g_2 = ∇_{z_2} L):
∇_{a_1} L = W_2^T g_2 (linear layer Jacobian = W^T).
∇_{z_1} L = ∇_{a_1} L * relu'(z_1) (element-wise * because Jacobian of relu is diagonal).
∇_x L = W_1^T · ∇_{z_1} L.

The parameter gradients:

∇_{W_2} L = g_2 · a_1^T (outer product — gradient is rank 1 contribution per sample).
∇_{W_1} L = (∇_{z_1} L) · x^T.

You'll re-derive this in lab 01 and verify with finite differences.

Drill problems¶

Solutions in solutions/02-chain-rule-and-backprop-ref.md at phase open.

Derive ∂/∂x sigmoid(x). Show it equals σ(x)(1 - σ(x)).
Derive J_softmax(x) for length-n x. Show it's diag(p) - p p^T.
Derive ∂ L / ∂ W_1 for the 2-layer MLP above. Show the answer is the outer product g_1 · x^T where g_1 = ∇_{z_1} L.
The "log-softmax" function is log_softmax(x)_i = x_i - log_sum_exp(x). Derive its Jacobian. Show that ∂ log_softmax(x)_i / ∂ x_j = δ_{ij} - p_j.
Why does -log(softmax(x)_y) collapse so cleanly while softmax(x)_y itself does not? (Hint: the log undoes one of the exps.)
For the loss L = -log softmax(x)_y, verify numerically (centred differences, h = 1e-4) that ∇_x L = softmax(x) - one_hot(y) for a specific x of your choosing.

One-paragraph recap¶

The chain rule for vector-valued functions is matrix multiplication of Jacobians, applied right-to-left when computing a scalar loss's gradient — the "reverse-mode" of automatic differentiation that backprop implements. The single cleanest gradient in deep learning is for cross-entropy through softmax: ∇_x L = softmax(x) - one_hot(y), derivable in three lines and computed essentially for free from the forward pass. Every Phase-7-and-later autograd operation is just a special case of "store intermediate values during forward; walk the graph in reverse, multiplying by transposed Jacobians." Phase 4's job is to make that derivation reflexive; Phase 7 turns it into code.

Next: theory/03-optimizers-derived.md.