English · Español
02 — The chain rule and the gradient of softmax + cross-entropy¶
🇪🇸 La regla de la cadena para funciones vectoriales es producto de Jacobianos. Aplicada a softmax + cross-entropy, colapsa en una línea:
∂CE/∂x = softmax(x) - one_hot(y). Esta página deriva ambos.
This is the theory page of Phase 4. Re-derive everything below from a blank page until it's mechanical.
Single-variable chain rule¶
For h(x) = f(g(x)):
The derivative of the outer function evaluated at the inner, times the derivative of the inner. Memorize.
Worked: h(x) = sin(x²). With f(u) = sin(u), g(x) = x²:
Multivariate chain rule (the version backprop uses)¶
For f: R^k → R^m and g: R^n → R^k, let h = f ∘ g: R^n → R^m. Then:
Matrix product of Jacobians. Shapes: (m × n) = (m × k) · (k × n). The inner dimension k matches by construction (output of g = input of f).
For a scalar loss L = f(g(x)) with f: R^k → R and g: R^n → R^k:
J_f = ∇_y L^Tis a1 × krow vector (transpose of the gradient).J_gisk × n.J_h = ∇_x L^T = ∇_y L^T · J_gis1 × n.
Transposing to get the gradient as a column:
This is the formula backprop uses. Read it as: "to compute the gradient w.r.t. the input of g, multiply the Jacobian of g (transposed) by the gradient w.r.t. the output of g."
Apply layer by layer:
reading right to left. Start with the gradient at the output (∇_y L, usually trivial — e.g., for L = scalar, ∇_y L = 1). At each layer, multiply by the layer's transposed Jacobian. Continue back to the input.
This walk is what every autograd framework implements. Phase 7 will code it directly.
Why right-to-left?¶
Two equivalent ways to compute the gradient:
-
Forward-mode (left-to-right): propagate Jacobians forward, multiplying
(m × k) × (k × n)at each step. For deep networks withm = 1(scalar loss) butnhuge (millions of parameters), each step is a big matrix-matrix product. -
Reverse-mode (right-to-left): start with the vector
∇_y L: (m,) = (1,)and propagate it backward. At each step, you only multiply matrix × vector, never matrix × matrix.
For deep networks where m = 1 (one loss) and n huge (model parameters), reverse mode is vastly cheaper: O(parameters) per backward pass vs O(parameters²) for forward mode.
Conversely, if m is huge (many outputs) and n small (few inputs), forward mode wins.
For ML: always reverse-mode. This is the reason "backpropagation" is the name — you propagate gradients backward through the graph.
Worked: y = Wx + b then L = ||y||²¶
Layers: L = f(y), y = g(x) = Wx + b.
f(y) = ||y||² = y^T y.∇_y L = 2y. (Column.)g(x) = Wx + b.J_g(x) = W.
So:
Sanity check: direct expansion. L(x) = ||Wx + b||² = (Wx + b)^T (Wx + b). Direct gradient: ∇_x L = 2 W^T (Wx + b). Same answer.
The point isn't this particular formula; it's that the chain rule via Jacobians produces the right answer mechanically, without re-deriving from scratch each time.
Worked: gradient of cross-entropy through softmax¶
This is the most important gradient derivation in Phase 4. We'll do it three ways: by hand, via the chain rule, and as a sanity check.
Setup¶
- Logits
x ∈ R^V(V = vocab size). - Probabilities
p = softmax(x), sop_i = e^{x_i} / Σ_j e^{x_j}. - True label
y ∈ {0, ..., V-1}. - Loss
L = -log p_y = -x_y + log Σ_j e^{x_j}(the stable CE form from Phase 2).
We want ∂L/∂x_i for each i.
By hand¶
Two terms.
First term: ∂(-x_y)/∂x_i = -δ_{iy} (Kronecker delta: 1 if i = y, else 0).
Second term: ∂(log Σ_j e^{x_j}) / ∂x_i. Let S = Σ_j e^{x_j}. Then log S differentiates as (1/S) · ∂S/∂x_i = (1/S) · e^{x_i} = p_i.
Combine:
Or as a vector:
This collapses to a single line. No log, no exp, no inverse — the messy intermediate calculus cancels.
Why it's beautiful¶
Three reasons:
-
Numerical: computing
softmax(x) - one_hot(y)is one subtraction. No risk of underflow / overflow beyond whatstable_softmaxalready handles. Phase 7's autograd implements this directly as the backward of cross-entropy. -
Computational: if the model's forward pass computed
p = softmax(x), then the backward pass just subtracts one_hot. Zero new computation needed. This is why every framework fuses softmax + cross-entropy as one op. -
Pedagogical: the most-used loss function in deep learning has the cleanest possible gradient. There's a reason this is the loss everyone uses for classification — its calculus is friendly.
Via the chain rule¶
For completeness, let's redo it as composition of two Jacobians (so we can see the explicit matmul cancellation).
- Inner layer:
p = softmax(x). JacobianJ_softmax(x) = diag(p) - p p^T(you'll derive this in lab 00). - Outer layer:
L = -log(p_y). Gradient∇_p L = -(1/p_y) · e_ywheree_yis the standard basis vector. So(∇_p L)_i = -δ_{iy} / p_y.
Apply chain rule:
Compute the matrix-vector product:
diag(p) · (-e_y / p_y) = -e_y(thee_y-th entry isp_y · (-1/p_y) = -1; others are 0).p p^T · (-e_y / p_y) = -p_y · p / p_y = -p.
So:
Same answer. The intermediate Jacobian's diagonal-plus-rank-one structure cancels exactly into the one-hot subtraction.
This is what Phase 7's autograd code will implement implicitly — it's why cross_entropy_with_logits exists as a fused op in every framework.
Sanity check via finite differences¶
In lab 00 you'll verify the analytical result with numerical differentiation. Use centred differences:
with h = 1e-4. The numerical and analytical answers should agree to about 1e-4 — limited by h² truncation error.
If you see a disagreement larger than 1e-3, your analytical derivation is wrong. (Don't blame numerical noise without re-deriving.)
Worked: gradient of (W_2 relu(W_1 x))¶
A two-layer linear-relu-linear chain. We won't fully derive this here (lab 01 does), but the chain rule walk is:
- Forward:
z_1 = W_1 xa_1 = relu(z_1)z_2 = W_2 a_1-
L = scalar function of z_2 -
Backward (start with
g_2 = ∇_{z_2} L): ∇_{a_1} L = W_2^T g_2(linear layer Jacobian = W^T).∇_{z_1} L = ∇_{a_1} L * relu'(z_1)(element-wise * because Jacobian of relu is diagonal).∇_x L = W_1^T · ∇_{z_1} L.
The parameter gradients:
∇_{W_2} L = g_2 · a_1^T(outer product — gradient is rank 1 contribution per sample).∇_{W_1} L = (∇_{z_1} L) · x^T.
You'll re-derive this in lab 01 and verify with finite differences.
Drill problems¶
Solutions in solutions/02-chain-rule-and-backprop-ref.md at phase open.
- Derive
∂/∂x sigmoid(x). Show it equalsσ(x)(1 - σ(x)). - Derive
J_softmax(x)for length-nx. Show it'sdiag(p) - p p^T. - Derive
∂ L / ∂ W_1for the 2-layer MLP above. Show the answer is the outer productg_1 · x^Twhereg_1 = ∇_{z_1} L. - The "log-softmax" function is
log_softmax(x)_i = x_i - log_sum_exp(x). Derive its Jacobian. Show that∂ log_softmax(x)_i / ∂ x_j = δ_{ij} - p_j. - Why does
-log(softmax(x)_y)collapse so cleanly whilesoftmax(x)_yitself does not? (Hint: thelogundoes one of theexps.) - For the loss
L = -log softmax(x)_y, verify numerically (centred differences,h = 1e-4) that∇_x L = softmax(x) - one_hot(y)for a specificxof your choosing.
One-paragraph recap¶
The chain rule for vector-valued functions is matrix multiplication of Jacobians, applied right-to-left when computing a scalar loss's gradient — the "reverse-mode" of automatic differentiation that backprop implements. The single cleanest gradient in deep learning is for cross-entropy through softmax: ∇_x L = softmax(x) - one_hot(y), derivable in three lines and computed essentially for free from the forward pass. Every Phase-7-and-later autograd operation is just a special case of "store intermediate values during forward; walk the graph in reverse, multiplying by transposed Jacobians." Phase 4's job is to make that derivation reflexive; Phase 7 turns it into code.
Next: theory/03-optimizers-derived.md.