English · Español
03 — The two high-stakes derivations: matmul and softmax-CE¶
🇪🇸 Las dos derivaciones más importantes de la fase. Matmul backward: aparece en cada capa lineal, en cada attention, en cada FFN. Softmax + cross-entropy fundidos en una op: la única forma de tener un gradiente numéricamente estable para la pérdida. Cuando puedas reproducirlas de memoria, has cruzado un hito.
Matmul backward¶
Forward¶
C = A @ B where A.shape = (M, K), B.shape = (K, N), so C.shape = (M, N).
Element-wise:
Derivation by indices¶
We want ∂L/∂A and ∂L/∂B, given ∂L/∂C (the upstream gradient, shape (M, N)).
∂L/∂A_{pq}: how does L change when we wiggle A_{pq}? Through every C_{ij} that depends on A_{pq}. From the forward formula, A_{pq} appears in C_{pj} for all j (with coefficient B_{qj}). So:
Look at the index structure: this is ∂L/∂C (row p, columns j) dotted with B (column q, rows j — i.e., row q of B.T). So:
∂L/∂B_{rs}: similarly, B_{rs} appears in C_{is} for all i, with coefficient A_{ir}. So:
This is A.T (row r, columns i — i.e., column r of A) dotted with ∂L/∂C (column s, rows i). So:
The two formulas to memorize¶
\[\frac{\partial L}{\partial A} = \frac{\partial L}{\partial C} \cdot B^T, \quad \frac{\partial L}{\partial B} = A^T \cdot \frac{\partial L}{\partial C}\]
In code:
# C = A @ B
def _backward():
if A.requires_grad:
A.grad = (A.grad or 0) + C.grad @ B.data.T
if B.requires_grad:
B.grad = (B.grad or 0) + A.data.T @ C.grad
Shape check:
- C.grad @ B.T: (M, N) @ (N, K) → (M, K). Matches A.shape. ✓
- A.T @ C.grad: (K, M) @ (M, N) → (K, N). Matches B.shape. ✓
Batch dimensions¶
For batched matmul, A.shape = (..., M, K), B.shape = (..., K, N), C.shape = (..., M, N). The "..." are batch dims that broadcast just like elementwise.
Backward: same formulas, but T becomes "swap the last two axes" (np.swapaxes(B, -1, -2)), and the batch dims may need unbroadcast if there was broadcasting in the batch dims.
In code:
def _backward():
A_contrib = C.grad @ np.swapaxes(B.data, -1, -2)
B_contrib = np.swapaxes(A.data, -1, -2) @ C.grad
A.grad = (A.grad or 0) + unbroadcast(A_contrib, A.data.shape)
B.grad = (B.grad or 0) + unbroadcast(B_contrib, B.data.shape)
Use np.swapaxes(x, -1, -2) not x.T — the latter reverses all axes in NumPy for 2-D arrays, but you specifically want the last two.
Sanity check with the worked example¶
Take A = [[1, 2], [3, 4]] (2×2), B = [[5], [6]] (2×1). Then C = A @ B = [[17], [39]] (2×1).
Suppose ∂L/∂C = [[1], [1]] (upstream gradient — say L = C.sum()).
∂L/∂A = ∂L/∂C @ B.T = [[1], [1]] @ [[5, 6]] = [[5, 6], [5, 6]]. Check by hand: L = 17 + 39 = (A[0,0]·5 + A[0,1]·6) + (A[1,0]·5 + A[1,1]·6). So ∂L/∂A = [[5, 6], [5, 6]]. ✓
∂L/∂B = A.T @ ∂L/∂C = [[1, 3], [2, 4]] @ [[1], [1]] = [[4], [6]]. Check by hand: L = (1+3)·B[0,0] + (2+4)·B[1,0] = 4·B[0,0] + 6·B[1,0]. So ∂L/∂B = [[4], [6]]. ✓
This is the single most important derivation in Phase 8. Memorize the two formulas. Re-derive them from scratch if you ever forget — the index manipulation is short.
Softmax backward (standalone) — and why we avoid it¶
Forward¶
For a length-N vector x, softmax is:
The output is a probability distribution: all s_i ∈ (0, 1), sum to 1.
Jacobian¶
The Jacobian of softmax has off-diagonal entries:
where δ_ij = 1 if i==j else 0.
Derive: by the quotient rule on s_i = exp(x_i) / Z with Z = Σⱼ exp(xⱼ):
Case i = j: ∂(e^{x_i})/∂x_i = e^{x_i}; ∂Z/∂x_i = e^{x_i}. By quotient rule:
$$
\frac{\partial s_i}{\partial x_i} = \frac{e^{x_i} \cdot Z - e^{x_i} \cdot e{x_i}}{Z2} = \frac{e^{x_i}}{Z} - \left(\frac{e{x_i}}{Z}\right)2 = s_i - s_i^2 = s_i(1 - s_i)
$$
Case i ≠ j: ∂(e^{x_i})/∂x_j = 0; ∂Z/∂x_j = e^{x_j}. So:
$$
\frac{\partial s_i}{\partial x_j} = \frac{0 \cdot Z - e^{x_i} \cdot e{x_j}}{Z2} = -s_i s_j
$$
Combined: ∂s_i/∂x_j = s_i(δ_ij - s_j). ✓
Backward in matrix form¶
Given upstream ∂L/∂s (a length-N vector), the contribution to ∂L/∂x is:
In code:
# s = softmax(x)
def _backward():
# s.shape = x.shape = (..., N)
weighted_sum = (s.data * s.grad).sum(axis=-1, keepdims=True)
x.grad = (x.grad or 0) + s.data * (s.grad - weighted_sum)
Why we don't typically use softmax standalone in training¶
Because the Jacobian is dense, and the typical use is to feed softmax into cross-entropy, and the combined gradient simplifies enormously.
Standalone softmax is still useful (for inference, for attention weights). We implement it. But we provide a separate fused cross_entropy(logits, targets) op for training.
Cross-entropy from logits (the fused op)¶
Forward¶
Given logits x ∈ R^{B × C} (batch B, C classes) and integer targets y ∈ Z^B (label index per example), the cross-entropy loss is:
The mean over the batch is standard; some frameworks return sum (with reduction='sum'). We default to mean.
Numerically stable computation: never materialize log(softmax(x)) directly — for very negative logits, softmax underflows to 0 and log(0) = -inf. Instead:
And logsumexp(x_b) = m + log(sum(exp(x_b - m))) where m = max(x_b) — the standard trick from Phase 2.
In code, forward:
# x.shape = (B, C); y.shape = (B,) int
m = x.data.max(axis=-1, keepdims=True)
log_sum_exp = m + np.log(np.exp(x.data - m).sum(axis=-1, keepdims=True))
log_softmax = x.data - log_sum_exp # shape (B, C)
nll = -log_softmax[np.arange(B), y.data] # shape (B,) — pick the target class's log-softmax
L_data = nll.mean()
The beautiful identity¶
The combined gradient with respect to logits:
That is: (predicted probability) minus (one-hot target), divided by batch size.
Derivation¶
The standalone softmax Jacobian: ∂s_i/∂x_j = s_i (δ_ij - s_j).
The log-of-softmax derivative: $$ \frac{\partial \log s_i}{\partial x_j} = \frac{1}{s_i} \cdot s_i(\delta_{ij} - s_j) = \delta_{ij} - s_j $$
CE with target y: L_b = -log(s_y). So:
$$
\frac{\partial L_b}{\partial x_{b,j}} = -(\delta_{y,j} - s_j) = s_j - \delta_{y,j}
$$
That's the gradient per example. Averaging over the batch gives (s - one_hot(y))/B.
The implementation¶
# L = cross_entropy(x, y)
def forward():
B, C = x.data.shape
m = x.data.max(axis=-1, keepdims=True)
log_sum_exp = m + np.log(np.exp(x.data - m).sum(axis=-1, keepdims=True))
log_softmax = x.data - log_sum_exp
nll = -log_softmax[np.arange(B), y.data]
L_data = nll.mean()
return L_data, log_softmax # keep log_softmax for backward
def _backward(log_softmax):
B, C = x.data.shape
softmax = np.exp(log_softmax)
one_hot = np.zeros_like(softmax)
one_hot[np.arange(B), y.data] = 1.0
grad = (softmax - one_hot) / B
x.grad = (x.grad or 0) + grad * L.grad # L.grad is scalar (we called backward on the loss)
Why fuse?¶
- Numerical stability.
log(softmax(x))isx - logsumexp(x). We never computelog(0). Same for the gradient:softmax - one_hotis well-conditioned. - Compute economy. One pass instead of two.
- Memory. No intermediate
(B, C)softmax tensor in the graph (it's computed in_backwardfrom cachedlog_softmax). - Pedagogical clarity. The identity
grad = softmax - one_hotis one of ML's most elegant facts. Worth seeing in code.
PyTorch's F.cross_entropy(logits, targets) is exactly this fused op, with identical math. Our minitorch.tensor.cross_entropy mirrors it.
Cross-check pattern¶
def test_cross_entropy():
rng = np.random.default_rng(42)
x_data = rng.standard_normal((4, 3)).astype(np.float64)
y_data = np.array([0, 2, 1, 0], dtype=np.int64)
x = Tensor(x_data, requires_grad=True)
y = Tensor(y_data)
L = cross_entropy(x, y)
L.backward()
tx = torch.tensor(x_data, dtype=torch.float64, requires_grad=True)
ty = torch.tensor(y_data, dtype=torch.long)
tL = torch.nn.functional.cross_entropy(tx, ty)
tL.backward()
assert np.allclose(L.data, tL.item(), atol=1e-9)
assert np.allclose(x.grad, tx.grad.numpy(), atol=1e-9)
If the two gradients agree to 1e-9, the op is correct.
Summary table¶
| Op | Forward shape | Local Jacobian | Backward formula |
|---|---|---|---|
matmul(A, B) |
(..., M, K) @ (..., K, N) → (..., M, N) |
tensor-shaped | dA = dC @ B.T, dB = A.T @ dC |
softmax(x) |
preserves shape | dense Jacobian s_i(δ_ij - s_j) |
s · (ds - sum(s · ds)) |
cross_entropy(logits, y) |
(B, C), (B,) → () (scalar) |
combined | (softmax(logits) - one_hot(y)) / B |
These three derivations are the most important content of Phase 8. Internalize them.
One-paragraph recap¶
Matmul backward is dA = dC @ B.T and dB = A.T @ dC — derive by indices once, memorize. Standalone softmax has a dense Jacobian (s_i(δ_ij - s_j)) but is rarely used in training because the combined cross-entropy-from-logits op simplifies to the beautiful identity grad = softmax - one_hot, divided by batch size. The fused op is also numerically stable (no log(0)) and PyTorch matches our implementation to 1e-9. These three derivations are the high-stakes deliverable of Phase 8; everything else is straightforward.