Skip to content

English · Español

03 — The two high-stakes derivations: matmul and softmax-CE

🇪🇸 Las dos derivaciones más importantes de la fase. Matmul backward: aparece en cada capa lineal, en cada attention, en cada FFN. Softmax + cross-entropy fundidos en una op: la única forma de tener un gradiente numéricamente estable para la pérdida. Cuando puedas reproducirlas de memoria, has cruzado un hito.


Matmul backward

Forward

C = A @ B where A.shape = (M, K), B.shape = (K, N), so C.shape = (M, N).

Element-wise:

\[ C_{ij} = \sum_{k=0}^{K-1} A_{ik} B_{kj} \]

Derivation by indices

We want ∂L/∂A and ∂L/∂B, given ∂L/∂C (the upstream gradient, shape (M, N)).

∂L/∂A_{pq}: how does L change when we wiggle A_{pq}? Through every C_{ij} that depends on A_{pq}. From the forward formula, A_{pq} appears in C_{pj} for all j (with coefficient B_{qj}). So:

\[ \frac{\partial L}{\partial A_{pq}} = \sum_{j=0}^{N-1} \frac{\partial L}{\partial C_{pj}} \cdot B_{qj} \]

Look at the index structure: this is ∂L/∂C (row p, columns j) dotted with B (column q, rows j — i.e., row q of B.T). So:

\[ \frac{\partial L}{\partial A} = \frac{\partial L}{\partial C} \cdot B^T \]

∂L/∂B_{rs}: similarly, B_{rs} appears in C_{is} for all i, with coefficient A_{ir}. So:

\[ \frac{\partial L}{\partial B_{rs}} = \sum_{i=0}^{M-1} \frac{\partial L}{\partial C_{is}} \cdot A_{ir} = \sum_{i} A_{ir} \cdot \frac{\partial L}{\partial C_{is}} \]

This is A.T (row r, columns i — i.e., column r of A) dotted with ∂L/∂C (column s, rows i). So:

\[ \frac{\partial L}{\partial B} = A^T \cdot \frac{\partial L}{\partial C} \]

The two formulas to memorize

\[\frac{\partial L}{\partial A} = \frac{\partial L}{\partial C} \cdot B^T, \quad \frac{\partial L}{\partial B} = A^T \cdot \frac{\partial L}{\partial C}\]

In code:

# C = A @ B
def _backward():
    if A.requires_grad:
        A.grad = (A.grad or 0) + C.grad @ B.data.T
    if B.requires_grad:
        B.grad = (B.grad or 0) + A.data.T @ C.grad

Shape check: - C.grad @ B.T: (M, N) @ (N, K) → (M, K). Matches A.shape. ✓ - A.T @ C.grad: (K, M) @ (M, N) → (K, N). Matches B.shape. ✓

Batch dimensions

For batched matmul, A.shape = (..., M, K), B.shape = (..., K, N), C.shape = (..., M, N). The "..." are batch dims that broadcast just like elementwise.

Backward: same formulas, but T becomes "swap the last two axes" (np.swapaxes(B, -1, -2)), and the batch dims may need unbroadcast if there was broadcasting in the batch dims.

In code:

def _backward():
    A_contrib = C.grad @ np.swapaxes(B.data, -1, -2)
    B_contrib = np.swapaxes(A.data, -1, -2) @ C.grad
    A.grad = (A.grad or 0) + unbroadcast(A_contrib, A.data.shape)
    B.grad = (B.grad or 0) + unbroadcast(B_contrib, B.data.shape)

Use np.swapaxes(x, -1, -2) not x.T — the latter reverses all axes in NumPy for 2-D arrays, but you specifically want the last two.

Sanity check with the worked example

Take A = [[1, 2], [3, 4]] (2×2), B = [[5], [6]] (2×1). Then C = A @ B = [[17], [39]] (2×1).

Suppose ∂L/∂C = [[1], [1]] (upstream gradient — say L = C.sum()).

∂L/∂A = ∂L/∂C @ B.T = [[1], [1]] @ [[5, 6]] = [[5, 6], [5, 6]]. Check by hand: L = 17 + 39 = (A[0,0]·5 + A[0,1]·6) + (A[1,0]·5 + A[1,1]·6). So ∂L/∂A = [[5, 6], [5, 6]]. ✓

∂L/∂B = A.T @ ∂L/∂C = [[1, 3], [2, 4]] @ [[1], [1]] = [[4], [6]]. Check by hand: L = (1+3)·B[0,0] + (2+4)·B[1,0] = 4·B[0,0] + 6·B[1,0]. So ∂L/∂B = [[4], [6]]. ✓

This is the single most important derivation in Phase 8. Memorize the two formulas. Re-derive them from scratch if you ever forget — the index manipulation is short.

Softmax backward (standalone) — and why we avoid it

Forward

For a length-N vector x, softmax is:

\[ s_i = \frac{e^{x_i}}{\sum_{j} e^{x_j}} \]

The output is a probability distribution: all s_i ∈ (0, 1), sum to 1.

Jacobian

The Jacobian of softmax has off-diagonal entries:

\[ \frac{\partial s_i}{\partial x_j} = s_i (\delta_{ij} - s_j) \]

where δ_ij = 1 if i==j else 0.

Derive: by the quotient rule on s_i = exp(x_i) / Z with Z = Σⱼ exp(xⱼ):

\[ \frac{\partial s_i}{\partial x_j} = \frac{\partial}{\partial x_j} \left[ \frac{e^{x_i}}{Z} \right] \]

Case i = j: ∂(e^{x_i})/∂x_i = e^{x_i}; ∂Z/∂x_i = e^{x_i}. By quotient rule: $$ \frac{\partial s_i}{\partial x_i} = \frac{e^{x_i} \cdot Z - e^{x_i} \cdot e{x_i}}{Z2} = \frac{e^{x_i}}{Z} - \left(\frac{e{x_i}}{Z}\right)2 = s_i - s_i^2 = s_i(1 - s_i) $$

Case i ≠ j: ∂(e^{x_i})/∂x_j = 0; ∂Z/∂x_j = e^{x_j}. So: $$ \frac{\partial s_i}{\partial x_j} = \frac{0 \cdot Z - e^{x_i} \cdot e{x_j}}{Z2} = -s_i s_j $$

Combined: ∂s_i/∂x_j = s_i(δ_ij - s_j). ✓

Backward in matrix form

Given upstream ∂L/∂s (a length-N vector), the contribution to ∂L/∂x is:

\[ \frac{\partial L}{\partial x_j} = \sum_i \frac{\partial L}{\partial s_i} \cdot \frac{\partial s_i}{\partial x_j} = \sum_i \frac{\partial L}{\partial s_i} \cdot s_i (\delta_{ij} - s_j) \]
\[ = s_j \left( \frac{\partial L}{\partial s_j} - \sum_i s_i \cdot \frac{\partial L}{\partial s_i} \right) \]

In code:

# s = softmax(x)
def _backward():
    # s.shape = x.shape = (..., N)
    weighted_sum = (s.data * s.grad).sum(axis=-1, keepdims=True)
    x.grad = (x.grad or 0) + s.data * (s.grad - weighted_sum)

Why we don't typically use softmax standalone in training

Because the Jacobian is dense, and the typical use is to feed softmax into cross-entropy, and the combined gradient simplifies enormously.

Standalone softmax is still useful (for inference, for attention weights). We implement it. But we provide a separate fused cross_entropy(logits, targets) op for training.

Cross-entropy from logits (the fused op)

Forward

Given logits x ∈ R^{B × C} (batch B, C classes) and integer targets y ∈ Z^B (label index per example), the cross-entropy loss is:

\[ L = -\frac{1}{B} \sum_{b=0}^{B-1} \log\left( \text{softmax}(x_b)_{y_b} \right) \]

The mean over the batch is standard; some frameworks return sum (with reduction='sum'). We default to mean.

Numerically stable computation: never materialize log(softmax(x)) directly — for very negative logits, softmax underflows to 0 and log(0) = -inf. Instead:

\[ \log \text{softmax}(x_b)_c = x_{b,c} - \log\sum_{j} e^{x_{b,j}} = x_{b,c} - \text{logsumexp}(x_b) \]

And logsumexp(x_b) = m + log(sum(exp(x_b - m))) where m = max(x_b) — the standard trick from Phase 2.

In code, forward:

# x.shape = (B, C); y.shape = (B,) int
m = x.data.max(axis=-1, keepdims=True)
log_sum_exp = m + np.log(np.exp(x.data - m).sum(axis=-1, keepdims=True))
log_softmax = x.data - log_sum_exp  # shape (B, C)
nll = -log_softmax[np.arange(B), y.data]  # shape (B,) — pick the target class's log-softmax
L_data = nll.mean()

The beautiful identity

The combined gradient with respect to logits:

\[ \frac{\partial L}{\partial x_{b,c}} = \frac{1}{B} \left( \text{softmax}(x_b)_c - \mathbb{1}[c = y_b] \right) \]

That is: (predicted probability) minus (one-hot target), divided by batch size.

Derivation

The standalone softmax Jacobian: ∂s_i/∂x_j = s_i (δ_ij - s_j).

The log-of-softmax derivative: $$ \frac{\partial \log s_i}{\partial x_j} = \frac{1}{s_i} \cdot s_i(\delta_{ij} - s_j) = \delta_{ij} - s_j $$

CE with target y: L_b = -log(s_y). So: $$ \frac{\partial L_b}{\partial x_{b,j}} = -(\delta_{y,j} - s_j) = s_j - \delta_{y,j} $$

That's the gradient per example. Averaging over the batch gives (s - one_hot(y))/B.

The implementation

# L = cross_entropy(x, y)
def forward():
    B, C = x.data.shape
    m = x.data.max(axis=-1, keepdims=True)
    log_sum_exp = m + np.log(np.exp(x.data - m).sum(axis=-1, keepdims=True))
    log_softmax = x.data - log_sum_exp
    nll = -log_softmax[np.arange(B), y.data]
    L_data = nll.mean()
    return L_data, log_softmax  # keep log_softmax for backward

def _backward(log_softmax):
    B, C = x.data.shape
    softmax = np.exp(log_softmax)
    one_hot = np.zeros_like(softmax)
    one_hot[np.arange(B), y.data] = 1.0
    grad = (softmax - one_hot) / B
    x.grad = (x.grad or 0) + grad * L.grad   # L.grad is scalar (we called backward on the loss)

Why fuse?

  1. Numerical stability. log(softmax(x)) is x - logsumexp(x). We never compute log(0). Same for the gradient: softmax - one_hot is well-conditioned.
  2. Compute economy. One pass instead of two.
  3. Memory. No intermediate (B, C) softmax tensor in the graph (it's computed in _backward from cached log_softmax).
  4. Pedagogical clarity. The identity grad = softmax - one_hot is one of ML's most elegant facts. Worth seeing in code.

PyTorch's F.cross_entropy(logits, targets) is exactly this fused op, with identical math. Our minitorch.tensor.cross_entropy mirrors it.

Cross-check pattern

def test_cross_entropy():
    rng = np.random.default_rng(42)
    x_data = rng.standard_normal((4, 3)).astype(np.float64)
    y_data = np.array([0, 2, 1, 0], dtype=np.int64)

    x = Tensor(x_data, requires_grad=True)
    y = Tensor(y_data)
    L = cross_entropy(x, y)
    L.backward()

    tx = torch.tensor(x_data, dtype=torch.float64, requires_grad=True)
    ty = torch.tensor(y_data, dtype=torch.long)
    tL = torch.nn.functional.cross_entropy(tx, ty)
    tL.backward()

    assert np.allclose(L.data, tL.item(), atol=1e-9)
    assert np.allclose(x.grad, tx.grad.numpy(), atol=1e-9)

If the two gradients agree to 1e-9, the op is correct.

Summary table

Op Forward shape Local Jacobian Backward formula
matmul(A, B) (..., M, K) @ (..., K, N) → (..., M, N) tensor-shaped dA = dC @ B.T, dB = A.T @ dC
softmax(x) preserves shape dense Jacobian s_i(δ_ij - s_j) s · (ds - sum(s · ds))
cross_entropy(logits, y) (B, C), (B,) → () (scalar) combined (softmax(logits) - one_hot(y)) / B

These three derivations are the most important content of Phase 8. Internalize them.

One-paragraph recap

Matmul backward is dA = dC @ B.T and dB = A.T @ dC — derive by indices once, memorize. Standalone softmax has a dense Jacobian (s_i(δ_ij - s_j)) but is rarely used in training because the combined cross-entropy-from-logits op simplifies to the beautiful identity grad = softmax - one_hot, divided by batch size. The fused op is also numerically stable (no log(0)) and PyTorch matches our implementation to 1e-9. These three derivations are the high-stakes deliverable of Phase 8; everything else is straightforward.


Next: 04-gradcheck-and-property-tests.md