Lab 03: matmul, softmax, and cross_entropy, the three high-stakes ops
Goal: implement the three ops that the rest of the curriculum depends on.
`matmul` is the workhorse of every layer; `softmax` is the workhorse of every classifier; `cross_entropy` (from logits) is the loss that drives Phase 9's MLP and Phase 17's transformer. Get them right here, at FP64 and with both oracles green, and Phase 9+ inherits a trustworthy foundation.
Estimated time: 4–5 hours. This is the longest lab in the phase.
Prereqs: Labs 00 + 01 + 02. Theory `03-matmul-and-softmax-grads.md` re-read end-to-end. Phase 2's `stable_softmax` and `stable_cross_entropy` open in another window; we'll reuse the intuition, not the code.
What you produce
- Three ops added to `src/minitorch/tensor.py`: `Tensor.matmul`, `Tensor.softmax`, module-level `cross_entropy(logits, targets, reduction='mean')`.
- The `@` operator wired up: `A @ B` calls `A.matmul(B)`.
- `tests/test_matmul.py`, `tests/test_softmax.py`, `tests/test_cross_entropy.py`, each with PyTorch cross-check + gradcheck + edge cases.
- A grammar-flavoured end-to-end gradient test: `(one_hot_person @ tense_logits → softmax → cross_entropy)` matches PyTorch.
Why these three are "high-stakes"
Bugs in add show up at the next test boundary. Bugs in matmul/softmax/cross_entropy show up as silent training failures three days later: model trains, loss decreases, validation accuracy plateaus 5% below where it should be. The Phase 8 testing strategy exists primarily to catch bugs in these three ops before they pollute Phase 9–22.
The three operations you cannot afford to get wrong: matmul (silent bias in any linear layer), softmax (numerical overflow/underflow), cross-entropy (the neat formula `softmax - one_hot` is trivial to derive but easy to implement incorrectly in the batched case). If all three pass the PyTorch cross-check and gradcheck, the rest of the curriculum rests on solid foundations.
TODOs
Block A: matmul
- `matmul(self, other) -> Tensor`: forward `self.data @ other.data`. Backward by the derivation in `theory/03`:
  - `self.grad += out.grad @ other.data.swapaxes(-1, -2)`
  - `other.grad += self.data.swapaxes(-1, -2) @ out.grad`
  - Use `_unbroadcast` on each: matmul broadcasts batch dims. With `A.shape = (1, 3, 4)` and `B.shape = (batch, 4, 5)`, `A @ B` has shape `(batch, 3, 5)`, so the `(1, 3, 4)` parent needs the gradient summed over the batch dim.
- Wire `__matmul__` so that `A @ B` works.
- Test matrix: 2D × 2D, 2D × 1D (vector), 1D × 2D, batched 3D × 3D, batched 3D × 2D (with broadcast).
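The backward rule in Block A can be sketched in plain NumPy, outside the `Tensor` class. This is a minimal sketch under stated assumptions, not the lab's reference solution: `unbroadcast` and `matmul_backward` are hypothetical stand-ins (the lab's own helper is `_unbroadcast`, whose exact signature is not shown here), and both operands are assumed to have at least two dimensions, so the 1D cases in the test matrix would need separate handling.

```python
import numpy as np


def unbroadcast(grad, shape):
    """Sum `grad` down to `shape`, undoing NumPy broadcasting (hypothetical helper)."""
    # Sum away leading axes that broadcasting prepended.
    while grad.ndim > len(shape):
        grad = grad.sum(axis=0)
    # Sum (keeping the axis) wherever the original dim was 1 but was stretched.
    for axis, dim in enumerate(shape):
        if dim == 1 and grad.shape[axis] != 1:
            grad = grad.sum(axis=axis, keepdims=True)
    return grad


def matmul_backward(a, b, upstream):
    """Gradients of sum(upstream * (a @ b)) w.r.t. a and b (assumes ndim >= 2)."""
    grad_a = unbroadcast(upstream @ np.swapaxes(b, -1, -2), a.shape)
    grad_b = unbroadcast(np.swapaxes(a, -1, -2) @ upstream, b.shape)
    return grad_a, grad_b


rng = np.random.default_rng(0)
a = rng.standard_normal((1, 3, 4))         # broadcast over the batch dim
b = rng.standard_normal((2, 4, 5))
upstream = rng.standard_normal((2, 3, 5))  # plays the role of out.grad
grad_a, grad_b = matmul_backward(a, b, upstream)
# grad_a.shape == (1, 3, 4): the batch dim has been summed back down
# grad_b.shape == (2, 4, 5)
```

Because the product is linear in each operand, a one-element finite-difference probe on `a` or `b` is a quick sanity check on the result.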
Block B: softmax
- `softmax(self, axis=-1) -> Tensor`: forward uses the max-subtraction stability trick (Phase 2):

  ```python
  shifted = self.data - self.data.max(axis=axis, keepdims=True)
  exps = np.exp(shifted)
  out_data = exps / exps.sum(axis=axis, keepdims=True)
  ```

  Backward by the Jacobian-vector product form (`theory/03`), that is, the vector form `s * (dy - sum(dy * s))` with the sum taken along `axis`. Derive on paper why this is the dense Jacobian collapsed to a vector formula. Spend 20 minutes on the derivation; the result is one of those identities that becomes "obvious" once seen.
- Test against PyTorch on rank-1, rank-2, rank-3 tensors. Stress test with logits of magnitude `1e3` (must not overflow).
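To see the collapse from dense Jacobian to vector formula numerically, the following minimal sketch compares the two on a rank-1 input. It works on raw NumPy arrays rather than `Tensor` objects, and `softmax_forward` / `softmax_backward` are illustrative names, not part of the lab's API.

```python
import numpy as np


def softmax_forward(x, axis=-1):
    """Max-subtraction softmax; finite even for logits of magnitude 1e3."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=axis, keepdims=True)


def softmax_backward(s, dy, axis=-1):
    """Vector form of the Jacobian-vector product: s * (dy - sum(dy * s))."""
    return s * (dy - (dy * s).sum(axis=axis, keepdims=True))


x = np.array([1000.0, 1001.0, 999.0])
dy = np.array([0.3, -1.2, 0.5])    # arbitrary upstream gradient
s = softmax_forward(x)

# Dense form: J[i, j] = s_i * (delta_ij - s_j), then dx = J^T @ dy.
jacobian = np.diag(s) - np.outer(s, s)
dense = jacobian.T @ dy
vector = softmax_backward(s, dy)

print(np.allclose(dense, vector))  # True
```

The dense route builds an N × N matrix per row of logits; the vector form only needs one reduction along `axis`, which is why `keepdims=True` matters once `axis` is not the last dimension.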
Block C: cross_entropy from logits
- `cross_entropy(logits: Tensor, targets: IntArray, reduction: str = 'mean') -> Tensor` (module-level function):
  - `targets` is a plain `np.ndarray` of integers (not a `Tensor`); targets don't have gradients.
  - Forward uses the log-sum-exp trick (Phase 2's `stable_cross_entropy`):

    ```python
    # logits.shape = (B, V); targets.shape = (B,)
    lse = log_sum_exp(logits.data, axis=-1)  # shape (B,)
    target_logit = np.take_along_axis(logits.data, targets[:, None], axis=-1).squeeze(-1)
    losses = lse - target_logit  # shape (B,)
    if reduction == 'mean':
        out_data = losses.mean()
    elif reduction == 'sum':
        out_data = losses.sum()
    elif reduction == 'none':
        out_data = losses
    else:
        raise ValueError(f"unknown reduction: {reduction!r}")
    ```

  - Backward, the one beautiful identity:

    ```python
    # Cache the softmax probabilities `p` (NOT the logits' exps; recompute stably).
    B = logits.data.shape[0]
    p = softmax(logits.data, axis=-1)  # stable
    grad = p.copy()
    np.add.at(grad, (np.arange(B), targets), -1.0)  # subtract one-hot
    if reduction == 'mean':
        grad /= B
    # upstream is out.grad: a scalar for mean/sum, shape (B,) for none
    if reduction == 'none':
        upstream = upstream[:, None]  # broadcast over the class axis
    logits.grad += grad * upstream
    ```

  - Do not implement `cross_entropy` as `(-target_log_softmax).mean()` chained from existing ops. The combined op is numerically stable and avoids materializing `log(softmax)`. Implement as a single fused op. (This mirrors PyTorch's `F.cross_entropy`.)
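Putting the forward and backward pieces of Block C together, a fused op could look roughly like the sketch below. It operates on raw NumPy arrays and returns the gradient instead of accumulating into `logits.grad`. The function name `cross_entropy_sketch` is hypothetical, and `log_sum_exp` is re-implemented here only as an assumption about what the Phase 2 helper does.

```python
import numpy as np


def log_sum_exp(x, axis=-1):
    """Stable log(sum(exp(x))) along `axis` (assumed behaviour of the Phase 2 helper)."""
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)


def cross_entropy_sketch(logits, targets, reduction="mean", upstream=1.0):
    """Return (loss, d loss / d logits) for logits (B, V) and integer targets (B,)."""
    batch = logits.shape[0]
    lse = log_sum_exp(logits, axis=-1)                   # (B,)
    target_logit = np.take_along_axis(logits, targets[:, None], axis=-1).squeeze(-1)
    losses = lse - target_logit                          # (B,)

    grad = np.exp(logits - lse[:, None])                 # softmax, computed stably
    grad[np.arange(batch), targets] -= 1.0               # subtract one-hot

    if reduction == "mean":
        return losses.mean(), grad * upstream / batch
    if reduction == "sum":
        return losses.sum(), grad * upstream
    if reduction == "none":
        # Per-example upstream has shape (B,); reshape so it scales each row.
        return losses, grad * np.asarray(upstream, dtype=float).reshape(-1, 1)
    raise ValueError(f"unknown reduction: {reduction!r}")


loss, grad = cross_entropy_sketch(np.array([[1000.0, 0.0, 0.0]]), np.array([2]))
# Confident and wrong: loss is about 1000 and grad is about [[1, 0, -1]].
```

Each gradient row sums to zero before upstream scaling, since the softmax row sums to one and exactly one is subtracted; that makes a cheap extra assertion in tests.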
Block D: end-to-end grammar test
After Blocks A–C work in isolation, build the grammar mini-graph and verify against PyTorch.

```python
import numpy as np
import torch

# Import path assumes src/minitorch/tensor.py is importable as minitorch.tensor.
from minitorch.tensor import Tensor, cross_entropy


def test_grammar_pipeline_matches_pytorch():
    # The §A13 baseline: select a person, project against tense logits, classify.
    rng = np.random.default_rng(42)
    person_onehot = np.array([0.0, 1.0, 0.0])  # "you" (2nd person)
    W = rng.standard_normal((3, 5))            # person → tense
    targets = np.array([2])                    # past-simple

    # ours
    P = Tensor(person_onehot[None, :], requires_grad=False)  # (1, 3)
    Wt = Tensor(W, requires_grad=True)                       # (3, 5)
    logits = P @ Wt                                          # (1, 5)
    loss = cross_entropy(logits, targets)
    loss.backward()

    # pytorch
    Pt = torch.tensor(person_onehot[None, :], dtype=torch.float64)
    Wpt = torch.tensor(W, dtype=torch.float64, requires_grad=True)
    logits_t = Pt @ Wpt
    loss_t = torch.nn.functional.cross_entropy(logits_t, torch.tensor(targets, dtype=torch.long))
    loss_t.backward()

    np.testing.assert_allclose(Wt.grad, Wpt.grad.numpy(), rtol=1e-7)
```
Numerical stability stress tests
These tests must pass. They're the difference between an autograd that works on the textbook example and one that works in Phase 17's transformer.

```python
import numpy as np

# Import path assumes src/minitorch/tensor.py is importable as minitorch.tensor.
from minitorch.tensor import Tensor, cross_entropy


def test_softmax_large_logits():
    # Logits with magnitude 1e3 must not overflow.
    x = Tensor(np.array([1000.0, 1001.0, 999.0]), requires_grad=True)
    y = x.softmax()
    y.sum().backward()
    assert np.all(np.isfinite(y.data))
    assert np.all(np.isfinite(x.grad))
    assert np.isclose(y.data.sum(), 1.0)


def test_cross_entropy_confident_correct():
    # Confident, correct prediction → ~0 loss; gradient is ~0.
    logits = Tensor(np.array([[1000.0, 0.0, 0.0]]), requires_grad=True)
    targets = np.array([0])
    loss = cross_entropy(logits, targets)
    loss.backward()
    assert loss.data < 1e-10
    assert np.allclose(logits.grad, np.array([[0.0, 0.0, 0.0]]), atol=1e-9)


def test_cross_entropy_confident_wrong():
    # Confident, wrong prediction → loss ≈ 1000; gradient is (1, 0, -1).
    logits = Tensor(np.array([[1000.0, 0.0, 0.0]]), requires_grad=True)
    targets = np.array([2])
    loss = cross_entropy(logits, targets)
    loss.backward()
    assert np.isclose(loss.data, 1000.0, atol=1e-3)
    expected = np.array([[1.0, 0.0, -1.0]])
    np.testing.assert_allclose(logits.grad, expected, atol=1e-9)
```
Constraints
- `softmax` always uses the max trick. No naive `exp / sum(exp)` anywhere in the forward.
- `cross_entropy` never materializes `log(softmax)`. Use LSE.
- `matmul` backward uses `swapaxes(-1, -2)`, not `.T`. `.T` reverses every axis, so it is a plain transpose only for 2D arrays; for batched matmul we need to swap the last two dims while preserving batch dims.
- Targets to `cross_entropy` are integers, not `Tensor`s. Reject one-hot targets at the API boundary with a loud error.
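One way to enforce the last constraint is a small guard called at the top of `cross_entropy`. The helper below is a hypothetical sketch (the name `validate_targets` and the choice of `TypeError` versus `ValueError` are assumptions, not part of the lab spec); it rejects float or 2D one-hot targets and out-of-range class indices.

```python
import numpy as np


def validate_targets(targets, logits_shape):
    """Raise unless `targets` is a 1D integer array of class indices for (B, V) logits."""
    batch, num_classes = logits_shape
    if not isinstance(targets, np.ndarray):
        raise TypeError(f"targets must be an np.ndarray, got {type(targets).__name__}")
    if not np.issubdtype(targets.dtype, np.integer):
        raise TypeError(f"targets must have an integer dtype, got {targets.dtype}")
    if targets.shape != (batch,):
        # A one-hot matrix of shape (B, V) ends up here even if it is integer-typed.
        raise ValueError(f"targets must have shape ({batch},), got {targets.shape}")
    if targets.size and (targets.min() < 0 or targets.max() >= num_classes):
        raise ValueError(f"target indices must lie in [0, {num_classes})")
    return targets


checked = validate_targets(np.array([2, 0], dtype=np.int32), (2, 5))  # passes
```

Accepting any NumPy integer dtype covers both `np.int64` and `int32` targets without special-casing either.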
Test patterns
For each op, write:
1. Cross-check vs PyTorch FP64 at the typical shape pairs.
2. Gradcheck at a small shape (≤ (3, 4)).
3. Edge case (size-1 dim, scalar reduction, etc.).
4. Stress / stability test (logits with large magnitude for softmax / CE).
For cross_entropy, also test reduction='sum' and reduction='none'.
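Pattern 2 (gradcheck) can be illustrated with a central-difference check on raw NumPy arrays. This is a sketch only: `numerical_grad` and `mean_cross_entropy` are illustrative names, and the lab's own gradcheck utility from the earlier labs may differ. The `eps=1e-6` and `atol=1e-4` values are the ones quoted in the stop conditions.

```python
import numpy as np


def numerical_grad(f, x, eps=1e-6):
    """Central-difference gradient of scalar-valued f at x (x is restored afterwards)."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        orig = x[idx]
        x[idx] = orig + eps
        f_plus = f(x)
        x[idx] = orig - eps
        f_minus = f(x)
        x[idx] = orig
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad


def mean_cross_entropy(logits, targets):
    """Mean cross-entropy from logits via log-sum-exp."""
    m = logits.max(axis=-1, keepdims=True)
    lse = (m + np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))).squeeze(-1)
    return (lse - logits[np.arange(len(targets)), targets]).mean()


rng = np.random.default_rng(0)
logits = rng.standard_normal((3, 4))  # small shape, as the pattern suggests
targets = np.array([1, 3, 0])

# Analytic gradient: (softmax - one_hot) / B.
shifted = logits - logits.max(axis=-1, keepdims=True)
analytic = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
analytic[np.arange(3), targets] -= 1.0
analytic /= 3

numeric = numerical_grad(lambda z: mean_cross_entropy(z, targets), logits)
np.testing.assert_allclose(numeric, analytic, atol=1e-4)
```

Dropping the final `/ 3` makes the assertion fail by a factor of the batch size, which is the signature of the `reduction='mean'` pitfall.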
Stop conditions
Done when:
- `matmul`, `softmax`, `cross_entropy` all implemented and tested.
- Cross-check PyTorch FP64 green for every op shape combination.
- Gradcheck green for every op at `eps=1e-6, atol=1e-4`.
- Stability stress tests all green.
- End-to-end grammar test green.
- `mypy --strict` clean across `src/minitorch/`.
- You can re-derive `∂L/∂x = softmax(x) - one_hot(y)` on a blank page, by index, in under 5 minutes.
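For reference, a compact version of the by-index derivation asked for in the last item, written for a single example with logits $x$, target class $y$, and $s = \mathrm{softmax}(x)$. The symbols $s_j$ and $\delta_{jy}$ are notation chosen for this sketch.

```latex
\begin{align*}
L &= -\log s_y = \log\sum_k e^{x_k} - x_y, \\
\frac{\partial L}{\partial x_j}
  &= \frac{e^{x_j}}{\sum_k e^{x_k}} - \delta_{jy}
   = s_j - \delta_{jy}.
\end{align*}
```

Stacking over $j$ gives `softmax(x) - one_hot(y)`. With `reduction='mean'` over a batch of $B$ examples, each row is additionally divided by $B$.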
Pitfalls
- Softmax backward formula confusion. Two equivalent forms: (a) full Jacobian `∂s_i/∂x_j = s_i(δ_{ij} - s_j)` then matrix-multiply with `dy`; (b) the vector form `s * (dy - sum(dy * s))`. Form (b) is `O(N)`; form (a) is `O(N²)`. Use (b). Derive (b) from (a) once on paper.
- `cross_entropy` backward forgets the `/ B` for `reduction='mean'`. Catches you when comparing to PyTorch: gradients are off by exactly the batch size.
- Batched matmul transpose. `np.matmul((B, M, N), (B, N, P))` → `(B, M, P)`. The "transpose" in backward is `.swapaxes(-1, -2)`. If you wrote `.T`, every axis is reversed, so batched cases either fail with a shape error or, when the dims happen to line up, silently compute the wrong gradient. Add a 3D batched test.
- Softmax with `axis != -1`. All the broadcasting in the backward formula assumes `axis` is the last; for general `axis` you need `keepdims=True` consistently. Test with `axis=0` on a `(3, 4)` tensor.
- `cross_entropy` with a single example. `logits.shape = (1, V)`, `targets.shape = (1,)`. Common edge case; test it.
- Targets dtype. PyTorch wants `torch.long`; our `cross_entropy` should accept `np.int64` (or `int32`). Document the accepted dtypes.
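The batched-transpose pitfall is easy to see directly in NumPy. The short sketch below, using arbitrary example shapes, shows that `.T` reverses all axes while `swapaxes(-1, -2)` leaves the batch dim alone, and that the two disagree even when every dim has the same size.

```python
import numpy as np

x = np.arange(24.0).reshape(2, 3, 4)    # (B, M, N)
print(x.T.shape)                        # (4, 3, 2): every axis reversed
print(np.swapaxes(x, -1, -2).shape)     # (2, 4, 3): batch dim preserved

# With equal dims the shapes agree, so nothing raises, but the values differ.
cube = np.arange(27.0).reshape(3, 3, 3)
same = np.array_equal(cube.T, np.swapaxes(cube, -1, -2))
print(same)                             # False
```

The cubic case is the dangerous one: no shape error fires, so only a value-level comparison against PyTorch or a gradcheck will catch it.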
When to consult solutions/
After all four blocks pass all tests, including the grammar end-to-end. solutions/03-matmul-softmax-ce-ref.md (at phase open) compares your three closures and the cross_entropy fused op against the canonical implementations. It also includes a one-page "if you spent more than 90 minutes debugging softmax backward, read this first" rescue page.
This is the last lab of Phase 8. When all three labs are green, the experiments/08-tense-classifier/ MLP becomes feasible — and that experiment is what closes the phase.