Lab 03 — matmul, softmax, cross_entropy: the three high-stakes ops

Goal: implement the three ops that the rest of the curriculum depends on. matmul is the workhorse of every layer; softmax is the workhorse of every classifier; cross_entropy (from logits) is the loss that drives Phase 9's MLP and Phase 17's transformer. Get them right here — at FP64, with both oracles green — and Phase 9+ inherits a trustworthy foundation.

Estimated time: 4–5 hours. This is the longest lab in the phase.

Prereqs: Labs 00 + 01 + 02. Theory 03-matmul-and-softmax-grads.md re-read end-to-end. Phase 2's stable_softmax and stable_cross_entropy open in another window — we'll reuse the intuition, not the code.


What you produce

  • Three ops added to src/minitorch/tensor.py: Tensor.matmul, Tensor.softmax, module-level cross_entropy(logits, targets, reduction='mean').
  • The @ operator wired up: A @ B calls A.matmul(B).
  • tests/test_matmul.py, tests/test_softmax.py, tests/test_cross_entropy.py — each with PyTorch cross + gradcheck + edge cases.
  • A grammar-flavoured end-to-end gradient test: (one_hot_person @ tense_logits → softmax → cross_entropy) matches PyTorch.

Why these three are "high-stakes"

Bugs in add show up at the next test boundary. Bugs in matmul/softmax/cross_entropy show up as silent training failures three days later: model trains, loss decreases, validation accuracy plateaus 5% below where it should be. The Phase 8 testing strategy exists primarily to catch bugs in these three ops before they pollute Phase 9–22.

The three operations you cannot afford to get wrong: matmul (silent bias in any linear layer), softmax (numerical overflow/underflow), cross-entropy (the neat softmax - one_hot formula is trivial to derive but easy to implement incorrectly in the batched case). If all three pass the PyTorch cross-check and gradcheck, the rest of the curriculum rests on solid foundations.

TODOs

Block A — matmul

  • matmul(self, other) -> Tensor: forward self.data @ other.data. Backward by the derivation in theory/03:
  • self.grad += out.grad @ other.data.swapaxes(-1, -2)
  • other.grad += self.data.swapaxes(-1, -2) @ out.grad
  • Use _unbroadcast on each: matmul broadcasts batch dims. With A.shape = (1, 3, 4) and B.shape = (N, 4, 5), A @ B has shape (N, 3, 5). The (1, 3, 4) parent needs its gradient summed over the batch dim.
  • Wire __matmul__ so that A @ B works.
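  • Test matrix: 2D × 2D, 2D × 1D (vector), 1D × 2D, batched 3D × 3D, batched 3D × 2D (with broadcast). The swapaxes formulas above assume both operands have at least two dims; 1D operands need separate handling, because swapaxes(-1, -2) raises on a 1D array.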
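The backward rule in Block A can be sketched on raw NumPy arrays before it is wired into the Tensor class. The helper names `unbroadcast` and `matmul_backward` are hypothetical stand-ins (the lab's own helper is `_unbroadcast`, whose exact signature is not shown here), and the sketch covers only operands with at least two dims.

```python
import numpy as np

def unbroadcast(grad, shape):
    """Sum `grad` down to `shape`, undoing NumPy broadcasting (hypothetical helper)."""
    # Sum away leading dims that broadcasting prepended.
    while grad.ndim > len(shape):
        grad = grad.sum(axis=0)
    # Sum, keeping the dim, over axes that were stretched from size 1.
    for axis, size in enumerate(shape):
        if size == 1 and grad.shape[axis] != 1:
            grad = grad.sum(axis=axis, keepdims=True)
    return grad

def matmul_backward(a, b, dout):
    """Gradients of out = a @ b for operands with ndim >= 2."""
    da = dout @ np.swapaxes(b, -1, -2)
    db = np.swapaxes(a, -1, -2) @ dout
    return unbroadcast(da, a.shape), unbroadcast(db, b.shape)

rng = np.random.default_rng(0)
a = rng.standard_normal((1, 3, 4))   # broadcast over the batch dim
b = rng.standard_normal((2, 4, 5))
dout = np.ones((2, 3, 5))            # upstream gradient of (a @ b).sum()
da, db = matmul_backward(a, b, dout)
print(da.shape, db.shape)            # (1, 3, 4) (2, 4, 5)
```

The gradient for `a` comes back with shape (1, 3, 4) because the size-1 batch axis is summed with keepdims=True, which is the case the 3D × 2D broadcast test is meant to exercise.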

Block B — softmax

  • softmax(self, axis=-1) -> Tensor: forward uses the max-subtraction stability trick (Phase 2):
    shifted = self.data - self.data.max(axis=axis, keepdims=True)
    exps = np.exp(shifted)
    out_data = exps / exps.sum(axis=axis, keepdims=True)
    
    Backward by the Jacobian-vector product form (theory 03):
    # out.data is `s` (the softmax output, shape == self.shape)
    # out.grad is `dy`
    s = out.data
    dy = out.grad
    self.grad += s * (dy - (dy * s).sum(axis=axis, keepdims=True))
    
    Derive on paper why this is the dense Jacobian collapsed to a vector formula. Spend 20 minutes on the derivation; the result is one of those identities that becomes "obvious" once seen.
  • Test against PyTorch on rank-1, rank-2, rank-3 tensors. Stress test with logits of magnitude 1e3 (must not overflow).
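As a companion to the paper derivation, the following sketch checks numerically that the vector formula agrees with multiplying by the dense Jacobian. It works on plain arrays; `softmax_forward` and `softmax_backward` are illustrative names, not the lab's required API.

```python
import numpy as np

def softmax_forward(x, axis=-1):
    # Max-subtraction keeps every exponent <= 0, so exp cannot overflow.
    shifted = x - x.max(axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=axis, keepdims=True)

def softmax_backward(s, dy, axis=-1):
    # Vector form of the Jacobian-vector product.
    return s * (dy - (dy * s).sum(axis=axis, keepdims=True))

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
dy = rng.standard_normal(4)
s = softmax_forward(x)

# Dense Jacobian J[i, j] = s_i * (delta_ij - s_j), built explicitly for comparison.
J = np.diag(s) - np.outer(s, s)
dx_dense = J.T @ dy                  # O(N^2) reference
dx_vector = softmax_backward(s, dy)  # O(N) formula
print(np.allclose(dx_dense, dx_vector))
```

Because J is symmetric, `J @ dy` and `J.T @ dy` coincide here; the transpose is written out only to make the vector-Jacobian direction explicit.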

Block C — cross_entropy from logits

  • cross_entropy(logits: Tensor, targets: IntArray, reduction: str = 'mean') -> Tensor (module-level function):
  • targets is a plain np.ndarray of integers (not a Tensor) — targets don't have gradients.
  • Forward uses the log-sum-exp trick (Phase 2's stable_cross_entropy):
    # logits.shape = (B, V); targets.shape = (B,)
    lse = log_sum_exp(logits.data, axis=-1)  # shape (B,)
    target_logit = np.take_along_axis(logits.data, targets[:, None], axis=-1).squeeze(-1)
    losses = lse - target_logit              # shape (B,)
    if reduction == 'mean':
        out_data = losses.mean()
    elif reduction == 'sum':
        out_data = losses.sum()
    elif reduction == 'none':
        out_data = losses
    else:
        raise ValueError(f"unknown reduction: {reduction!r}")
    
  • Backward — the one beautiful identity:
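    # Recompute the softmax probabilities `p` stably (max trick); do NOT reuse raw exps of the logits.
    # `softmax` here is a raw-ndarray helper, not the Tensor.softmax method.
    B = logits.data.shape[0]
    p = softmax(logits.data, axis=-1)        # stable, shape (B, V)
    grad = p.copy()
    np.add.at(grad, (np.arange(B), targets), -1.0)  # subtract one-hot
    if reduction == 'mean':
        grad /= B
    # multiply by upstream out.grad: scalar for mean/sum, shape (B,) for none
    upstream = out.grad[:, None] if reduction == 'none' else out.grad
    logits.grad += grad * upstream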
    
  • Do not implement cross_entropy as (-target_log_softmax).mean() chained from existing ops. The combined op is numerically stable and avoids materializing log(softmax). Implement as a single fused op. (This mirrors PyTorch's F.cross_entropy.)
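The forward and backward pieces of Block C can be exercised together on raw arrays. This is a minimal sketch under stated assumptions: `log_sum_exp` is written out here because Phase 2's version is not shown, and `cross_entropy_forward` / `cross_entropy_backward` are hypothetical names for the two halves of the fused op.

```python
import numpy as np

def log_sum_exp(x, axis=-1):
    # Stable log(sum(exp(x))): factor out the max before exponentiating.
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def cross_entropy_forward(logits, targets, reduction="mean"):
    lse = log_sum_exp(logits, axis=-1)                                   # (B,)
    target_logit = np.take_along_axis(logits, targets[:, None], axis=-1).squeeze(-1)
    losses = lse - target_logit                                          # (B,)
    if reduction == "mean":
        return losses.mean()
    if reduction == "sum":
        return losses.sum()
    if reduction == "none":
        return losses
    raise ValueError(f"unknown reduction: {reduction!r}")

def cross_entropy_backward(logits, targets, upstream, reduction="mean"):
    B = logits.shape[0]
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exps = np.exp(shifted)
    grad = exps / exps.sum(axis=-1, keepdims=True)   # stable softmax, (B, V)
    grad[np.arange(B), targets] -= 1.0               # subtract one-hot
    if reduction == "mean":
        grad /= B
    upstream = np.asarray(upstream, dtype=logits.dtype)
    if reduction == "none":
        upstream = upstream[:, None]                 # (B,) -> (B, 1) so it scales rows
    return grad * upstream

logits = np.array([[1000.0, 0.0, 0.0]])
targets = np.array([2])
print(cross_entropy_forward(logits, targets))        # 1000.0
print(cross_entropy_backward(logits, targets, 1.0))  # [[ 1.  0. -1.]]
```

For reduction='none' the upstream gradient has shape (B,), so it is reshaped to (B, 1) before scaling; multiplying a (B, V) array by a (B,) array directly would broadcast along the wrong axis or fail.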

Block D — end-to-end grammar test

After Blocks A–C work in isolation, build the grammar mini-graph and verify against PyTorch.

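import numpy as np
import torch
from minitorch.tensor import Tensor, cross_entropy  # assumes the src/minitorch/tensor.py layout above

def test_grammar_pipeline_matches_pytorch():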
    # The §A13 baseline: select a person, project against tense logits, classify.
    rng = np.random.default_rng(42)
    person_onehot = np.array([0.0, 1.0, 0.0])           # "you" (2nd person)
    W = rng.standard_normal((3, 5))                     # person → tense
    targets = np.array([2])                             # past-simple
    # ours
    P = Tensor(person_onehot[None, :], requires_grad=False)  # (1, 3)
    Wt = Tensor(W, requires_grad=True)                       # (3, 5)
    logits = P @ Wt                                          # (1, 5)
    loss = cross_entropy(logits, targets)
    loss.backward()
    # pytorch
    Pt = torch.tensor(person_onehot[None, :], dtype=torch.float64)
    Wpt = torch.tensor(W, dtype=torch.float64, requires_grad=True)
    logits_t = Pt @ Wpt
    loss_t = torch.nn.functional.cross_entropy(logits_t, torch.tensor(targets, dtype=torch.long))
    loss_t.backward()
    np.testing.assert_allclose(loss.data, loss_t.item(), rtol=1e-7)
    np.testing.assert_allclose(Wt.grad, Wpt.grad.numpy(), rtol=1e-7, atol=1e-12)

Numerical stability stress tests

These tests must pass. They're the difference between an autograd that works on the textbook example and one that works in Phase 17's transformer.

def test_softmax_large_logits():
    # Logits with magnitude 1e3 must not overflow.
    x = Tensor(np.array([1000.0, 1001.0, 999.0]), requires_grad=True)
    y = x.softmax()
    y.sum().backward()
    assert np.all(np.isfinite(y.data))
    assert np.all(np.isfinite(x.grad))
    assert np.isclose(y.data.sum(), 1.0)

def test_cross_entropy_confident_correct():
    # Confident, correct prediction → ~0 loss; gradient is ~0.
    logits = Tensor(np.array([[1000.0, 0.0, 0.0]]), requires_grad=True)
    targets = np.array([0])
    loss = cross_entropy(logits, targets)
    loss.backward()
    assert loss.data < 1e-10
    assert np.allclose(logits.grad, np.array([[0.0, 0.0, 0.0]]), atol=1e-9)

def test_cross_entropy_confident_wrong():
    # Confident, wrong prediction → loss ≈ 1000; gradient is (1, 0, -1).
    logits = Tensor(np.array([[1000.0, 0.0, 0.0]]), requires_grad=True)
    targets = np.array([2])
    loss = cross_entropy(logits, targets)
    loss.backward()
    assert np.isclose(loss.data, 1000.0, atol=1e-3)
    expected = np.array([[1.0, 0.0, -1.0]])
    np.testing.assert_allclose(logits.grad, expected, atol=1e-9)
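To see why the confident-wrong case needs the log-sum-exp route, the sketch below compares it with taking the log of an already-computed softmax on the same logits. It uses plain NumPy on a single example; the variable names are illustrative only.

```python
import numpy as np

logits = np.array([1000.0, 0.0, 0.0])
target = 2

# Route 1: stable softmax first, then log. The target probability underflows to 0.
shifted = logits - logits.max()
p = np.exp(shifted) / np.exp(shifted).sum()
with np.errstate(divide="ignore"):
    naive_loss = -np.log(p[target])      # inf

# Route 2: log-sum-exp minus the target logit. No probability is ever formed.
m = logits.max()
lse = m + np.log(np.exp(logits - m).sum())
fused_loss = lse - logits[target]        # 1000.0

print(naive_loss, fused_loss)
```

Even with the max trick in the softmax, exp(-1000) is exactly 0.0 in FP64, so the information needed for the loss is already lost before the log is taken.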

Constraints

  • softmax always uses the max trick. No naive exp / sum(exp) anywhere in the forward.
  • cross_entropy never materializes log(softmax). Use LSE.
  • matmul backward uses swapaxes(-1, -2), not .T. .T only transposes 2D arrays cleanly; for batched matmul we need to swap the last two dims while preserving batch dims.
  • Targets to cross_entropy are integers, not Tensors. Reject one-hot targets at the API boundary: raise a loud, descriptive error.
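One possible shape for the boundary check on targets is sketched below. `validate_targets` is a hypothetical helper, and the choice of TypeError versus ValueError is an assumption rather than something the lab prescribes.

```python
import numpy as np

def validate_targets(targets, logits_shape):
    """Return targets as an integer array of shape (B,), or raise (hypothetical helper)."""
    targets = np.asarray(targets)
    batch, num_classes = logits_shape
    # Float one-hot rows fail here.
    if not np.issubdtype(targets.dtype, np.integer):
        raise TypeError(
            f"targets must be integer class indices, got dtype {targets.dtype}"
        )
    # Integer one-hot matrices of shape (B, V) fail here.
    if targets.shape != (batch,):
        raise ValueError(
            f"targets must have shape ({batch},), got {targets.shape}; "
            "one-hot targets are not accepted"
        )
    if targets.size and (targets.min() < 0 or targets.max() >= num_classes):
        raise ValueError(f"target indices must lie in [0, {num_classes})")
    return targets

print(validate_targets(np.array([2, 0]), (2, 3)))  # [2 0]
```

Checking dtype before shape means a float one-hot matrix is reported as a dtype problem, which tends to point the caller at the real mistake faster.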

Test patterns

For each op, write:

  1. Cross-check vs PyTorch FP64 at the typical shape pairs.
  2. Gradcheck at a small shape (≤ (3, 4)).
  3. Edge case (size-1 dim, scalar reduction, etc.).
  4. Stress / stability test (logits with large magnitude for softmax / CE).

For cross_entropy, also test reduction='sum' and reduction='none'.
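If the gradcheck helper from the earlier labs is not at hand, the pattern can be sketched with central differences on raw arrays. `numerical_grad` and `gradcheck` below are illustrative names and signatures, not necessarily those of the course's own helper; the defaults mirror the eps and atol quoted in the stop conditions.

```python
import numpy as np

def numerical_grad(f, x, eps=1e-6):
    """Central-difference gradient of a scalar-valued f at x (x must be float64)."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        orig = x[idx]
        x[idx] = orig + eps
        f_plus = f(x)
        x[idx] = orig - eps
        f_minus = f(x)
        x[idx] = orig                      # restore before moving on
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

def gradcheck(f, analytic_grad, x, eps=1e-6, atol=1e-4):
    return np.allclose(numerical_grad(f, x, eps), analytic_grad, rtol=0.0, atol=atol)

# Toy check at a (3, 4) shape: f(x) = sum(x**3) has gradient 3 * x**2.
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 4))
ok = gradcheck(lambda v: (v ** 3).sum(), 3 * x ** 2, x)
print(ok)
```

Keeping the shape small matters because the loop evaluates `f` twice per element, so the cost grows with the number of entries.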

Stop conditions

Done when:

  1. matmul, softmax, cross_entropy all implemented and tested.
  2. Cross-check PyTorch FP64 green for every op shape combination.
  3. Gradcheck green for every op at eps=1e-6, atol=1e-4.
  4. Stability stress tests all green.
  5. End-to-end grammar test green.
  6. mypy --strict clean across src/minitorch/.
  7. You can re-derive ∂L/∂x = softmax(x) - one_hot(y) on a blank page, by index, in under 5 minutes.
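For reference when checking the blank-page derivation in item 7, one way to write it by index for a single example with logits x and target class y is sketched below. The notation (s for the softmax output, δ for the Kronecker delta) is chosen here and may differ from theory/03.

```latex
\begin{aligned}
s_j &= \frac{e^{x_j}}{\sum_k e^{x_k}}, \qquad
L = -\log s_y = -x_y + \log \sum_k e^{x_k} \\
\frac{\partial L}{\partial x_j}
  &= -\delta_{jy} + \frac{e^{x_j}}{\sum_k e^{x_k}}
   = s_j - \delta_{jy}
\end{aligned}
```

With reduction='mean' over a batch of size B, each row of the gradient picks up an extra factor of 1/B.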

Pitfalls

  • Softmax backward formula confusion. Two equivalent forms: (a) full Jacobian ∂s_i/∂x_j = s_i(δ_{ij} - s_j) then matrix-multiply with dy; (b) the vector form s * (dy - sum(dy * s)). Form (b) is O(N); form (a) is O(N²). Use (b). Derive (b) from (a) once on paper.
  • cross_entropy backward forgets the / B for reduction='mean'. Catches you when comparing to PyTorch — gradients off by exactly the batch size.
  • Batched matmul transpose. np.matmul((B, M, N), (B, N, P)) → (B, M, P). The "transpose" in backward is .swapaxes(-1, -2). On a 3D array .T reverses all axes, so batched cases either raise a shape error or, when the dims happen to line up, silently produce wrong values. Add a 3D batched test.
  • Softmax with axis != -1. All the broadcasting in the backward formula assumes axis is the last; for general axis you need keepdims=True consistently. Test with axis=0 on a (3, 4) tensor.
  • cross_entropy with a single example. logits.shape = (1, V), targets.shape = (1,). Common edge case; test it.
  • Targets dtype. PyTorch wants torch.long; our cross_entropy should accept np.int64 (or int32). Document the accepted dtypes in the docstring.
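The batched-transpose pitfall is easy to reproduce in isolation. The sketch below uses arbitrary shapes chosen for illustration and shows both failure modes of .T on 3D arrays: a shape error in the general case, and wrong values when every dim happens to be equal.

```python
import numpy as np

rng = np.random.default_rng(0)
b = rng.standard_normal((2, 4, 5))        # (B, N, P)
dout = rng.standard_normal((2, 3, 5))     # (B, M, P)

print(b.T.shape)                  # (5, 4, 2): every axis reversed
print(b.swapaxes(-1, -2).shape)   # (2, 5, 4): batch dim preserved

da = dout @ b.swapaxes(-1, -2)    # (B, M, N), as the backward rule requires
try:
    dout @ b.T
    raised = False
except ValueError:
    raised = True                 # shapes do not line up

# When all dims are equal, .T no longer raises; it just computes something else.
c = rng.standard_normal((3, 3, 3))
g = rng.standard_normal((3, 3, 3))
silently_wrong = not np.allclose(g @ c.T, g @ c.swapaxes(-1, -2))
print(raised, silently_wrong)
```

A batched test with distinct B, M, N and P surfaces the shape error right away; a second test with all dims equal covers the silent case.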

When to consult solutions/

After all four blocks pass all tests, including the grammar end-to-end. solutions/03-matmul-softmax-ce-ref.md (at phase open) compares your three closures and the cross_entropy fused op against the canonical implementations. It also includes a one-page "if you spent more than 90 minutes debugging softmax backward, read this first" rescue page.


This is the last lab of Phase 8. When all of the phase's labs are green, the experiments/08-tense-classifier/ MLP becomes feasible, and that experiment is what closes the phase.