Skip to content

English · Español

Lab 01 — Elementwise ops with broadcasting-correct backward

Goal: add the family of elementwise ops to Tensor: add sub mul div neg exp log relu gelu tanh. Each forward respects NumPy broadcasting; each backward correctly sums along broadcast axes. Cross-check every op against PyTorch FP64 (tolerance 1e-7) and gradcheck (tolerance 1e-4 at FP64).

Estimated time: 3–4 hours (it's a lot of ops, with one tricky idea — broadcast reverse).

Prereqs: Lab 00. Theory 02 (02-tensor-op-derivatives.md) — re-read the broadcasting-reverse derivation.


What you produce

  • All 9 elementwise ops in src/minitorch/tensor.py, exposed via dunders and methods.
  • tests/test_elementwise.py with two test classes per op: PyTorch cross-check + gradcheck.
  • tests/test_broadcast_reverse.py with the broadcasting-pair tests (a (3,) op a (4,3), a (N,1) op a (1,M), etc.).

The broadcasting-reverse helper

Per theory/02, the helper:

def _unbroadcast(grad: FloatArray, target_shape: tuple[int, ...]) -> FloatArray:
    # If grad.shape is "wider" than target_shape on the left, sum out the extra leading dims.
    while grad.ndim > len(target_shape):
        grad = grad.sum(axis=0)
    # For each remaining axis, if target dim is 1 but grad dim > 1, sum that axis with keepdims=True.
    for i, t_dim in enumerate(target_shape):
        if t_dim == 1 and grad.shape[i] > 1:
            grad = grad.sum(axis=i, keepdims=True)
    return grad

Implement this once. Every elementwise op's _backward calls it before assigning into a parent's .grad.

🇪🇸 La trampa más común del lab: olvidar _unbroadcast y dejar un gradiente con shape distinta a la del parent. Al sumar parent.grad += new_grad NumPy difunde la suma silenciosamente. El resultado: una matriz donde esperabas un vector. Asegúrate de que tras cada _backward, parent.grad.shape == parent.data.shape.

TODOs (per op)

For each op, do the same five things:

  1. Forward: compute out_data with NumPy. Construct out = Tensor(out_data, _prev=(self, other), _op='<name>') and propagate requires_grad.
  2. Backward: define a closure capturing the parents and any cached values; on call, contribute _unbroadcast(local_grad, parent.shape) into each parent's .grad.
  3. PyTorch test: build the same expression in PyTorch FP64; compare gradients with np.testing.assert_allclose(..., rtol=1e-7).
  4. Gradcheck test: numerical FD at FP64; tolerance 1e-4.
  5. Broadcasting test: at least one shape pair where forward must broadcast.

Block A — additive family

  • __add__(self, other): forward self.data + other.data. Backward: self.grad += _unbroadcast(out.grad, self.shape); same for other (treat scalars/Python ints by wrapping in Tensor).
  • __radd__: for 2 + t.
  • __neg__(self): forward -self.data. Backward: self.grad += _unbroadcast(-out.grad, self.shape).
  • __sub__(self, other): reuse add + neg if you prefer; or implement directly. Document the choice.
  • __rsub__: for 2 - t.

Block B — multiplicative family

  • __mul__(self, other): forward self.data * other.data. Backward: self.grad += _unbroadcast(other.data * out.grad, self.shape); other.grad += _unbroadcast(self.data * out.grad, other.shape). Note the swap: ∂(a*b)/∂a = b.
  • __rmul__.
  • __truediv__(self, other): forward self.data / other.data. Backward: ∂(a/b)/∂a = 1/b, ∂(a/b)/∂b = -a/b². Cache other.data (do not cache 1/other.data — risk of div-by-zero in the cache).
  • __rtruediv__: for 2.0 / t.

Block C — unary nonlinearities (Tensor methods)

  • exp(self): forward np.exp(self.data). Backward: self.grad += _unbroadcast(out.data * out.grad, self.shape) — note we use out.data (which equals exp(self.data)).
  • log(self): forward np.log(self.data). Backward: self.grad += _unbroadcast(out.grad / self.data, self.shape). Assert self.data > 0 in the forward (raise loudly on invalid input).
  • relu(self): forward np.maximum(0, self.data). Backward: self.grad += _unbroadcast(out.grad * (self.data > 0).astype(self.data.dtype), self.shape). Sub-gradient at 0 = 0.
  • tanh(self): forward np.tanh(self.data). Backward: self.grad += _unbroadcast(out.grad * (1 - out.data**2), self.shape).
  • gelu(self): forward 0.5 * self.data * (1 + np.tanh(...)) (use the tanh approximation from theory). Backward: derivative is non-trivial — re-derive on paper before coding; cache intermediates.

Constraints

  • Functional only. No in-place ops.
  • No numpy.errstate silencing warnings. Let log(<=0) raise.
  • Every op gets a Spanish summary line in the eventual solutions/01-elementwise-ops-ref.md — keep your theory/02-tensor-op-derivatives.md open for the derivations.
  • Use Phase 6's seeding utilities in tests: seed_everything(42).

Test patterns

PyTorch cross-check (one example)

import torch, numpy as np
from minitorch import Tensor

def test_mul_pytorch_cross():
    rng = np.random.default_rng(42)
    a_data = rng.standard_normal((3, 4)).astype(np.float64)
    b_data = rng.standard_normal((3, 4)).astype(np.float64)
    # Our autograd.
    A = Tensor(a_data, requires_grad=True)
    B = Tensor(b_data, requires_grad=True)
    L = (A * B).sum()
    L.backward()
    # PyTorch reference.
    At = torch.tensor(a_data, dtype=torch.float64, requires_grad=True)
    Bt = torch.tensor(b_data, dtype=torch.float64, requires_grad=True)
    Lt = (At * Bt).sum()
    Lt.backward()
    np.testing.assert_allclose(A.grad, At.grad.numpy(), rtol=1e-7)
    np.testing.assert_allclose(B.grad, Bt.grad.numpy(), rtol=1e-7)

Write one of these per op. ~10 lines each.

Gradcheck (one example)

def test_mul_gradcheck():
    rng = np.random.default_rng(0)
    a = Tensor(rng.standard_normal((2, 3)), requires_grad=True)
    b = Tensor(rng.standard_normal((2, 3)), requires_grad=True)
    f = lambda: (a * b).sum()
    assert gradcheck_pair(f, a, b, eps=1e-6, atol=1e-4)

Broadcasting test (one example, grammar-flavoured)

def test_mul_broadcasts_person_logits():
    # (3,) person one-hot times (3, 5) (person, tense) logits.
    person = Tensor(np.array([0.0, 1.0, 0.0]), requires_grad=True)
    logits = Tensor(np.random.default_rng(7).standard_normal((3, 5)), requires_grad=True)
    out = (person[:, None] * logits).sum()
    out.backward()
    assert person.grad.shape == (3,)        # _unbroadcast restored
    assert logits.grad.shape == (3, 5)

Stop conditions

Done when:

  1. All 9 elementwise ops implemented.
  2. Each has at least 2 tests: PyTorch cross + gradcheck. All green.
  3. Each has at least 1 broadcasting test. All green.
  4. mypy --strict clean.
  5. You can answer: "if I forget _unbroadcast in __add__'s backward, what's the smallest test that catches it?"

Pitfalls

  • out.grad is None on the leaf. The very first _backward call is root._backward() after root.grad = np.array(1.0). If you wrote _backward to read out.grad, fine. If you accidentally captured out itself and then read out.grad before it was set by the previous reverse step, you get None. Trace order carefully.
  • Wrong sign in __rsub__ / __rtruediv__. 2 - t is (-t) + 2, not t - 2. Easy to flip.
  • Cached 1/other.data blows up. Keep other.data and divide in the closure; or guard against zero.
  • relu's gradient at exactly 0. Convention: 0. Don't accidentally use (self.data >= 0) — that's a different convention.
  • gelu's forward and backward don't match. Two GELU formulations (exact via erf and tanh-approximation). Pick one and use it consistently.

When to consult solutions/

After all 9 ops pass all three test families. solutions/01-elementwise-ops-ref.md (at phase open) compares your closures against the canonical implementations.


Next lab: lab/02-reduction-and-shape-ops.md.