English · Español

Lab 01 — Elementwise ops with broadcasting-correct backward¶

Goal: add the family of elementwise ops to Tensor: add sub mul div neg exp log relu gelu tanh. Each forward respects NumPy broadcasting; each backward correctly sums along broadcast axes. Cross-check every op against PyTorch FP64 (tolerance 1e-7) and gradcheck (tolerance 1e-4 at FP64).

Estimated time: 3–4 hours (it's a lot of ops, with one tricky idea — broadcast reverse).

Prereqs: Lab 00. Theory 02 (02-tensor-op-derivatives.md) — re-read the broadcasting-reverse derivation.

What you produce¶

All 9 elementwise ops in src/minitorch/tensor.py, exposed via dunders and methods.
tests/test_elementwise.py with two test classes per op: PyTorch cross-check + gradcheck.
tests/test_broadcast_reverse.py with the broadcasting-pair tests (a (3,) op a (4,3), a (N,1) op a (1,M), etc.).

The broadcasting-reverse helper¶

Per theory/02, the helper:

def _unbroadcast(grad: FloatArray, target_shape: tuple[int, ...]) -> FloatArray:
    # If grad.shape is "wider" than target_shape on the left, sum out the extra leading dims.
    while grad.ndim > len(target_shape):
        grad = grad.sum(axis=0)
    # For each remaining axis, if target dim is 1 but grad dim > 1, sum that axis with keepdims=True.
    for i, t_dim in enumerate(target_shape):
        if t_dim == 1 and grad.shape[i] > 1:
            grad = grad.sum(axis=i, keepdims=True)
    return grad

Implement this once. Every elementwise op's _backward calls it before assigning into a parent's .grad.

🇪🇸 La trampa más común del lab: olvidar _unbroadcast y dejar un gradiente con shape distinta a la del parent. Al sumar parent.grad += new_grad NumPy difunde la suma silenciosamente. El resultado: una matriz donde esperabas un vector. Asegúrate de que tras cada _backward, parent.grad.shape == parent.data.shape.

TODOs (per op)¶

For each op, do the same five things:

Forward: compute out_data with NumPy. Construct out = Tensor(out_data, _prev=(self, other), _op='<name>') and propagate requires_grad.
Backward: define a closure capturing the parents and any cached values; on call, contribute _unbroadcast(local_grad, parent.shape) into each parent's .grad.
PyTorch test: build the same expression in PyTorch FP64; compare gradients with np.testing.assert_allclose(..., rtol=1e-7).
Gradcheck test: numerical FD at FP64; tolerance 1e-4.
Broadcasting test: at least one shape pair where forward must broadcast.

Block A — additive family¶

__add__(self, other): forward self.data + other.data. Backward: self.grad += _unbroadcast(out.grad, self.shape); same for other (treat scalars/Python ints by wrapping in Tensor).
__radd__: for 2 + t.
__neg__(self): forward -self.data. Backward: self.grad += _unbroadcast(-out.grad, self.shape).
__sub__(self, other): reuse add + neg if you prefer; or implement directly. Document the choice.
__rsub__: for 2 - t.

Block B — multiplicative family¶

__mul__(self, other): forward self.data * other.data. Backward: self.grad += _unbroadcast(other.data * out.grad, self.shape); other.grad += _unbroadcast(self.data * out.grad, other.shape). Note the swap: ∂(a*b)/∂a = b.
__rmul__.
__truediv__(self, other): forward self.data / other.data. Backward: ∂(a/b)/∂a = 1/b, ∂(a/b)/∂b = -a/b². Cache other.data (do not cache 1/other.data — risk of div-by-zero in the cache).
__rtruediv__: for 2.0 / t.

Block C — unary nonlinearities (Tensor methods)¶

exp(self): forward np.exp(self.data). Backward: self.grad += _unbroadcast(out.data * out.grad, self.shape) — note we use out.data (which equals exp(self.data)).
log(self): forward np.log(self.data). Backward: self.grad += _unbroadcast(out.grad / self.data, self.shape). Assert self.data > 0 in the forward (raise loudly on invalid input).
relu(self): forward np.maximum(0, self.data). Backward: self.grad += _unbroadcast(out.grad * (self.data > 0).astype(self.data.dtype), self.shape). Sub-gradient at 0 = 0.
tanh(self): forward np.tanh(self.data). Backward: self.grad += _unbroadcast(out.grad * (1 - out.data**2), self.shape).
gelu(self): forward 0.5 * self.data * (1 + np.tanh(...)) (use the tanh approximation from theory). Backward: derivative is non-trivial — re-derive on paper before coding; cache intermediates.

Constraints¶

Functional only. No in-place ops.
No numpy.errstate silencing warnings. Let log(<=0) raise.
Every op gets a Spanish summary line in the eventual solutions/01-elementwise-ops-ref.md — keep your theory/02-tensor-op-derivatives.md open for the derivations.
Use Phase 6's seeding utilities in tests: seed_everything(42).

Test patterns¶

PyTorch cross-check (one example)¶

import torch, numpy as np
from minitorch import Tensor

def test_mul_pytorch_cross():
    rng = np.random.default_rng(42)
    a_data = rng.standard_normal((3, 4)).astype(np.float64)
    b_data = rng.standard_normal((3, 4)).astype(np.float64)
    # Our autograd.
    A = Tensor(a_data, requires_grad=True)
    B = Tensor(b_data, requires_grad=True)
    L = (A * B).sum()
    L.backward()
    # PyTorch reference.
    At = torch.tensor(a_data, dtype=torch.float64, requires_grad=True)
    Bt = torch.tensor(b_data, dtype=torch.float64, requires_grad=True)
    Lt = (At * Bt).sum()
    Lt.backward()
    np.testing.assert_allclose(A.grad, At.grad.numpy(), rtol=1e-7)
    np.testing.assert_allclose(B.grad, Bt.grad.numpy(), rtol=1e-7)

Write one of these per op. ~10 lines each.

Gradcheck (one example)¶

def test_mul_gradcheck():
    rng = np.random.default_rng(0)
    a = Tensor(rng.standard_normal((2, 3)), requires_grad=True)
    b = Tensor(rng.standard_normal((2, 3)), requires_grad=True)
    f = lambda: (a * b).sum()
    assert gradcheck_pair(f, a, b, eps=1e-6, atol=1e-4)

Broadcasting test (one example, grammar-flavoured)¶

def test_mul_broadcasts_person_logits():
    # (3,) person one-hot times (3, 5) (person, tense) logits.
    person = Tensor(np.array([0.0, 1.0, 0.0]), requires_grad=True)
    logits = Tensor(np.random.default_rng(7).standard_normal((3, 5)), requires_grad=True)
    out = (person[:, None] * logits).sum()
    out.backward()
    assert person.grad.shape == (3,)        # _unbroadcast restored
    assert logits.grad.shape == (3, 5)

Stop conditions¶

Done when:

All 9 elementwise ops implemented.
Each has at least 2 tests: PyTorch cross + gradcheck. All green.
Each has at least 1 broadcasting test. All green.
mypy --strict clean.
You can answer: "if I forget _unbroadcast in __add__'s backward, what's the smallest test that catches it?"

Pitfalls¶

out.grad is None on the leaf. The very first _backward call is root._backward() after root.grad = np.array(1.0). If you wrote _backward to read out.grad, fine. If you accidentally captured out itself and then read out.grad before it was set by the previous reverse step, you get None. Trace order carefully.
Wrong sign in __rsub__ / __rtruediv__. 2 - t is (-t) + 2, not t - 2. Easy to flip.
Cached 1/other.data blows up. Keep other.data and divide in the closure; or guard against zero.
relu's gradient at exactly 0. Convention: 0. Don't accidentally use (self.data >= 0) — that's a different convention.
gelu's forward and backward don't match. Two GELU formulations (exact via erf and tanh-approximation). Pick one and use it consistently.

When to consult `solutions/`¶

After all 9 ops pass all three test families. solutions/01-elementwise-ops-ref.md (at phase open) compares your closures against the canonical implementations.

Next lab: lab/02-reduction-and-shape-ops.md.