English · Español
Lab 01 — Elementwise ops with broadcasting-correct backward¶
Goal: add the family of elementwise ops to
Tensor:add sub mul div neg exp log relu gelu tanh. Each forward respects NumPy broadcasting; each backward correctly sums along broadcast axes. Cross-check every op against PyTorch FP64 (tolerance 1e-7) and gradcheck (tolerance 1e-4 at FP64).Estimated time: 3–4 hours (it's a lot of ops, with one tricky idea — broadcast reverse).
Prereqs: Lab 00. Theory 02 (
02-tensor-op-derivatives.md) — re-read the broadcasting-reverse derivation.
What you produce¶
- All 9 elementwise ops in
src/minitorch/tensor.py, exposed via dunders and methods. tests/test_elementwise.pywith two test classes per op: PyTorch cross-check + gradcheck.tests/test_broadcast_reverse.pywith the broadcasting-pair tests (a(3,)op a(4,3), a(N,1)op a(1,M), etc.).
The broadcasting-reverse helper¶
Per theory/02, the helper:
def _unbroadcast(grad: FloatArray, target_shape: tuple[int, ...]) -> FloatArray:
# If grad.shape is "wider" than target_shape on the left, sum out the extra leading dims.
while grad.ndim > len(target_shape):
grad = grad.sum(axis=0)
# For each remaining axis, if target dim is 1 but grad dim > 1, sum that axis with keepdims=True.
for i, t_dim in enumerate(target_shape):
if t_dim == 1 and grad.shape[i] > 1:
grad = grad.sum(axis=i, keepdims=True)
return grad
Implement this once. Every elementwise op's _backward calls it before assigning into a parent's .grad.
🇪🇸 La trampa más común del lab: olvidar
_unbroadcasty dejar un gradiente con shape distinta a la del parent. Al sumarparent.grad += new_gradNumPy difunde la suma silenciosamente. El resultado: una matriz donde esperabas un vector. Asegúrate de que tras cada_backward,parent.grad.shape == parent.data.shape.
TODOs (per op)¶
For each op, do the same five things:
- Forward: compute
out_datawith NumPy. Constructout = Tensor(out_data, _prev=(self, other), _op='<name>')and propagaterequires_grad. - Backward: define a closure capturing the parents and any cached values; on call, contribute
_unbroadcast(local_grad, parent.shape)into each parent's.grad. - PyTorch test: build the same expression in PyTorch FP64; compare gradients with
np.testing.assert_allclose(..., rtol=1e-7). - Gradcheck test: numerical FD at FP64; tolerance 1e-4.
- Broadcasting test: at least one shape pair where forward must broadcast.
Block A — additive family¶
-
__add__(self, other): forwardself.data + other.data. Backward:self.grad += _unbroadcast(out.grad, self.shape); same forother(treat scalars/Python ints by wrapping inTensor). -
__radd__: for2 + t. -
__neg__(self): forward-self.data. Backward:self.grad += _unbroadcast(-out.grad, self.shape). -
__sub__(self, other): reuseadd+negif you prefer; or implement directly. Document the choice. -
__rsub__: for2 - t.
Block B — multiplicative family¶
-
__mul__(self, other): forwardself.data * other.data. Backward:self.grad += _unbroadcast(other.data * out.grad, self.shape);other.grad += _unbroadcast(self.data * out.grad, other.shape). Note the swap:∂(a*b)/∂a = b. -
__rmul__. -
__truediv__(self, other): forwardself.data / other.data. Backward:∂(a/b)/∂a = 1/b,∂(a/b)/∂b = -a/b². Cacheother.data(do not cache1/other.data— risk of div-by-zero in the cache). -
__rtruediv__: for2.0 / t.
Block C — unary nonlinearities (Tensor methods)¶
-
exp(self): forwardnp.exp(self.data). Backward:self.grad += _unbroadcast(out.data * out.grad, self.shape)— note we useout.data(which equalsexp(self.data)). -
log(self): forwardnp.log(self.data). Backward:self.grad += _unbroadcast(out.grad / self.data, self.shape). Assertself.data > 0in the forward (raise loudly on invalid input). -
relu(self): forwardnp.maximum(0, self.data). Backward:self.grad += _unbroadcast(out.grad * (self.data > 0).astype(self.data.dtype), self.shape). Sub-gradient at 0 = 0. -
tanh(self): forwardnp.tanh(self.data). Backward:self.grad += _unbroadcast(out.grad * (1 - out.data**2), self.shape). -
gelu(self): forward0.5 * self.data * (1 + np.tanh(...))(use the tanh approximation from theory). Backward: derivative is non-trivial — re-derive on paper before coding; cache intermediates.
Constraints¶
- Functional only. No in-place ops.
- No
numpy.errstatesilencing warnings. Letlog(<=0)raise. - Every op gets a Spanish summary line in the eventual
solutions/01-elementwise-ops-ref.md— keep yourtheory/02-tensor-op-derivatives.mdopen for the derivations. - Use Phase 6's seeding utilities in tests:
seed_everything(42).
Test patterns¶
PyTorch cross-check (one example)¶
import torch, numpy as np
from minitorch import Tensor
def test_mul_pytorch_cross():
rng = np.random.default_rng(42)
a_data = rng.standard_normal((3, 4)).astype(np.float64)
b_data = rng.standard_normal((3, 4)).astype(np.float64)
# Our autograd.
A = Tensor(a_data, requires_grad=True)
B = Tensor(b_data, requires_grad=True)
L = (A * B).sum()
L.backward()
# PyTorch reference.
At = torch.tensor(a_data, dtype=torch.float64, requires_grad=True)
Bt = torch.tensor(b_data, dtype=torch.float64, requires_grad=True)
Lt = (At * Bt).sum()
Lt.backward()
np.testing.assert_allclose(A.grad, At.grad.numpy(), rtol=1e-7)
np.testing.assert_allclose(B.grad, Bt.grad.numpy(), rtol=1e-7)
Write one of these per op. ~10 lines each.
Gradcheck (one example)¶
def test_mul_gradcheck():
rng = np.random.default_rng(0)
a = Tensor(rng.standard_normal((2, 3)), requires_grad=True)
b = Tensor(rng.standard_normal((2, 3)), requires_grad=True)
f = lambda: (a * b).sum()
assert gradcheck_pair(f, a, b, eps=1e-6, atol=1e-4)
Broadcasting test (one example, grammar-flavoured)¶
def test_mul_broadcasts_person_logits():
# (3,) person one-hot times (3, 5) (person, tense) logits.
person = Tensor(np.array([0.0, 1.0, 0.0]), requires_grad=True)
logits = Tensor(np.random.default_rng(7).standard_normal((3, 5)), requires_grad=True)
out = (person[:, None] * logits).sum()
out.backward()
assert person.grad.shape == (3,) # _unbroadcast restored
assert logits.grad.shape == (3, 5)
Stop conditions¶
Done when:
- All 9 elementwise ops implemented.
- Each has at least 2 tests: PyTorch cross + gradcheck. All green.
- Each has at least 1 broadcasting test. All green.
mypy --strictclean.- You can answer: "if I forget
_unbroadcastin__add__'s backward, what's the smallest test that catches it?"
Pitfalls¶
out.gradisNoneon the leaf. The very first_backwardcall isroot._backward()afterroot.grad = np.array(1.0). If you wrote_backwardto readout.grad, fine. If you accidentally capturedoutitself and then readout.gradbefore it was set by the previous reverse step, you getNone. Trace order carefully.- Wrong sign in
__rsub__/__rtruediv__.2 - tis(-t) + 2, nott - 2. Easy to flip. - Cached
1/other.datablows up. Keepother.dataand divide in the closure; or guard against zero. relu's gradient at exactly 0. Convention: 0. Don't accidentally use(self.data >= 0)— that's a different convention.gelu's forward and backward don't match. Two GELU formulations (exact viaerfand tanh-approximation). Pick one and use it consistently.
When to consult solutions/¶
After all 9 ops pass all three test families. solutions/01-elementwise-ops-ref.md (at phase open) compares your closures against the canonical implementations.
Next lab: lab/02-reduction-and-shape-ops.md.