English · Español
Lab 02 — Reduction and shape ops¶
Goal: add the reduction + shape family to
Tensor:sum mean reshape transpose broadcast_to getitem cat stack. The "shape" ops are no-ops in the math sense (local Jacobian is identity) but require careful index bookkeeping in backward. The "reduction" ops are the converse: math is trivial, but the backward mustbroadcast_tothe upstream gradient back to the input shape.Estimated time: 3–4 hours.
Prereqs: Labs 00 + 01. Theory
02-tensor-op-derivatives.mdre-read (especially the reduction backward derivation).
What you produce¶
- All 7 ops added to
src/minitorch/tensor.py. tests/test_reductions.pywith cross-check + gradcheck per reduction op.tests/test_shape_ops.pywith shape-roundtrip tests + gradcheck per shape op.
The shape-contract reminder¶
After every _backward, parent.grad.shape == parent.data.shape. For reduction ops, this means broadcasting the upstream gradient back to the input shape. For shape ops, this means inverting the shape transformation on the gradient.
🇪🇸 La regla mental: una
reductioncolapsa ejes hacia adelante, y los expande hacia atrás; unareshapereorganiza el layout, y el backward solo desreorganiza. No hay álgebra nueva, solo cuidar el shape.
TODOs (per op)¶
For each op below, do the five things from Lab 01 (forward, backward, PyTorch test, gradcheck test, shape-edge test). Shape ops do not need broadcasting tests.
Block A — reductions¶
-
sum(self, axis=None, keepdims=False): forwardself.data.sum(axis=..., keepdims=...). Backward: the upstream gradient has shape equal to the forward output; you mustbroadcast_to(self.shape)after potentially re-inserting the reduced axes. - Implementation hint: if
keepdims=Falseandaxisis notNone, the upstreamout.gradhas rankself.data.ndim - len(axes). Usenp.expand_dimson each reduced axis to restore the reduced dims as size-1, thennp.broadcast_to(self.shape). -
When
axis=None, the output is a scalar; backward isnp.broadcast_to(out.grad, self.shape). -
mean(self, axis=None, keepdims=False): forwardself.data.mean(axis=..., keepdims=...). Backward identical tosumbut divide byN, whereNis the number of elements reduced. CacheNin the closure.
Block B — pure shape ops¶
-
reshape(self, new_shape: tuple[int, ...]): forwardself.data.reshape(new_shape). Backwardself.grad += out.grad.reshape(self.shape). No size change, no math. -
transpose(self, axes: tuple[int, ...] | None = None): forwardself.data.transpose(axes). Backward: transpose the upstream by the inverse permutation. Hint: ifaxes = (1, 0, 2), the inverse isargsort(axes) = (1, 0, 2)(involution here); in general usenp.argsort(axes). -
broadcast_to(self, shape: tuple[int, ...]): forwardnp.broadcast_to(self.data, shape)(returns a view; if your downstream ops mutate, make a copy — but they don't mutate, so view is fine). Backward: this is the only shape op that needs_unbroadcast— the gradient comes back atshapeand must be reduced toself.shape.
Block C — indexing/concat¶
-
__getitem__(self, key): forwardself.data[key]. Backward: allocatenp.zeros_like(self.data), scatter the upstream gradient atkey. Hint:grad[key] += out.gradworks for basic slices and fancy indexing without duplicate indices. Ifkeycould contain duplicates (e.g.self[[0, 0, 1]]), usenp.add.at(grad, key, out.grad)to ensure accumulation. -
cat(tensors: list[Tensor], axis: int)(module-level function, not a method): forwardnp.concatenate([t.data for t in tensors], axis=axis). Backward: slice the upstream gradient alongaxisaccording to each tensor's size on that axis; contribute each slice to the matching parent. -
stack(tensors: list[Tensor], axis: int): forwardnp.stack([t.data for t in tensors], axis=axis). Backward: split along the new axis and squeeze.
Tricky cases to cover¶
Case 1 — sum with keepdims=False over a single axis¶
x = Tensor(np.random.randn(3, 4), requires_grad=True)
y = x.sum(axis=0) # shape (4,)
y.sum().backward()
assert x.grad.shape == (3, 4)
assert np.allclose(x.grad, 1.0) # uniform gradient
If you forget to re-insert axis=0 before broadcasting, out.grad of shape (4,) broadcasts to (3, 4) only because NumPy aligns it on the right by default — but for axis=1 (output shape (3,)), the alignment would be wrong and you'd get the wrong gradient.
Case 2 — mean over axis=None¶
x = Tensor(np.random.randn(2, 3), requires_grad=True)
y = x.mean()
y.backward()
assert np.allclose(x.grad, 1.0 / 6) # uniform 1/N gradient
Case 3 — getitem with duplicate indices¶
x = Tensor(np.array([1.0, 2.0, 3.0]), requires_grad=True)
y = x[[0, 0, 1]] # shape (3,) selecting x[0] twice and x[1] once
y.sum().backward()
assert np.allclose(x.grad, [2.0, 1.0, 0.0]) # x[0] contributes twice
If you wrote x.grad[key] += out.grad, you'd get [1.0, 1.0, 0.0] (the duplicate is overwritten, not accumulated). Use np.add.at.
Case 4 — transpose of a rank-3 tensor with a non-trivial permutation¶
x = Tensor(np.random.randn(2, 3, 4), requires_grad=True)
y = x.transpose((2, 0, 1)) # shape (4, 2, 3)
y.sum().backward()
assert x.grad.shape == (2, 3, 4)
The inverse permutation: np.argsort((2, 0, 1)) = (1, 2, 0). Backward applies out.grad.transpose((1, 2, 0)) to recover the (2, 3, 4) shape.
Case 5 — cat of two tensors with different sizes on the cat axis¶
a = Tensor(np.random.randn(2, 3), requires_grad=True)
b = Tensor(np.random.randn(2, 5), requires_grad=True)
y = cat([a, b], axis=1) # shape (2, 8)
y.sum().backward()
assert a.grad.shape == (2, 3)
assert b.grad.shape == (2, 5)
The backward must remember the per-tensor size along the cat axis to slice correctly. Cache [t.shape[axis] for t in tensors] in the closure.
Constraints¶
- No new external deps. NumPy + standard library only.
- No in-place ops.
reshapealways returns a newTensor(itsdatamay be a view, but the Tensor identity is new — important for the DAG). broadcast_toreturns a view ofself.data. Document this. The forward shares memory with the parent; the backward re-allocates via_unbroadcast. This is the pattern PyTorch uses withexpand(vsrepeat, which copies).
Test patterns¶
Reduction cross-check (one example)¶
def test_sum_axis_pytorch_cross():
rng = np.random.default_rng(42)
a = rng.standard_normal((4, 5)).astype(np.float64)
A = Tensor(a, requires_grad=True)
L = A.sum(axis=0).sum() # scalar
L.backward()
At = torch.tensor(a, dtype=torch.float64, requires_grad=True)
Lt = At.sum(dim=0).sum()
Lt.backward()
np.testing.assert_allclose(A.grad, At.grad.numpy(), rtol=1e-7)
Shape gradcheck (one example, grammar-flavoured)¶
def test_reshape_grammar_grid_gradcheck():
# (3, 5) person-tense logits flattened to (15,) and back.
rng = np.random.default_rng(0)
grid = Tensor(rng.standard_normal((3, 5)), requires_grad=True)
f = lambda: grid.reshape((15,)).sum()
assert gradcheck(f, grid, eps=1e-6, atol=1e-4)
Indexing test (grammar-flavoured)¶
def test_getitem_select_third_person():
# Pick out third-person row from (3, 5) logits.
logits = Tensor(np.random.default_rng(0).standard_normal((3, 5)), requires_grad=True)
out = logits[2].sum() # third person, all tenses
out.backward()
expected = np.zeros((3, 5))
expected[2] = 1.0
np.testing.assert_allclose(logits.grad, expected)
Stop conditions¶
Done when:
- All 7 ops implemented.
- Each has ≥ 2 tests (PyTorch cross + gradcheck). All green.
- Tricky cases 1–5 all green.
mypy --strictclean.catandstackare typed as module-levelCallable[[list[Tensor], int], Tensor].- You can answer: "why does
broadcast_to.backwardneed_unbroadcastbutreshape.backwarddoesn't?"
Pitfalls¶
keepdims=Falsereduction backward. The upstream gradient is rank-reduced; you must re-insert the reduced axes as size-1 before broadcasting. Forgetting this gives a silently wrong gradient (wrong sum direction).transposeinverse permutation.np.argsort(axes)is the correct inverse for any permutation. Hand-deriving the inverse is error-prone — useargsort.getitemwith boolean masks. Usenp.add.atfor the backward; assignment-with-mask works in forward but is fragile in backward.cat/stackparent identity. If the same tensor appears in the input list twice (cat([a, a], axis=0)),a.gradmust accumulate from both slices. Test for this — the bug is silent.broadcast_tois a view;repeatis a copy. Don't confuse them. We implementbroadcast_tobecause every elementwise op already broadcasts implicitly;repeatis unnecessary in Phase 8.
When to consult solutions/¶
After all 7 ops pass all tests. solutions/02-reduction-and-shape-ops-ref.md will compare your _unbroadcast/reshape closures against the canonical implementations.
Next lab: lab/03-matmul-softmax-ce.md.