Skip to content

English · Español

Lab 02 — Reduction and shape ops

Goal: add the reduction + shape family to Tensor: sum mean reshape transpose broadcast_to getitem cat stack. The "shape" ops are no-ops in the math sense (local Jacobian is identity) but require careful index bookkeeping in backward. The "reduction" ops are the converse: math is trivial, but the backward must broadcast_to the upstream gradient back to the input shape.

Estimated time: 3–4 hours.

Prereqs: Labs 00 + 01. Theory 02-tensor-op-derivatives.md re-read (especially the reduction backward derivation).


What you produce

  • All 7 ops added to src/minitorch/tensor.py.
  • tests/test_reductions.py with cross-check + gradcheck per reduction op.
  • tests/test_shape_ops.py with shape-roundtrip tests + gradcheck per shape op.

The shape-contract reminder

After every _backward, parent.grad.shape == parent.data.shape. For reduction ops, this means broadcasting the upstream gradient back to the input shape. For shape ops, this means inverting the shape transformation on the gradient.

🇪🇸 La regla mental: una reduction colapsa ejes hacia adelante, y los expande hacia atrás; una reshape reorganiza el layout, y el backward solo desreorganiza. No hay álgebra nueva, solo cuidar el shape.

TODOs (per op)

For each op below, do the five things from Lab 01 (forward, backward, PyTorch test, gradcheck test, shape-edge test). Shape ops do not need broadcasting tests.

Block A — reductions

  • sum(self, axis=None, keepdims=False): forward self.data.sum(axis=..., keepdims=...). Backward: the upstream gradient has shape equal to the forward output; you must broadcast_to(self.shape) after potentially re-inserting the reduced axes.
  • Implementation hint: if keepdims=False and axis is not None, the upstream out.grad has rank self.data.ndim - len(axes). Use np.expand_dims on each reduced axis to restore the reduced dims as size-1, then np.broadcast_to(self.shape).
  • When axis=None, the output is a scalar; backward is np.broadcast_to(out.grad, self.shape).

  • mean(self, axis=None, keepdims=False): forward self.data.mean(axis=..., keepdims=...). Backward identical to sum but divide by N, where N is the number of elements reduced. Cache N in the closure.

Block B — pure shape ops

  • reshape(self, new_shape: tuple[int, ...]): forward self.data.reshape(new_shape). Backward self.grad += out.grad.reshape(self.shape). No size change, no math.
  • transpose(self, axes: tuple[int, ...] | None = None): forward self.data.transpose(axes). Backward: transpose the upstream by the inverse permutation. Hint: if axes = (1, 0, 2), the inverse is argsort(axes) = (1, 0, 2) (involution here); in general use np.argsort(axes).
  • broadcast_to(self, shape: tuple[int, ...]): forward np.broadcast_to(self.data, shape) (returns a view; if your downstream ops mutate, make a copy — but they don't mutate, so view is fine). Backward: this is the only shape op that needs _unbroadcast — the gradient comes back at shape and must be reduced to self.shape.

Block C — indexing/concat

  • __getitem__(self, key): forward self.data[key]. Backward: allocate np.zeros_like(self.data), scatter the upstream gradient at key. Hint: grad[key] += out.grad works for basic slices and fancy indexing without duplicate indices. If key could contain duplicates (e.g. self[[0, 0, 1]]), use np.add.at(grad, key, out.grad) to ensure accumulation.
  • cat(tensors: list[Tensor], axis: int) (module-level function, not a method): forward np.concatenate([t.data for t in tensors], axis=axis). Backward: slice the upstream gradient along axis according to each tensor's size on that axis; contribute each slice to the matching parent.
  • stack(tensors: list[Tensor], axis: int): forward np.stack([t.data for t in tensors], axis=axis). Backward: split along the new axis and squeeze.

Tricky cases to cover

Case 1 — sum with keepdims=False over a single axis

x = Tensor(np.random.randn(3, 4), requires_grad=True)
y = x.sum(axis=0)      # shape (4,)
y.sum().backward()
assert x.grad.shape == (3, 4)
assert np.allclose(x.grad, 1.0)  # uniform gradient

If you forget to re-insert axis=0 before broadcasting, out.grad of shape (4,) broadcasts to (3, 4) only because NumPy aligns it on the right by default — but for axis=1 (output shape (3,)), the alignment would be wrong and you'd get the wrong gradient.

Case 2 — mean over axis=None

x = Tensor(np.random.randn(2, 3), requires_grad=True)
y = x.mean()
y.backward()
assert np.allclose(x.grad, 1.0 / 6)  # uniform 1/N gradient

Case 3 — getitem with duplicate indices

x = Tensor(np.array([1.0, 2.0, 3.0]), requires_grad=True)
y = x[[0, 0, 1]]   # shape (3,) selecting x[0] twice and x[1] once
y.sum().backward()
assert np.allclose(x.grad, [2.0, 1.0, 0.0])  # x[0] contributes twice

If you wrote x.grad[key] += out.grad, you'd get [1.0, 1.0, 0.0] (the duplicate is overwritten, not accumulated). Use np.add.at.

Case 4 — transpose of a rank-3 tensor with a non-trivial permutation

x = Tensor(np.random.randn(2, 3, 4), requires_grad=True)
y = x.transpose((2, 0, 1))   # shape (4, 2, 3)
y.sum().backward()
assert x.grad.shape == (2, 3, 4)

The inverse permutation: np.argsort((2, 0, 1)) = (1, 2, 0). Backward applies out.grad.transpose((1, 2, 0)) to recover the (2, 3, 4) shape.

Case 5 — cat of two tensors with different sizes on the cat axis

a = Tensor(np.random.randn(2, 3), requires_grad=True)
b = Tensor(np.random.randn(2, 5), requires_grad=True)
y = cat([a, b], axis=1)      # shape (2, 8)
y.sum().backward()
assert a.grad.shape == (2, 3)
assert b.grad.shape == (2, 5)

The backward must remember the per-tensor size along the cat axis to slice correctly. Cache [t.shape[axis] for t in tensors] in the closure.

Constraints

  • No new external deps. NumPy + standard library only.
  • No in-place ops. reshape always returns a new Tensor (its data may be a view, but the Tensor identity is new — important for the DAG).
  • broadcast_to returns a view of self.data. Document this. The forward shares memory with the parent; the backward re-allocates via _unbroadcast. This is the pattern PyTorch uses with expand (vs repeat, which copies).

Test patterns

Reduction cross-check (one example)

def test_sum_axis_pytorch_cross():
    rng = np.random.default_rng(42)
    a = rng.standard_normal((4, 5)).astype(np.float64)
    A = Tensor(a, requires_grad=True)
    L = A.sum(axis=0).sum()  # scalar
    L.backward()
    At = torch.tensor(a, dtype=torch.float64, requires_grad=True)
    Lt = At.sum(dim=0).sum()
    Lt.backward()
    np.testing.assert_allclose(A.grad, At.grad.numpy(), rtol=1e-7)

Shape gradcheck (one example, grammar-flavoured)

def test_reshape_grammar_grid_gradcheck():
    # (3, 5) person-tense logits flattened to (15,) and back.
    rng = np.random.default_rng(0)
    grid = Tensor(rng.standard_normal((3, 5)), requires_grad=True)
    f = lambda: grid.reshape((15,)).sum()
    assert gradcheck(f, grid, eps=1e-6, atol=1e-4)

Indexing test (grammar-flavoured)

def test_getitem_select_third_person():
    # Pick out third-person row from (3, 5) logits.
    logits = Tensor(np.random.default_rng(0).standard_normal((3, 5)), requires_grad=True)
    out = logits[2].sum()         # third person, all tenses
    out.backward()
    expected = np.zeros((3, 5))
    expected[2] = 1.0
    np.testing.assert_allclose(logits.grad, expected)

Stop conditions

Done when:

  1. All 7 ops implemented.
  2. Each has ≥ 2 tests (PyTorch cross + gradcheck). All green.
  3. Tricky cases 1–5 all green.
  4. mypy --strict clean. cat and stack are typed as module-level Callable[[list[Tensor], int], Tensor].
  5. You can answer: "why does broadcast_to.backward need _unbroadcast but reshape.backward doesn't?"

Pitfalls

  • keepdims=False reduction backward. The upstream gradient is rank-reduced; you must re-insert the reduced axes as size-1 before broadcasting. Forgetting this gives a silently wrong gradient (wrong sum direction).
  • transpose inverse permutation. np.argsort(axes) is the correct inverse for any permutation. Hand-deriving the inverse is error-prone — use argsort.
  • getitem with boolean masks. Use np.add.at for the backward; assignment-with-mask works in forward but is fragile in backward.
  • cat/stack parent identity. If the same tensor appears in the input list twice (cat([a, a], axis=0)), a.grad must accumulate from both slices. Test for this — the bug is silent.
  • broadcast_to is a view; repeat is a copy. Don't confuse them. We implement broadcast_to because every elementwise op already broadcasts implicitly; repeat is unnecessary in Phase 8.

When to consult solutions/

After all 7 ops pass all tests. solutions/02-reduction-and-shape-ops-ref.md will compare your _unbroadcast/reshape closures against the canonical implementations.


Next lab: lab/03-matmul-softmax-ce.md.