Break — Skip the unbroadcast (axis-sum) in the add backward

Classic Phase 8 bug: forgetting that broadcasting in the forward pass implies a sum along the replicated axes in the backward pass. Without that sum, the gradient comes out with the wrong shape: either it crashes loudly or, worse, the shape happens to fit and the numbers are garbage.

Target: any add op in src/minigrad/tensor.py (or your equivalent).

Hypothesis

The learner predicts: "Without the explicit unbroadcast(grad, parent.shape) step in the backward of an op that broadcast in forward, two regimes appear: (1) when the broadcast axis was created (one operand had a smaller number of dimensions than the other), the upstream gradient's shape differs from the parent's and either crashes or silently corrupts via NumPy's quiet broadcasting on assignment; (2) when only a length-1 axis was expanded, the gradient shape happens to broadcast back, but the values are wrong by a factor of axis_length."

The break

In your Tensor.__add__:

 def _backward():
-    a.grad += unbroadcast(out.grad, a.shape)
-    b.grad += unbroadcast(out.grad, b.shape)
+    a.grad += out.grad
+    b.grad += out.grad

(unbroadcast is the helper that sums along axes where the operand was broadcast.)
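
If you have not written the helper yet, one possible shape for it is sketched below. This is an assumption based only on the one-line description above, not the reference implementation; the signature unbroadcast(grad, target_shape) matches the call in the diff, while the body and local names are illustrative.

```python
import numpy as np

def unbroadcast(grad, target_shape):
    """Sum `grad` down to `target_shape`, undoing NumPy-style broadcasting (sketch)."""
    grad = np.asarray(grad)
    # Axes that broadcasting prepended on the left: sum them away entirely.
    extra = grad.ndim - len(target_shape)
    if extra > 0:
        grad = grad.sum(axis=tuple(range(extra)))
    # Axes that existed with length 1 and were stretched: sum but keep the axis.
    stretched = tuple(
        i for i, n in enumerate(target_shape) if n == 1 and grad.shape[i] != 1
    )
    if stretched:
        grad = grad.sum(axis=stretched, keepdims=True)
    return grad

print(unbroadcast(np.ones((4, 3)), (3,)))          # [4. 4. 4.]
print(unbroadcast(np.ones((3, 5)), (3, 1)).shape)  # (3, 1)
```

The two steps correspond to the two ways an operand can be broadcast: gaining new leading axes, and having an existing length-1 axis stretched.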

Run procedure

Two test cases that exercise both regimes:

uv run python -c "
import numpy as np
from src.minigrad.tensor import Tensor

# --- Regime 1: dimension-creation broadcast (3,) + (4, 3) ---
a = Tensor(np.array([1., 2., 3.]), requires_grad=True)       # shape (3,)
b = Tensor(np.array([[10.,20.,30.],[40.,50.,60.],[70.,80.,90.],[100.,110.,120.]]), requires_grad=True)  # (4, 3)
c = a + b                                                     # (4, 3)
loss = c.sum()
loss.backward()
print('--- Regime 1 ---')
print(f'a.grad.shape = {a.grad.shape}  expected (3,)')
print(f'a.grad       = {a.grad}        expected [4., 4., 4.]')
print(f'b.grad.shape = {b.grad.shape}  expected (4, 3)')

# --- Regime 2: length-1 axis broadcast (3, 1) + (3, 5) ---
a2 = Tensor(np.array([[1.],[2.],[3.]]), requires_grad=True)   # (3, 1)
b2 = Tensor(np.ones((3, 5)), requires_grad=True)              # (3, 5)
c2 = a2 + b2                                                  # (3, 5)
loss2 = c2.sum()
loss2.backward()
print('--- Regime 2 ---')
print(f'a2.grad.shape = {a2.grad.shape}  expected (3, 1)')
print(f'a2.grad       = {a2.grad.ravel()}  expected [5., 5., 5.]')
"

Expected failure mode

Regime 1 — dimension-creation broadcast:

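Without unbroadcast, out.grad has shape (4, 3) but a.grad (the leaf) is expected to have shape (3,). What the += does depends on how a.grad is stored:

  • If a.grad is already a (3,) ndarray, the in-place add crashes with ValueError: non-broadcastable output operand with shape (3,) doesn't match the broadcast shape (4,3).
  • If a.grad starts as a scalar zero, or the accumulation is written out of place (a.grad = a.grad + out.grad), NumPy silently broadcasts the addition and produces a (4, 3)-shaped a.grad that no longer matches a.data.shape. The optimizer's a.data -= lr * a.grad then crashes, or an out-of-place update silently changes the shape of a.data at the next step.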
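
Which outcome you see depends on how the gradient buffer is stored and accumulated. The pure-NumPy sketch below reproduces both without any Tensor class; the variable names are illustrative and not taken from minigrad.

```python
import numpy as np

out_grad = np.ones((4, 3))          # upstream gradient, same shape as a + b

# Case A: the grad buffer is a (3,) ndarray and the accumulation is in place.
a_grad = np.zeros(3)
try:
    a_grad += out_grad              # NumPy refuses to broadcast the output operand
    in_place_raised = False
except ValueError as exc:
    in_place_raised = True
    print("in-place add raised:", exc)

# Case B: the accumulation rebinds the name instead of writing into the buffer.
b_grad = np.zeros(3)
b_grad = b_grad + out_grad
print("out-of-place shape:", b_grad.shape)    # (4, 3), no error

# Case C: the grad starts as a Python scalar, so += also rebinds.
c_grad = 0.0
c_grad += out_grad
print("scalar-start shape:", c_grad.shape)    # (4, 3), no error
```

Only Case A fails loudly. Cases B and C are the ones the shape invariant in the Diagnostic section is meant to catch.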

Regime 2 — length-1 axis broadcast:

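The shapes are broadcast-compatible: out.grad is (3, 5) and a2.grad should be (3, 1). The correct gradient sums each row of out.grad, so a2.grad[i, 0] is the sum of out.grad[i, :], which is [5., 5., 5.] for this loss. Compatibility does not rescue the break, though. NumPy never broadcasts the output of an in-place operation, so a (3, 1) ndarray a2.grad raises the same kind of ValueError as in Regime 1. If the accumulation is out of place, a2.grad silently becomes a (3, 5) array of ones, and a2.grad.ravel() prints fifteen 1s instead of [5., 5., 5.]. Because every entry is identical under a uniform loss, the wrong output is easy to misread.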

Now flip the loss to a non-uniform one:

loss2 = (c2 * Tensor(np.arange(15).reshape(3, 5).astype(np.float64))).sum()

With the break in place, a2.grad no longer matches the analytic value, which is the row-wise sum of the weight matrix. Compare against PyTorch.
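
To know what number to expect before reaching for PyTorch, the analytic gradient of this non-uniform loss can be computed by hand in NumPy. The sketch assumes only the shapes from the Regime 2 test; w and out_grad are names introduced here for illustration.

```python
import numpy as np

# The weights from the non-uniform loss: loss2 = (c2 * w).sum()
w = np.arange(15, dtype=np.float64).reshape(3, 5)

# d(loss2)/d(c2) is w itself, so this is the upstream gradient reaching the add.
out_grad = w

# Correct gradient for a2 (shape (3, 1)): sum over the stretched axis.
correct = out_grad.sum(axis=1, keepdims=True)
print(correct.ravel())     # [10. 35. 60.]

# With the break, the (3, 5) upstream gradient is passed through unsummed.
print(out_grad.shape)      # (3, 5)
```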

Diagnostic

From logs alone:

  1. Crash on Regime 1. Easy to catch but only if you have a test with a dimension-creation broadcast.
  2. Shape check after backward. Asserting tensor.grad.shape == tensor.data.shape after every backward() is a one-line invariant that catches the silent corruption.
  3. Gradcheck. Compare against finite differences on a non-uniform loss; the analytical grad disagrees with the numerical one for any tensor that was broadcast in forward.
  4. Cross-check against PyTorch. torch.autograd.grad on the same expression gives the right answer; diff your .grad vs PyTorch's.
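
The gradcheck in item 3 can be sketched without any autograd engine. The version below uses central finite differences on the Regime 2 shapes with a non-uniform loss; numerical_grad is a hypothetical helper written for this example, and in practice you would compare its result against your own a2.grad rather than the hand-computed arrays used here.

```python
import numpy as np

def numerical_grad(f, x, eps=1e-6):
    """Central finite differences of a scalar-valued f with respect to array x."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        orig = x[idx]
        x[idx] = orig + eps
        f_plus = f(x)
        x[idx] = orig - eps
        f_minus = f(x)
        x[idx] = orig                      # restore the original entry
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

a = np.array([[1.0], [2.0], [3.0]])                   # (3, 1), broadcast in forward
b = np.ones((3, 5))
w = np.arange(15, dtype=np.float64).reshape(3, 5)     # non-uniform weights

def loss(a_):
    return ((a_ + b) * w).sum()

numeric = numerical_grad(loss, a)
analytic = w.sum(axis=1, keepdims=True)   # what a correct backward should produce
broken = w                                # what the break passes through instead

print(np.allclose(numeric, analytic, atol=1e-4))   # True
print(numeric.shape == broken.shape)               # False
```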

Lesson

Forward broadcasting is virtual replication; the chain rule says the gradient w.r.t. the replicated operand is the sum over the replicated copies. NumPy does not do this sum automatically — you must call grad.sum(axis=broadcast_axes, keepdims=...) in _backward.
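
For the Regime 1 shapes, the rule can be written out explicitly. The symbols below ($a_j$, $b_{ij}$, $y_{ij}$, $L$) are notation introduced here for illustration, with $y = a + b$ and $L$ the scalar loss.

```latex
\begin{align}
y_{ij} &= a_j + b_{ij}, \qquad i = 1,\dots,4,\quad j = 1,\dots,3 \\
\frac{\partial L}{\partial a_j}
  &= \sum_{i=1}^{4} \frac{\partial L}{\partial y_{ij}}\,
     \frac{\partial y_{ij}}{\partial a_j}
   = \sum_{i=1}^{4} \frac{\partial L}{\partial y_{ij}} \\
\frac{\partial L}{\partial b_{ij}} &= \frac{\partial L}{\partial y_{ij}}
\end{align}
```

With $L = \sum_{i,j} y_{ij}$, every $\partial L / \partial y_{ij}$ equals 1, so $\partial L / \partial a_j = 4$, which is the expected [4., 4., 4.] from the Regime 1 test.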

This is the new family of bugs that Phase 8 introduces (Phase 7 had no broadcasting). The unbroadcast(grad, target_shape) helper is important enough to live in its own utility function and be tested in isolation. Phase 9's modules and Phase 10's normalization layers reuse it for every op that admits broadcasting.

References

  • Andrej Karpathy's "Neural Networks: Zero to Hero" lecture series (building micrograd, then moving to tensor-level autograd) covers this transition.
  • Goodfellow, Bengio, Courville, Deep Learning, §6.5.3 (backprop through general computational graphs, including broadcasting) — the formal sum-over-replicates rule.