Break — Replace += with = in _backward

The canonical autograd break: change parent.grad += ... to parent.grad = .... The linear tests pass; the diamond test fails with a number that looks "almost right", which makes it especially instructive.

Target: src/minigrad/scalar.py Value._backward closures, or your hand-written equivalent.

Hypothesis

The learner predicts: "Replacing += with = leaves any computation in which each variable is used exactly once unaffected (each closure runs once, so the assignment is the only contribution), but silently produces wrong gradients on diamond patterns, where a variable reaches the output through more than one path: only the last contribution written is kept. The diamond test from theory/03-worked-backprop.md is the canonical failing case."

The break

In every _backward closure, change:

 def _backward():
-    a.grad += local_grad_a * out.grad
-    b.grad += local_grad_b * out.grad
+    a.grad = local_grad_a * out.grad
+    b.grad = local_grad_b * out.grad

Apply this to all ops (add, mul, sub, etc.) — partial breaks are worse than complete ones because they make diagnosis even harder.

Run procedure

The diamond test from theory §03:

uv run python -c "
from src.minigrad.scalar import Value

a = Value(2.0, label='a')
b = Value(3.0, label='b')
c = Value(4.0, label='c')

# L = (a*b + c) * (a - c)
L = (a * b + c) * (a - c)
L.backward()

print(f'a.grad = {a.grad}  (expected 4)')
print(f'b.grad = {b.grad}  (expected -4)')
print(f'c.grad = {c.grad}  (expected -12)')
"

For comparison, a linear (non-diamond) test:

uv run python -c "
from src.minigrad.scalar import Value

x = Value(3.0)
y = Value(5.0)
# z = x*y + x would use x twice, which is already a diamond.
# Use a graph in which each variable appears exactly once instead:
z = x * y          # x and y each appear ONCE
z.backward()
print(f'x.grad = {x.grad}  (expected 5)')
print(f'y.grad = {y.grad}  (expected 3)')
"

Expected failure mode

With the break in place:

Linear (one occurrence):
  x.grad = 5         <-- correct
  y.grad = 3         <-- correct

Diamond:
  a.grad = -6        <-- WRONG, should be 4 (kept only the last contribution from ab._backward)
  b.grad = -4        <-- correct (b appears once)
  c.grad = -2        <-- WRONG, should be -12 (kept only the last from e._backward)

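Subtle: write e = a·b + c and f = a - c, so L = e·f. Then a.grad = -6 is the contribution from the path through e only; the contribution through f (+10) was overwritten. Likewise c.grad = -2 is the contribution through e; the contribution through f (-10) was overwritten. Which contribution survives depends on the order in which the _backward closures run; the values shown assume f's closure runs before e's, which matches the 10 -> -6 trace in the Diagnostic section.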

Two of three gradients are wrong, but the values look plausible (not nan, not huge). The bug ships if you don't have a diamond test.
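
The failure mode above can be reproduced without the repo. The following is a minimal sketch, not the actual src/minigrad/scalar.py: this Value supports only +, -, and *, the ACCUMULATE switch and the write_grad helper are hypothetical names introduced for the demonstration, and parents are stored in a tuple so that the traversal order (and therefore which contribution survives) is fixed.

```python
# Minimal sketch under stated assumptions; not the repo's Value class.
ACCUMULATE = True  # hypothetical switch: True is +=, False is the break (=)


def write_grad(parent, contribution):
    # The one line under test.
    if ACCUMULATE:
        parent.grad += contribution
    else:
        parent.grad = contribution


class Value:
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents            # tuple keeps traversal order fixed
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            write_grad(self, 1.0 * out.grad)
            write_grad(other, 1.0 * out.grad)
        out._backward = _backward
        return out

    def __sub__(self, other):
        out = Value(self.data - other.data, (self, other))
        def _backward():
            write_grad(self, 1.0 * out.grad)
            write_grad(other, -1.0 * out.grad)
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            write_grad(self, other.data * out.grad)
            write_grad(other, self.data * out.grad)
        out._backward = _backward
        return out

    def backward(self):
        # Topological sort, then run closures from the output back to the leaves.
        topo, seen = [], set()
        def build(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p in v._parents:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()


def diamond_grads(accumulate):
    global ACCUMULATE
    ACCUMULATE = accumulate
    a, b, c = Value(2.0), Value(3.0), Value(4.0)
    L = (a * b + c) * (a - c)
    L.backward()
    return a.grad, b.grad, c.grad


print(diamond_grads(True))    # (4.0, -4.0, -12.0)
print(diamond_grads(False))   # (-6.0, -4.0, -2.0)
```

If the real implementation stores parents in a set, the overwritten values may differ between runs, but each one will still be a single-path contribution rather than the sum.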

Diagnostic

From logs alone:

  1. Compare against PyTorch's torch.autograd.grad on the same expression. It returns 4 for a and -12 for c. Diffing those against your implementation's a.grad and c.grad immediately reveals the discrepancy.
  2. Inspect a.grad before the last _backward runs. If you instrument _backward to print a.grad before and after assignment, you will see a.grad: 10 -> -6, i.e., the value was replaced rather than accumulated (a correct implementation shows 10 -> 4).
  3. Run the standard test from theory/03. That test is the hand-computed diamond; it exists precisely to catch this bug.
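
The first check in the list needs an independent reference for the gradients. The list suggests PyTorch; the sketch below swaps in central finite differences instead, which needs only the standard library. The helper names (loss, central_diff) and the tolerance are assumptions made for illustration, and the suspect values are the ones reported under "Expected failure mode".

```python
def loss(a, b, c):
    # Same expression as the diamond test: L = (a*b + c) * (a - c)
    return (a * b + c) * (a - c)


def central_diff(f, args, i, h=1e-6):
    # Estimate the partial derivative of f with respect to args[i].
    hi, lo = list(args), list(args)
    hi[i] += h
    lo[i] -= h
    return (f(*hi) - f(*lo)) / (2 * h)


point = (2.0, 3.0, 4.0)
reference = [central_diff(loss, point, i) for i in range(3)]

# Gradients reported by the broken implementation for a, b, c.
suspect = (-6.0, -4.0, -2.0)

mismatched = [
    name
    for name, got, ref in zip("abc", suspect, reference)
    if abs(got - ref) > 1e-4
]
print([round(g, 4) for g in reference])  # close to [4.0, -4.0, -12.0]
print(mismatched)                        # ['a', 'c']
```

A numerical check like this is slower and less precise than an analytic reference, but it has no shared code with the implementation under test, so it cannot share the bug.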

Lesson

Reverse-mode AD rests on the multivariable chain rule: every use of a variable in the computation graph contributes one term to its gradient, and the gradient is the sum of those terms (the sum over paths). The operation that turns "sum over paths" into code is += on .grad. Replace += with = and you lose the sum-over-paths semantics: each local derivative is still computed correctly, but only the last contribution written survives.
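
As a worked instance of the sum over paths, take the diamond test with the intermediate names e = ab + c and f = a - c (so L = e f), evaluated at a = 2, b = 3, c = 4. The names e and f are shorthand chosen here; the numbers follow from the expression itself.

```latex
\begin{aligned}
\frac{\partial L}{\partial a}
  &= \frac{\partial L}{\partial e}\,\frac{\partial e}{\partial a}
   + \frac{\partial L}{\partial f}\,\frac{\partial f}{\partial a}
   = (-2)(3) + (10)(1) = 4 \\
\frac{\partial L}{\partial c}
  &= \frac{\partial L}{\partial e}\,\frac{\partial e}{\partial c}
   + \frac{\partial L}{\partial f}\,\frac{\partial f}{\partial c}
   = (-2)(1) + (10)(-1) = -12
\end{aligned}
```

With = in place of +=, only one term of each sum is kept. In the run reported under "Expected failure mode" it is the first term, giving -6 and -2.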

This is also why zero_grad must be called between training steps: the += that enables diamond accumulation also means gradients from step N persist into step N+1 unless explicitly cleared. The same operator, two failure modes.
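
The persistence across steps can be seen with a deliberately tiny sketch. Param, backward_square, and this zero_grad are hypothetical stand-ins written for the example, not the repo's API; the only thing they share with a real _backward is the += on .grad.

```python
class Param:
    """Hypothetical stand-in for a leaf Value: data plus an accumulated grad."""

    def __init__(self, data):
        self.data = data
        self.grad = 0.0


def backward_square(p):
    # Backward pass for loss = p.data ** 2, accumulating like a correct _backward.
    p.grad += 2.0 * p.data


def zero_grad(params):
    # Reset every accumulated gradient before the next backward pass.
    for p in params:
        p.grad = 0.0


w = Param(3.0)
backward_square(w)   # step N
backward_square(w)   # step N+1, gradients not cleared
stale = w.grad       # 12.0: two steps' worth of gradient

zero_grad([w])
backward_square(w)
fresh = w.grad       # 6.0: the gradient of this step alone

print(stale, fresh)
```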

References

  • Griewank & Walther, Evaluating Derivatives, §3 (the formal statement of reverse-mode AD as sum-over-paths).
  • Karpathy, micrograd (the codebase Phase 7's minigrad is modeled on) — its _backward closures use += deliberately; reading the source after the break is reverted is a 5-minute confirmation exercise.