Break: Replace += with = in _backward
The canonical autograd break: replace parent.grad += ... with parent.grad = ... throughout. Linear tests pass; the diamond test fails with a number that looks "almost right", which makes it especially instructive.
Target: the Value._backward closures in src/minigrad/scalar.py, or your hand-written equivalent.
Hypothesis
The learner predicts: "Replacing += with = will leave any computation in which each variable is used only once still correct (each .grad is written exactly once, so the assignment is its only contribution), but will silently produce wrong gradients on diamond patterns (a variable used more than once), keeping only the last contribution written. The diamond test from theory/03-worked-backprop.md is the canonical failing case."
The break
In every _backward closure, change:
 def _backward():
-    a.grad += local_grad_a * out.grad
-    b.grad += local_grad_b * out.grad
+    a.grad = local_grad_a * out.grad
+    b.grad = local_grad_b * out.grad
Apply this to all ops (add, mul, sub, etc.) — partial breaks are worse than complete ones because they make diagnosis even harder.
Run procedure
The diamond test from theory/03-worked-backprop.md:
uv run python -c "
from src.minigrad.scalar import Value
a = Value(2.0, label='a')
b = Value(3.0, label='b')
c = Value(4.0, label='c')
# L = (a*b + c) * (a - c)
L = (a * b + c) * (a - c)
L.backward()
print(f'a.grad = {a.grad} (expected 4)')
print(f'b.grad = {b.grad} (expected -4)')
print(f'c.grad = {c.grad} (expected -12)')
"
For comparison, a linear (non-diamond) test:
uv run python -c "
from src.minigrad.scalar import Value
x = Value(3.0)
y = Value(5.0)
# Not z = x*y + x: there x is used twice, which already forms a diamond.
# Use an expression in which each variable is used exactly once:
z = x * y  # x appears ONCE
z.backward()
print(f'x.grad = {x.grad} (expected 5)')
print(f'y.grad = {y.grad} (expected 3)')
"
Expected failure mode
With the break in place:
Linear (one occurrence):
x.grad = 5 <-- correct
y.grad = 3 <-- correct
Diamond:
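a.grad = -6 <-- WRONG, should be 4 (only the last write, from ab._backward, survives)
b.grad = -4 <-- correct (b appears once)
c.grad = -2 <-- WRONG, should be -12 (only the last write, from e._backward, survives)
Subtle: with ab = a·b, e = ab + c = 10 and f = a - c = -2, so that L = e·f, a.grad = -6 is the contribution through e only (b · ∂L/∂e = 3 · (-2)); the +10 contributed through f was overwritten. Likewise, c.grad = -2 is the contribution through e; the -10 contributed through f was overwritten. Which write survives depends on the order in which the _backward closures run; the values shown correspond to f's closure running before those of e and ab.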
Two of three gradients are wrong, but the values look plausible (not nan, not huge). The bug ships if you don't have a diamond test.
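The failure mode above can be reproduced without the minigrad package. The following is a minimal sketch, assuming a hypothetical stand-in Value class (not the project's real code) with a class-level accumulate switch that selects between += and =. Children are stored in a tuple so the traversal order, and therefore which write survives, is deterministic.

```python
class Value:
    """Hypothetical stand-in for minigrad's Value; only +, -, * are supported."""
    accumulate = True  # True: correct `+=`; False: the broken `=`

    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._children = children          # tuple keeps traversal order fixed
        self._backward = lambda: None

    def _deposit(self, contribution):
        # The single line this exercise is about.
        if Value.accumulate:
            self.grad += contribution
        else:
            self.grad = contribution

    def _binary(self, other, data, d_self, d_other):
        out = Value(data, (self, other))

        def _backward():
            self._deposit(d_self * out.grad)
            other._deposit(d_other * out.grad)

        out._backward = _backward
        return out

    def __add__(self, other):
        return self._binary(other, self.data + other.data, 1.0, 1.0)

    def __sub__(self, other):
        return self._binary(other, self.data - other.data, 1.0, -1.0)

    def __mul__(self, other):
        return self._binary(other, self.data * other.data, other.data, self.data)

    def backward(self):
        # Post-order DFS gives a topological order; run closures in reverse.
        topo, seen = [], set()

        def build(node):
            if id(node) not in seen:
                seen.add(id(node))
                for child in node._children:
                    build(child)
                topo.append(node)

        build(self)
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()


def diamond_grads(accumulate):
    Value.accumulate = accumulate
    a, b, c = Value(2.0), Value(3.0), Value(4.0)
    L = (a * b + c) * (a - c)
    L.backward()
    return a.grad, b.grad, c.grad


print(diamond_grads(True))   # (4.0, -4.0, -12.0)
print(diamond_grads(False))  # (-6.0, -4.0, -2.0)
```

In this sketch the closure for a - c runs before the one for a * b, which is why the broken build keeps -6 for a. A different child ordering could keep the other contribution instead, but the result would still be wrong.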
Diagnostic
From logs alone:
- Compare against PyTorch's torch.autograd.grad on the same expression. It will return 4 for the gradient with respect to a and -12 for c. A diff against your implementation immediately reveals the discrepancy.
- Inspect a.grad before the last _backward runs. If you instrument _backward to print a.grad before and after assignment, you will see a.grad: 10 -> -6, i.e., the value was replaced, not augmented.
- Run the standard test from theory/03. That test is the hand-computed diamond; it exists precisely to catch this bug.
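If PyTorch is not installed, a central finite difference gives the same reference values. Note that this is a swapped-in technique, not the torch.autograd.grad comparison named above; the helper numeric_grads and the step size h are assumptions made for this sketch. Because L is at most quadratic in each variable, the central difference is exact up to floating-point rounding.

```python
def loss(a, b, c):
    # Same expression as the diamond test: L = (a*b + c) * (a - c)
    return (a * b + c) * (a - c)


def numeric_grads(f, point, h=1e-5):
    """Central-difference estimate of each partial derivative of f at `point`."""
    grads = []
    for i in range(len(point)):
        hi = list(point)
        lo = list(point)
        hi[i] += h
        lo[i] -= h
        grads.append((f(*hi) - f(*lo)) / (2 * h))
    return grads


reference = numeric_grads(loss, [2.0, 3.0, 4.0])  # approximately [4, -4, -12]

# Values reported by the broken build in "Expected failure mode".
suspect = [-6.0, -4.0, -2.0]

for name, ref, got in zip("abc", reference, suspect):
    status = "ok" if abs(ref - got) < 1e-4 else "MISMATCH"
    print(f"{name}.grad: reference {ref:.4f}, got {got:.4f} -> {status}")
```

Replacing suspect with the gradients read from your own Value objects turns this into a quick check that needs no extra dependency.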
Lesson
The point to take away about reverse-mode AD is that every use of a variable in the computation graph contributes one term to its gradient (the multivariate chain rule, written as a sum over paths). The code construct that turns "sum over paths" into an implementation is += on .grad. Replace += with = and you lose the sum-over-paths semantics: the chain rule is still applied along each path, but only the last path written contributes.
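For the diamond test, the sum over paths can be written out explicitly. The intermediate names e and f follow the ones used in the failure-mode notes; the values are those at a = 2, b = 3, c = 4.

```latex
\begin{aligned}
L &= e \cdot f, \qquad e = ab + c = 10, \qquad f = a - c = -2, \\
\frac{\partial L}{\partial a}
  &= \frac{\partial L}{\partial e}\,\frac{\partial e}{\partial a}
   + \frac{\partial L}{\partial f}\,\frac{\partial f}{\partial a}
   = (-2)(3) + (10)(1) = 4, \\
\frac{\partial L}{\partial b}
  &= \frac{\partial L}{\partial e}\,\frac{\partial e}{\partial b}
   = (-2)(2) = -4, \\
\frac{\partial L}{\partial c}
  &= \frac{\partial L}{\partial e}\,\frac{\partial e}{\partial c}
   + \frac{\partial L}{\partial f}\,\frac{\partial f}{\partial c}
   = (-2)(1) + (10)(-1) = -12.
\end{aligned}
```

With = in place of +=, only one term of each two-term sum survives: -6 for a and -2 for c in the run shown earlier, while b, which has a single path, is unaffected.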
This is also why zero_grad must be called between training steps: the += that enables diamond accumulation also means gradients from step N persist into step N+1 unless explicitly cleared. The same operator, two failure modes.
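The persistence across steps can be seen in a few lines. This is a minimal sketch, assuming a hypothetical multiplication-only Value class and a free-standing zero_grad helper; neither is taken from minigrad, whose zero_grad may have a different shape.

```python
class Value:
    """Hypothetical multiplication-only stand-in, using the correct `+=`."""

    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._backward = lambda: None

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def backward(self):
        topo, seen = [], set()

        def build(node):
            if id(node) not in seen:
                seen.add(id(node))
                for child in node._children:
                    build(child)
                topo.append(node)

        build(self)
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()


def zero_grad(params):
    # Assumed helper: reset the accumulated gradient of every parameter.
    for p in params:
        p.grad = 0.0


x, y = Value(3.0), Value(5.0)

(x * y).backward()
first = x.grad       # 5.0

(x * y).backward()   # second "step" without clearing
stale = x.grad       # 10.0: the gradient from the first step is still included

zero_grad([x, y])
(x * y).backward()
fresh = x.grad       # 5.0 again

print(first, stale, fresh)
```

The doubled value in the middle is the same += that makes the diamond correct, now summing over steps instead of over paths.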
References
- Griewank & Walther, Evaluating Derivatives, §3 (the formal statement of reverse-mode AD as sum-over-paths).
- Karpathy, micrograd (the codebase Phase 7's minigrad is modeled on). Its _backward closures use += deliberately; reading the source after the break is reverted is a 5-minute confirmation exercise.