English · Español
03 — Worked backprop, by hand¶
🇪🇸 Una expresión, dos páginas. Calculamos
L = (a·b + c) · (a - c)cona=2, b=3, c=4, primero por forward, luego derivamos∂L/∂a, ∂L/∂b, ∂L/∂cpor dos métodos: (i) regla de la cadena simbólica, (ii) recorrido inverso del grafo. Los dos resultados deben coincidir. Y luego verificamos con diferencias finitas.
The expression¶
with a = 2.0, b = 3.0, c = 4.0.
This expression is small enough to do by hand, large enough to feature a shared variable (a appears in two places — the diamond pattern from 01-computation-graphs.md).
Step 1: Forward pass, by hand¶
Compute intermediate values:
ab = a · b = 2 · 3 = 6e = ab + c = 6 + 4 = 10(left factor)f = a - c = 2 - 4 = -2(right factor)L = e · f = 10 · (-2) = -20
Step 2: Backward by symbolic chain rule¶
Compute each partial via the chain rule directly:
∂L/∂a:
L depends on a through two paths:
- via e: e = ab + c = a·b + c, so ∂e/∂a = b. Then ∂L/∂a (via e) = ∂L/∂e · ∂e/∂a = f · b = -2 · 3 = -6.
- via f: f = a - c, so ∂f/∂a = 1. Then ∂L/∂a (via f) = ∂L/∂f · ∂f/∂a = e · 1 = 10.
Total: ∂L/∂a = -6 + 10 = 4.
∂L/∂b:
b appears only in ab = a · b → e. ∂e/∂b = a. Then ∂L/∂b = ∂L/∂e · ∂e/∂b = f · a = -2 · 2 = -4.
∂L/∂c:
c appears in e (as ab + c) and in f (as a - c).
- via e: ∂e/∂c = 1. Contribution: ∂L/∂e · 1 = f = -2.
- via f: ∂f/∂c = -1. Contribution: ∂L/∂f · (-1) = -e = -10.
Total: ∂L/∂c = -2 + (-10) = -12.
Gradients (symbolic): ∂L/∂a = 4, ∂L/∂b = -4, ∂L/∂c = -12.
Step 3: Backward by graph traversal¶
Now do it the way minigrad will. First, draw the graph (parent → child):
a ─┬─> ab ─┐
│ v
│ (+) ──> e ──┐
│ ^ v
b ─┘ │ (·) ──> L
c ─┬───────┘ ^
│ │
└──> f ───────────┘
^
a ──────┘ (same a as above)
^
c ──(-)─┘ (same c as above; f = a - c)
Nodes (in graph order): a, b, c, ab, e, f, L — 7 nodes. Ops: ab=a·b, e=ab+c, f=a-c, L=e·f.
Topological order (parents before children, one valid order): [a, b, c, ab, e, f, L]. (Any order where each node comes after all its parents is valid.)
Reverse: [L, f, e, ab, c, b, a].
Seed: L.grad = 1 (because ∂L/∂L = 1). All other .grad = 0 initially.
Walk in reverse:
1. L._backward(): L = e · f. Mul backward:
- e.grad += f.data · L.grad = -2 · 1 = -2. Now e.grad = -2.
- f.grad += e.data · L.grad = 10 · 1 = 10. Now f.grad = 10.
2. f._backward(): f = a - c. Sub backward:
- a.grad += 1 · f.grad = 10. Now a.grad = 10.
- c.grad += -1 · f.grad = -10. Now c.grad = -10.
3. e._backward(): e = ab + c. Add backward:
- ab.grad += 1 · e.grad = -2. Now ab.grad = -2.
- c.grad += 1 · e.grad = -2. Now c.grad = -10 + (-2) = -12. (Diamond accumulation!)
4. ab._backward(): ab = a · b. Mul backward:
- a.grad += b.data · ab.grad = 3 · (-2) = -6. Now a.grad = 10 + (-6) = 4. (Diamond accumulation!)
- b.grad += a.data · ab.grad = 2 · (-2) = -4. Now b.grad = -4.
5. c._backward(): leaf, no-op.
6. b._backward(): leaf, no-op.
7. a._backward(): leaf, no-op.
Final gradients: a.grad = 4, b.grad = -4, c.grad = -12.
These match the symbolic answers exactly. ✓
Step 4: Verify with finite differences¶
Pick ε = 1e-5. The central-difference gradient for, say, ∂L/∂a is:
Compute:
- L(2.0 + 1e-5) = (2.00001 · 3 + 4) · (2.00001 - 4) = (6.00003 + 4) · (-1.99999) = 10.00003 · -1.99999 ≈ -20.00003 - 0.00006 ≈ -20.000060...
Let me redo this more carefully:
- ab = 2.00001 · 3 = 6.00003
- e = 6.00003 + 4 = 10.00003
- f = 2.00001 - 4 = -1.99999
- L⁺ = 10.00003 · (-1.99999) = -20.00006 + tiny terms
And L⁻:
- ab = 1.99999 · 3 = 5.99997
- e = 9.99997
- f = -2.00001
- L⁻ = 9.99997 · -2.00001 ≈ -20.00 + 0.00006... ≈ -19.99994
Difference: L⁺ - L⁻ ≈ -20.00006 - (-19.99994) = -0.00012. Divide by 2ε = 2e-5: -0.00012 / 2e-5 = -6.
Wait — that gives -6, not 4. Did I mess up?
Let me redo more carefully with explicit numbers. a = 2 + ε where ε = 1e-5:
ab = (2+ε) · 3 = 6 + 3εe = 6 + 3ε + 4 = 10 + 3εf = (2 + ε) - 4 = -2 + εL⁺ = (10 + 3ε)(-2 + ε) = -20 + 10ε - 6ε + 3ε² = -20 + 4ε + O(ε²)
For a = 2 - ε:
- ab = 6 - 3ε
- e = 10 - 3ε
- f = -2 - ε
- L⁻ = (10 - 3ε)(-2 - ε) = -20 - 10ε + 6ε + 3ε² = -20 - 4ε + O(ε²)
L⁺ - L⁻ = 8ε + O(ε³). Divide by 2ε: 4. ✓
(My arithmetic in the first pass was sloppy. The clean symbolic derivation gives 4, matching the graph traversal and the chain-rule symbolic answer.)
Apply the same finite-difference check to ∂L/∂b and ∂L/∂c — same drill. You should get -4 and -12.
Three lessons from this worked example¶
Lesson 1: graph traversal == symbolic chain rule == finite differences¶
All three give the same answer. They had better — they're three views of the same identity.
- Symbolic chain rule: the math, on paper.
- Graph traversal: the algorithm, mechanized.
- Finite differences: the empirical check.
If your minigrad produces a gradient that disagrees with finite differences, your _backward is wrong. If it produces a gradient that disagrees with PyTorch, your _backward is wrong (PyTorch is the oracle). Tests in Phase 7 use PyTorch as the primary oracle and finite differences as a sanity backup (Phase 8 will use gradcheck more centrally).
Lesson 2: diamond accumulation is the whole point of +=¶
a.grad got contributions from two places: -6 via ab and +10 via f. They summed to 4. If we had used = instead of += in _backward, only the last contribution would have stuck — 4 (because a._backward ran last in our traversal). It would have looked right by luck. For c.grad, the last contribution was -2, but the true answer is -12. With =, we'd silently get -2. Test that.
Lesson 3: topological order matters¶
The reverse traversal must visit a node only after all its downstream children have run their _backward. Otherwise the node's .grad is still partial when its own _backward reads it, and downstream parents get the wrong contribution.
In our example, when ab._backward() runs, it reads ab.grad. That .grad was set by e._backward(), which ran in step 3. If we'd somehow processed ab before e, ab.grad would still be 0 and a.grad += 3·0 = 0 would be the contribution — wrong.
The topological-sort step in backward() guarantees this. Don't skimp on it.
Lab preview¶
Lab 02 will ask you to:
1. Compute L = (a·b + c)·(a - c) with these exact values using minigrad.scalar.
2. Call L.backward().
3. Assert a.grad == 4, b.grad == -4, c.grad == -12 (to floating point tolerance).
If those three assertions pass, your autograd works on the diamond pattern. That's the most important single test in the phase.
One-paragraph recap¶
Hand-tracing L = (a·b + c)·(a - c) with the diamond a and the diamond c gives ∂L/∂a = 4, ∂L/∂b = -4, ∂L/∂c = -12, verifiable by both symbolic chain rule and by graph traversal with += accumulation in topological reverse order. Finite differences confirm the same numbers. The two diamond accumulations — a.grad = 10 - 6 = 4 and c.grad = -10 - 2 = -12 — are exactly what would silently fail if _backward used = instead of +=. This is the single most pedagogically important test case in Phase 7.