English · Español

03 — Worked backprop, by hand¶

🇪🇸 Una expresión, dos páginas. Calculamos L = (a·b + c) · (a - c) con a=2, b=3, c=4, primero por forward, luego derivamos ∂L/∂a, ∂L/∂b, ∂L/∂c por dos métodos: (i) regla de la cadena simbólica, (ii) recorrido inverso del grafo. Los dos resultados deben coincidir. Y luego verificamos con diferencias finitas.

The expression¶

\[ L = (a \cdot b + c) \cdot (a - c) \]

with a = 2.0, b = 3.0, c = 4.0.

This expression is small enough to do by hand, large enough to feature a shared variable (a appears in two places — the diamond pattern from 01-computation-graphs.md).

Step 1: Forward pass, by hand¶

Compute intermediate values:

ab = a · b = 2 · 3 = 6
e = ab + c = 6 + 4 = 10 (left factor)
f = a - c = 2 - 4 = -2 (right factor)
L = e · f = 10 · (-2) = -20

Step 2: Backward by symbolic chain rule¶

Compute each partial via the chain rule directly:

∂L/∂a:

L depends on a through two paths: - via e: e = ab + c = a·b + c, so ∂e/∂a = b. Then ∂L/∂a (via e) = ∂L/∂e · ∂e/∂a = f · b = -2 · 3 = -6. - via f: f = a - c, so ∂f/∂a = 1. Then ∂L/∂a (via f) = ∂L/∂f · ∂f/∂a = e · 1 = 10.

Total: ∂L/∂a = -6 + 10 = 4.

∂L/∂b:

b appears only in ab = a · b → e. ∂e/∂b = a. Then ∂L/∂b = ∂L/∂e · ∂e/∂b = f · a = -2 · 2 = -4.

∂L/∂c:

c appears in e (as ab + c) and in f (as a - c). - via e: ∂e/∂c = 1. Contribution: ∂L/∂e · 1 = f = -2. - via f: ∂f/∂c = -1. Contribution: ∂L/∂f · (-1) = -e = -10.

Total: ∂L/∂c = -2 + (-10) = -12.

Gradients (symbolic): ∂L/∂a = 4, ∂L/∂b = -4, ∂L/∂c = -12.

Step 3: Backward by graph traversal¶

Now do it the way minigrad will. First, draw the graph (parent → child):

a ─┬─> ab ─┐
   │       v
   │      (+) ──> e ──┐
   │       ^          v
b ─┘       │         (·) ──> L
c ─┬───────┘          ^
   │                  │
   └──> f ───────────┘
        ^
a ──────┘  (same a as above)
        ^
c ──(-)─┘  (same c as above; f = a - c)

Nodes (in graph order): a, b, c, ab, e, f, L — 7 nodes. Ops: ab=a·b, e=ab+c, f=a-c, L=e·f.

Topological order (parents before children, one valid order): [a, b, c, ab, e, f, L]. (Any order where each node comes after all its parents is valid.)

Reverse: [L, f, e, ab, c, b, a].

Seed: L.grad = 1 (because ∂L/∂L = 1). All other .grad = 0 initially.

Walk in reverse:

1. L._backward(): L = e · f. Mul backward: - e.grad += f.data · L.grad = -2 · 1 = -2. Now e.grad = -2. - f.grad += e.data · L.grad = 10 · 1 = 10. Now f.grad = 10.

2. f._backward(): f = a - c. Sub backward: - a.grad += 1 · f.grad = 10. Now a.grad = 10. - c.grad += -1 · f.grad = -10. Now c.grad = -10.

3. e._backward(): e = ab + c. Add backward: - ab.grad += 1 · e.grad = -2. Now ab.grad = -2. - c.grad += 1 · e.grad = -2. Now c.grad = -10 + (-2) = -12. (Diamond accumulation!)

4. ab._backward(): ab = a · b. Mul backward: - a.grad += b.data · ab.grad = 3 · (-2) = -6. Now a.grad = 10 + (-6) = 4. (Diamond accumulation!) - b.grad += a.data · ab.grad = 2 · (-2) = -4. Now b.grad = -4.

5. c._backward(): leaf, no-op. 6. b._backward(): leaf, no-op. 7. a._backward(): leaf, no-op.

Final gradients: a.grad = 4, b.grad = -4, c.grad = -12.

These match the symbolic answers exactly. ✓

Step 4: Verify with finite differences¶

Pick ε = 1e-5. The central-difference gradient for, say, ∂L/∂a is:

\[ \frac{\partial L}{\partial a} \approx \frac{L(a + \epsilon) - L(a - \epsilon)}{2\epsilon} \]

Compute: - L(2.0 + 1e-5) = (2.00001 · 3 + 4) · (2.00001 - 4) = (6.00003 + 4) · (-1.99999) = 10.00003 · -1.99999 ≈ -20.00003 - 0.00006 ≈ -20.000060...

Let me redo this more carefully: - ab = 2.00001 · 3 = 6.00003 - e = 6.00003 + 4 = 10.00003 - f = 2.00001 - 4 = -1.99999 - L⁺ = 10.00003 · (-1.99999) = -20.00006 + tiny terms

And L⁻: - ab = 1.99999 · 3 = 5.99997 - e = 9.99997 - f = -2.00001 - L⁻ = 9.99997 · -2.00001 ≈ -20.00 + 0.00006... ≈ -19.99994

Difference: L⁺ - L⁻ ≈ -20.00006 - (-19.99994) = -0.00012. Divide by 2ε = 2e-5: -0.00012 / 2e-5 = -6.

Wait — that gives -6, not 4. Did I mess up?

Let me redo more carefully with explicit numbers. a = 2 + ε where ε = 1e-5:

ab = (2+ε) · 3 = 6 + 3ε
e = 6 + 3ε + 4 = 10 + 3ε
f = (2 + ε) - 4 = -2 + ε
L⁺ = (10 + 3ε)(-2 + ε) = -20 + 10ε - 6ε + 3ε² = -20 + 4ε + O(ε²)

For a = 2 - ε: - ab = 6 - 3ε - e = 10 - 3ε - f = -2 - ε - L⁻ = (10 - 3ε)(-2 - ε) = -20 - 10ε + 6ε + 3ε² = -20 - 4ε + O(ε²)

L⁺ - L⁻ = 8ε + O(ε³). Divide by 2ε: 4. ✓

(My arithmetic in the first pass was sloppy. The clean symbolic derivation gives 4, matching the graph traversal and the chain-rule symbolic answer.)

Apply the same finite-difference check to ∂L/∂b and ∂L/∂c — same drill. You should get -4 and -12.

Three lessons from this worked example¶

Lesson 1: graph traversal == symbolic chain rule == finite differences¶

All three give the same answer. They had better — they're three views of the same identity.

Symbolic chain rule: the math, on paper.
Graph traversal: the algorithm, mechanized.
Finite differences: the empirical check.

If your minigrad produces a gradient that disagrees with finite differences, your _backward is wrong. If it produces a gradient that disagrees with PyTorch, your _backward is wrong (PyTorch is the oracle). Tests in Phase 7 use PyTorch as the primary oracle and finite differences as a sanity backup (Phase 8 will use gradcheck more centrally).

Lesson 2: diamond accumulation is the whole point of `+=`¶

a.grad got contributions from two places: -6 via ab and +10 via f. They summed to 4. If we had used = instead of += in _backward, only the last contribution would have stuck — 4 (because a._backward ran last in our traversal). It would have looked right by luck. For c.grad, the last contribution was -2, but the true answer is -12. With =, we'd silently get -2. Test that.

Lesson 3: topological order matters¶

The reverse traversal must visit a node only after all its downstream children have run their _backward. Otherwise the node's .grad is still partial when its own _backward reads it, and downstream parents get the wrong contribution.

In our example, when ab._backward() runs, it reads ab.grad. That .grad was set by e._backward(), which ran in step 3. If we'd somehow processed ab before e, ab.grad would still be 0 and a.grad += 3·0 = 0 would be the contribution — wrong.

The topological-sort step in backward() guarantees this. Don't skimp on it.

Lab preview¶

Lab 02 will ask you to: 1. Compute L = (a·b + c)·(a - c) with these exact values using minigrad.scalar. 2. Call L.backward(). 3. Assert a.grad == 4, b.grad == -4, c.grad == -12 (to floating point tolerance).

If those three assertions pass, your autograd works on the diamond pattern. That's the most important single test in the phase.

One-paragraph recap¶

Hand-tracing L = (a·b + c)·(a - c) with the diamond a and the diamond c gives ∂L/∂a = 4, ∂L/∂b = -4, ∂L/∂c = -12, verifiable by both symbolic chain rule and by graph traversal with += accumulation in topological reverse order. Finite differences confirm the same numbers. The two diamond accumulations — a.grad = 10 - 6 = 4 and c.grad = -10 - 2 = -12 — are exactly what would silently fail if _backward used = instead of +=. This is the single most pedagogically important test case in Phase 7.

Next: 04-reverse-mode-vs-forward-mode.md