Phase 7 — Quizzes

Readable mirror of data/quizzes/phase-07-scalar-autograd.yaml. Includes the "diamond" trace (a and c each appearing twice).

Source of truth: data/quizzes/phase-07-scalar-autograd.yaml.

q-07-01 — Why `+=` not `=` in `_backward` accumulation?

1. Because `+=` is faster.
2. Because a node can appear multiple times in the graph (diamond); each occurrence contributes a partial gradient that must be summed; `=` overwrites all but the last.
3. Because `=` breaks the topological sort.
4. Because `=` is not implemented for floats.

Answer

**Choice 2.** The diamond pattern is the canonical failure: `a` feeds into two children, both feeding the loss. Each contributes a partial. With `=`, only the last sticks — tests without diamonds pass, the diamond test fails.
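The failure can be reproduced by hand on a tiny diamond. The sketch below is illustrative only: the graph `p = 2*a`, `q = 5*a`, `L = p + q` and the helper names `grad_of_a` and `send` are assumptions made for this example, not part of the quiz source.

```python
def grad_of_a(accumulate: bool) -> float:
    """Backpropagate through p = 2*a, q = 5*a, L = p + q (true dL/da = 7)."""
    grads = {"a": 0.0, "p": 0.0, "q": 0.0, "L": 1.0}

    def send(node: str, partial: float) -> None:
        # The only difference between the two runs: += versus =.
        if accumulate:
            grads[node] += partial
        else:
            grads[node] = partial

    # Reverse topological order: L first, then p and q, then a.
    send("p", 1.0 * grads["L"])   # dL/dp = 1
    send("q", 1.0 * grads["L"])   # dL/dq = 1
    send("a", 2.0 * grads["p"])   # path through p
    send("a", 5.0 * grads["q"])   # path through q
    return grads["a"]


correct = grad_of_a(accumulate=True)       # both paths summed
overwritten = grad_of_a(accumulate=False)  # only the last path survives
print(correct, overwritten)  # 7.0 5.0
```

A graph with no shared node would give the same answer in both modes, which is why tests without a diamond cannot catch the bug.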

q-07-02 — Why reverse topological order? (multi-choice)

1. When N's `_backward` runs, it reads `N.grad`, which must already hold the total upstream gradient.
2. Reverse topo guarantees every child of N has already run, so `N.grad` is complete.
3. Visiting in forward order during backward propagates gradients before upstream values are known.
4. Reverse topo is required for forward-mode AD too.
5. Order doesn't matter as long as `+=` is used.

Answer

**Choices 1, 2, 3.** Choice 4 is wrong (forward-mode goes forward). Choice 5 is the common misconception: even with `+=`, a node visited too early reads an incomplete `.grad` and propagates the wrong magnitude to its parents.
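To see the misconception in Choice 5 fail, the following minimal sketch runs the same `+=`-based backward pass in both visiting orders on `(a*b + c)*(a - c)` with `a=2, b=3, c=4`. The `Value` class, `topo_order`, and `run_backward` are hypothetical names written for this illustration; the phase's actual implementation may differ.

```python
class Value:
    """Minimal scalar autograd node (illustrative, not the course's class)."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # the inputs this node was computed from
        self._backward = lambda: None   # leaves have nothing to propagate

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __sub__(self, other):
        out = Value(self.data - other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad -= out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out


def topo_order(root):
    """Depth-first topological order: every node comes after its inputs."""
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent in node.parents:
                visit(parent)
            order.append(node)
    visit(root)
    return order


def run_backward(reverse):
    a, b, c = Value(2.0), Value(3.0), Value(4.0)
    loss = (a * b + c) * (a - c)
    order = topo_order(loss)
    loss.grad = 1.0
    for node in (reversed(order) if reverse else order):
        node._backward()
    return a.grad


grad_reverse = run_backward(reverse=True)   # every node sees its full .grad
grad_forward = run_backward(reverse=False)  # nodes fire while .grad is still 0
print(grad_reverse, grad_forward)  # 4.0 0.0
```

Both runs use `+=`; only the visiting order differs. In forward order each intermediate node propagates while its own `.grad` is still zero, so nothing useful reaches `a`.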

q-07-03 — Diamond trace (free)

L = (a*b + c)*(a - c) with a=2, b=3, c=4. Compute the gradients, show the two paths to ∂L/∂a.

Answer

With `e = a·b + c = 10` and `f = a - c = -2` (so `L = e·f = -20`):

- **∂L/∂a** via `(a·b + c)`: `f · b = -2 · 3 = -6`.
- **∂L/∂a** via `(a - c)`: `e · 1 = 10 · 1 = 10`.
- **Sum:** `-6 + 10 = 4`.
- **∂L/∂b** = `f · a = -2 · 2 = -4`.
- **∂L/∂c** = `f · 1 + e · (-1) = -2 + (-10) = -12`.
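The trace can be checked with plain arithmetic. This short sketch applies the chain rule by hand; the names `e` and `f` for the two factors and the `dL_d*` variables are chosen for this example only.

```python
a, b, c = 2.0, 3.0, 4.0

# Forward pass
e = a * b + c        # first factor
f = a - c            # second factor
L = e * f

# Backward pass: L = e * f, so dL/de = f and dL/df = e
dL_de, dL_df = f, e

# a and c each feed both factors, so their two partials are summed
dL_da = dL_de * b + dL_df * 1.0
dL_db = dL_de * a
dL_dc = dL_de * 1.0 + dL_df * -1.0

print(L, dL_da, dL_db, dL_dc)  # -20.0 4.0 -4.0 -12.0
```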

q-07-04 — Finite differences vs PyTorch as oracle (free)

Answer

Finite differences carry two errors: **truncation** (`O(ε²)` for central differences) and **floating-point cancellation** (which grows as ε shrinks). The "right" ε is problem-dependent, so you end up stacking tolerances on tolerances. PyTorch (or any other automatic-differentiation library) applies the exact chain rule operation by operation, so its result has no truncation error and is accurate up to floating-point rounding. That makes it reproducible to machine precision and the better oracle.
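The ε trade-off is easy to observe on a function whose derivative is known in closed form. The sketch below uses `math.exp` at `x = 1` as an assumed test function, with the analytic derivative standing in for what an automatic-differentiation oracle would return. `central_difference` is a helper written for this example.

```python
import math


def central_difference(f, x, eps):
    """Central finite-difference estimate of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


x = 1.0
exact = math.exp(x)  # d/dx exp(x) = exp(x); stand-in for the AD result

errors = {}
for eps in (1e-1, 1e-3, 1e-5, 1e-8, 1e-11, 1e-13):
    estimate = central_difference(math.exp, x, eps)
    errors[eps] = abs(estimate - exact)
    print(f"eps={eps:.0e}  abs error={errors[eps]:.3e}")
```

For large ε the truncation term dominates, and the error shrinks as ε does. For very small ε the subtraction `f(x + eps) - f(x - eps)` loses most of its significant digits, so the error typically rises again. No single ε is safe for every function, which is the argument for comparing against an AD result instead.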

q-07-05 — `zero_grad` necessity

1. Because backward leaks memory if grads aren't reset.
2. Because backward accumulates with `+=`; forgetting to zero means step N's gradient is the sum of steps 1..N, so the effective step size grows linearly and the loss diverges.
3. Because `optimizer.step` is non-idempotent.
4. Because PyTorch raises if `.grad` is not None.

Answer

**Choice 2.** The same `+=` that enables the diamond pattern also makes gradients persist across steps. If the per-step gradients are of similar size, then by step 100 the effective lr is roughly 100× the configured lr.
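A toy training loop makes the growth visible. This is a minimal sketch under stated assumptions: the loss is `3 * p`, so the true gradient is always 3, and `Param`, `backward`, and `train` are hypothetical names for this illustration rather than the phase's API.

```python
class Param:
    """A single trainable scalar with an accumulated gradient."""

    def __init__(self, data):
        self.data = data
        self.grad = 0.0


def backward(p):
    # loss = 3 * p.data, so d(loss)/dp = 3; accumulated with += as in autograd
    p.grad += 3.0


def train(steps, lr, zero_grad):
    p = Param(0.0)
    applied = []                 # gradient actually used at each step
    for _ in range(steps):
        if zero_grad:
            p.grad = 0.0         # reset before the new backward pass
        backward(p)
        p.data -= lr * p.grad    # plain SGD update
        applied.append(p.grad)
    return applied


with_reset = train(steps=4, lr=0.1, zero_grad=True)
without_reset = train(steps=4, lr=0.1, zero_grad=False)
print(with_reset)     # [3.0, 3.0, 3.0, 3.0]
print(without_reset)  # [3.0, 6.0, 9.0, 12.0]
```

Without the reset, the gradient applied at step N is N times the true gradient, which matches the linear growth in effective step size described in Choice 2.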
