Phase 7 — Quizzes
Readable mirror of data/quizzes/phase-07-scalar-autograd.yaml. Includes the "diamond" trace (`a` and `c` each appearing twice).
Source of truth: data/quizzes/phase-07-scalar-autograd.yaml.
q-07-01 — Why += not = in _backward accumulation?
- Because `+=` is faster.
- Because a node can appear multiple times in the graph (diamond); each occurrence contributes a partial gradient that must be summed; `=` overwrites all but the last.
- Because `=` breaks the topological sort.
- Because `=` is not implemented for floats.
Answer
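The difference is easy to reproduce in a toy graph. The sketch below assumes a hypothetical `Value` class (its `_parents`, `_deliver`, and `accumulate` names are illustrative, not taken from the course code) and uses `loss = a*a + a`, where `a` is consumed three times:

```python
class Value:
    """Hypothetical scalar autograd node; all names are illustrative."""

    accumulate = True  # True: deliver with +=, False: deliver with =

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None

    @staticmethod
    def _deliver(parent, partial):
        # Hand one partial gradient to a parent (input) node.
        if Value.accumulate:
            parent.grad += partial  # sums the contribution of every use
        else:
            parent.grad = partial   # bug: only the last contribution survives

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            Value._deliver(self, out.grad)
            Value._deliver(other, out.grad)

        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            Value._deliver(self, other.data * out.grad)
            Value._deliver(other, self.data * out.grad)

        out._backward = _backward
        return out

    def backward(self):
        # Build a topological order, then run closures in reverse.
        order, seen = [], set()

        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()


def grad_of_a(accumulate):
    """Return dloss/da for loss = a*a + a at a = 3."""
    Value.accumulate = accumulate
    a = Value(3.0)
    loss = (a * a) + a
    loss.backward()
    return a.grad
```

Under these assumptions, `grad_of_a(True)` gives `2a + 1 = 7` at `a = 3`, while `grad_of_a(False)` returns only the last partial delivered to `a`, which is 3 here.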
**Choice 2.** The diamond pattern is the canonical failure: `a` feeds into two children, and both feed the loss. Each child contributes a partial gradient. With `=`, only the last one sticks, so tests without diamonds pass and the diamond test fails.
q-07-02 — Why reverse topological order? (multi-choice)
- When N's `_backward` runs, it reads `N.grad`, which must already hold the total upstream gradient.
- Reverse topo guarantees every child of N has already run, so `N.grad` is complete.
- Visiting in forward order during backward propagates gradients before upstream values are known.
- Reverse topo is required for forward-mode AD too.
- Order doesn't matter as long as `+=` is used.
Answer
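The effect of visiting order can be isolated without any autograd class. A minimal sketch, assuming a hypothetical four-op graph (`b = a*3`, `c = b*2`, `d = b*5`, `L = c + d`) stored as a table of local derivatives; both runs use `+=`, and only the order differs:

```python
# Hypothetical graph: b = a*3, c = b*2, d = b*5, L = c + d.
# Each node maps to (parent, d(node)/d(parent)) pairs.
LOCAL_GRADS = {
    "L": [("c", 1.0), ("d", 1.0)],
    "c": [("b", 2.0)],
    "d": [("b", 5.0)],
    "b": [("a", 3.0)],
    "a": [],
}


def backward(order):
    """Visit nodes in `order`, pushing each node's grad to its parents with +=."""
    grad = {name: 0.0 for name in LOCAL_GRADS}
    grad["L"] = 1.0  # seed: dL/dL = 1
    for node in order:
        for parent, local in LOCAL_GRADS[node]:
            grad[parent] += local * grad[node]  # chain rule, accumulated
    return grad


good = backward(["L", "c", "d", "b", "a"])  # reverse topological order
bad = backward(["L", "c", "b", "d", "a"])   # b runs before d has contributed

print(good["a"], bad["a"])
```

In the second order, `b` pushes its gradient to `a` while `b`'s gradient holds only the contribution from `c`. The gradient of `b` itself still ends up correct, but `a` receives 6 instead of 21.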
**Choices 1, 2, 3.** Choice 4 is wrong (forward-mode goes forward). Choice 5 is the common misconception: even with `+=`, a node visited too early reads an incomplete `.grad` and propagates the wrong magnitude to its parents.
q-07-03 — Diamond trace (free)
L = (a*b + c)*(a - c) with a=2, b=3, c=4. Compute the gradients and show the two paths to ∂L/∂a.
Answer
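One way to check the hand trace is to unroll the backward pass in plain Python. This is a sketch rather than the course's implementation; `e` and `f` are taken to be the two factors `a*b + c` and `a - c`, and the `grad_*` names are illustrative:

```python
a, b, c = 2.0, 3.0, 4.0

# Forward pass with named intermediates.
e = a * b + c   # first factor
f = a - c       # second factor
L = e * f

# Backward pass, seeded with dL/dL = 1.
dL_de = f       # d(e*f)/de
dL_df = e       # d(e*f)/df

# `a` is reached through both factors, so its two partials are summed.
grad_a = 0.0
grad_a += dL_de * b      # path through e = a*b + c
grad_a += dL_df * 1.0    # path through f = a - c

# `b` is reached through e only.
grad_b = dL_de * a

# `c` is also reached through both factors, with opposite local signs.
grad_c = 0.0
grad_c += dL_de * 1.0    # path through e
grad_c += dL_df * -1.0   # path through f

print(grad_a, grad_b, grad_c)
```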
With `e = a·b + c = 10` and `f = a - c = -2` (so `L = e · f = -20`):
- **∂L/∂a** via `(a·b + c)`: `f · b = -2 · 3 = -6`.
- **∂L/∂a** via `(a - c)`: `e · 1 = 10 · 1 = 10`.
- **Sum:** `-6 + 10 = 4`.
- **∂L/∂b** = `f · a = -2 · 2 = -4`.
- **∂L/∂c** = `f · 1 + e · (-1) = -2 + (-10) = -12`.
q-07-04 — Finite differences vs PyTorch as oracle (free)
Answer
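The two error sources can be seen by sweeping ε on a function with a known derivative. A sketch under assumptions: `f(x) = x**3` at `x = 2` and the three ε values are arbitrary illustrative choices, not part of the quiz data:

```python
def f(x):
    return x ** 3  # illustrative function; its exact derivative is 3*x**2


def central_difference(fn, x, eps):
    """Estimate fn'(x) as (fn(x + eps) - fn(x - eps)) / (2 * eps)."""
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)


x = 2.0
exact = 3 * x ** 2

# Absolute error of the estimate for a large, a moderate, and a tiny eps.
errors = {
    eps: abs(central_difference(f, x, eps) - exact)
    for eps in (1e-1, 1e-5, 1e-13)
}

for eps, err in errors.items():
    print(f"eps={eps:g}  abs error={err:.3e}")
```

The error shrinks from the large ε to the moderate one as truncation falls, then grows again at the tiny ε once cancellation dominates. The exact figures depend on the platform's floating-point behaviour.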
Finite differences carry two errors: **truncation** (`O(ε²)` for central differences) and **floating-point cancellation** (which grows as ε shrinks). The "right" ε is problem-dependent, so you end up stacking tolerances on tolerances. PyTorch (or any automatic-differentiation library) applies the exact derivative rule for each operation instead of approximating a limit, so its result is reproducible up to floating-point rounding. That makes it the better oracle.
q-07-05 — zero_grad necessity
- Because backward leaks memory if grads aren't reset.
- Because backward accumulates with `+=`; forgetting to zero means step N's gradient is the sum of steps 1..N, so the effective step size grows linearly and the loss diverges.
- Because `optimizer.step` is non-idempotent.
- Because PyTorch raises if `.grad` is not None.