Phase 7 — Quizzes

Readable mirror of data/quizzes/phase-07-scalar-autograd.yaml. Includes the "diamond" trace (a and c each appearing twice).

Source of truth: data/quizzes/phase-07-scalar-autograd.yaml.


q-07-01 — Why += not = in _backward accumulation?

  1. Because += is faster.
  2. Because a node can appear multiple times in the graph (diamond); each occurrence contributes a partial gradient that must be summed; = overwrites all but the last.
  3. Because = breaks the topological sort.
  4. Because = is not implemented for floats.
Answer: **Choice 2.** The diamond pattern is the canonical failure: `a` feeds into two children, both of which feed the loss, and each contributes a partial gradient. With `=`, only the last write sticks, so tests without diamonds pass and the diamond test fails.
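A minimal sketch of this failure, assuming a toy `Value` class written only for illustration (it is not the course's implementation; the `accumulate` flag and the `_receive` helper are hypothetical). It builds the smallest diamond, `L = 3a + 5a`, and runs backward once with `+=` and once with `=`:

```python
class Value:
    """Toy scalar node; `accumulate` toggles += versus = (illustration only)."""

    def __init__(self, data, accumulate=True):
        self.data = data
        self.grad = 0.0
        self.accumulate = accumulate
        self._backward = lambda: None

    def _receive(self, g):
        # The line under test: sum incoming partials, or overwrite them.
        if self.accumulate:
            self.grad += g
        else:
            self.grad = g

    def __mul__(self, other):
        out = Value(self.data * other.data, self.accumulate)

        def _backward():
            self._receive(other.data * out.grad)
            other._receive(self.data * out.grad)

        out._backward = _backward
        return out

    def __add__(self, other):
        out = Value(self.data + other.data, self.accumulate)

        def _backward():
            self._receive(out.grad)
            other._receive(out.grad)

        out._backward = _backward
        return out


def grad_a(accumulate):
    a = Value(2.0, accumulate)
    p = a * Value(3.0, accumulate)  # first consumer of a
    q = a * Value(5.0, accumulate)  # second consumer of a
    loss = p + q                    # L = 3a + 5a, so dL/da = 8
    loss.grad = 1.0
    # Hand-written reverse topological order for this tiny graph.
    for node in (loss, q, p):
        node._backward()
    return a.grad


print(grad_a(accumulate=True))   # 8.0
print(grad_a(accumulate=False))  # 3.0
```

With `=`, the partial from `q` (5) is overwritten by the partial from `p` (3), so which wrong answer you get depends on visiting order.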

q-07-02 — Why reverse topological order? (multi-choice)

  1. When N's _backward runs, it reads N.grad, which must already hold the total upstream gradient.
  2. Reverse topo guarantees every child of N has already run, so N.grad is complete.
  3. Visiting in forward order during backward propagates gradients before upstream values are known.
  4. Reverse topo is required for forward-mode AD too.
  5. Order doesn't matter as long as += is used.
Answer: **Choices 1, 2, 3.** Choice 4 is wrong: forward-mode AD propagates derivatives from inputs to outputs, in forward topological order. Choice 5 is the common misconception: even with `+=`, a node visited too early reads an incomplete `.grad` and propagates the wrong magnitude to its parents.
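The ordering argument can be checked with a small sketch. It assumes a toy `Value` class and a `topo_order` helper written for this example (both hypothetical), and compares three visiting orders on a graph where the interior node `n` has two consumers. All three use `+=`:

```python
class Value:
    """Toy scalar node for the ordering demo (mul and add only)."""

    def __init__(self, data, prev=()):
        self.data = data
        self.grad = 0.0
        self._prev = prev
        self._backward = lambda: None

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out


def topo_order(root):
    """Depth-first post-order: each node appears after all of its inputs."""
    order, seen = [], set()

    def visit(v):
        if v not in seen:
            seen.add(v)
            for p in v._prev:
                visit(p)
            order.append(v)

    visit(root)
    return order


def grad_a(kind):
    a = Value(1.0)
    n = a * Value(2.0)   # interior node with two consumers, p and q
    p = n * Value(3.0)
    q = n * Value(5.0)
    loss = p + q         # L = 16a, so dL/da = 16
    topo = topo_order(loss)
    orders = {
        "reverse_topo": topo[::-1],
        "forward": topo,
        "n_too_early": [loss, p, n, q],  # n runs before q has contributed
    }
    loss.grad = 1.0
    for node in orders[kind]:
        node._backward()
    return a.grad


for kind in ("reverse_topo", "n_too_early", "forward"):
    print(kind, grad_a(kind))
# reverse_topo 16.0
# n_too_early 6.0
# forward 0.0
```

In the `n_too_early` run, `n.grad` does eventually reach 8, but `n._backward` has already pushed `2 · 3 = 6` to `a` by then, and nothing revisits it.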

q-07-03 — Diamond trace (free)

L = (a*b + c)*(a - c) with a=2, b=3, c=4. Compute the gradients and show the two paths to ∂L/∂a.

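Answer: Let `e = a·b + c = 10` and `f = a - c = -2`, so `L = e · f = -20`, `∂L/∂e = f`, and `∂L/∂f = e`.

- **∂L/∂a** via `(a·b + c)`: `f · b = -2 · 3 = -6`.
- **∂L/∂a** via `(a - c)`: `e · 1 = 10 · 1 = 10`.
- **Sum:** `∂L/∂a = -6 + 10 = 4`.
- **∂L/∂b** = `f · a = -2 · 2 = -4`.
- **∂L/∂c** = `f · 1 + e · (-1) = -2 + (-10) = -12`.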
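The trace can be reproduced in a few lines of plain Python. This is only a sketch: the function names `analytic_grads` and `central_difference` are made up for the example, and the central-difference check is a sanity test, not the quiz's grading method.

```python
def loss(a, b, c):
    return (a * b + c) * (a - c)


def analytic_grads(a, b, c):
    e = a * b + c          # first factor
    f = a - c              # second factor
    dL_de, dL_df = f, e    # product rule on L = e * f
    grad_a = dL_de * b + dL_df * 1.0      # two paths into a, summed
    grad_b = dL_de * a                    # single path into b
    grad_c = dL_de * 1.0 + dL_df * -1.0   # two paths into c, summed
    return grad_a, grad_b, grad_c


def central_difference(fn, args, i, eps=1e-5):
    """Numerical partial derivative of fn with respect to args[i]."""
    hi, lo = list(args), list(args)
    hi[i] += eps
    lo[i] -= eps
    return (fn(*hi) - fn(*lo)) / (2 * eps)


point = (2.0, 3.0, 4.0)
grads = analytic_grads(*point)
numeric = tuple(central_difference(loss, point, i) for i in range(3))

print(grads)  # (4.0, -4.0, -12.0)
print(all(abs(g - n) < 1e-6 for g, n in zip(grads, numeric)))  # True
```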

q-07-04 — Finite differences vs PyTorch as oracle (free)

Answer: Finite differences carry two errors: **truncation** (`O(ε²)` for central differences, which grows with ε) and **floating-point cancellation** (which grows as ε shrinks). The "right" ε is problem-dependent, so the test tolerance ends up resting on a second tuned quantity. PyTorch (or any automatic-differentiation engine) applies the chain rule to exact local derivatives, so its gradient carries only floating-point rounding error and has no ε to tune. That makes it the better oracle.
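The trade-off between the two error sources is easy to see on a function whose derivative is known in closed form. The sketch below uses `math.exp` at `x = 1` as a stand-in for the exact oracle (an assumption of this example, chosen so it runs without PyTorch) and sweeps ε:

```python
import math


def central_difference(fn, x, eps):
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)


def abs_error(eps, x=1.0):
    # d/dx exp(x) = exp(x), so the exact reference is available directly.
    return abs(central_difference(math.exp, x, eps) - math.exp(x))


# Large eps: truncation dominates. Tiny eps: cancellation dominates.
for eps in (1e-1, 1e-3, 1e-5, 1e-7, 1e-9, 1e-11, 1e-13):
    print(f"eps={eps:.0e}  abs error={abs_error(eps):.3e}")
```

The error shrinks as ε decreases from `1e-1`, bottoms out somewhere in the middle of the sweep, then grows again as the subtraction loses significant digits. Where the minimum falls depends on the function and the evaluation point.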

q-07-05 — zero_grad necessity

  1. Because backward leaks memory if grads aren't reset.
  2. Because backward accumulates with +=; forgetting to zero means step N's gradient is the sum of steps 1..N — effective step size grows linearly, loss diverges.
  3. Because optimizer.step is non-idempotent.
  4. Because PyTorch raises if .grad is not None.
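Answer: **Choice 2.** The same `+=` that enables the diamond pattern also makes gradients persist across steps. If the per-step gradient stays roughly constant, then by step 100 the accumulated `.grad` is about 100× a single step's gradient, so the update behaves as if lr were 100× the configured value.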
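A minimal sketch of the accumulation, assuming a hypothetical `Param` class and a linear loss `L = 3w` so that every backward pass contributes the same gradient of 3. The training loop is hand-rolled for illustration and does not use any real optimizer API:

```python
class Param:
    """Toy parameter holding a value and an accumulated gradient."""

    def __init__(self, data):
        self.data = data
        self.grad = 0.0


def run(steps, zero_grad, lr=0.01):
    w = Param(0.0)
    grads_seen = []
    for _ in range(steps):
        if zero_grad:
            w.grad = 0.0          # reset before each backward pass
        w.grad += 3.0             # backward of L = 3w accumulates with +=
        grads_seen.append(w.grad)
        w.data -= lr * w.grad     # plain SGD update uses whatever .grad holds
    return grads_seen


print(run(100, zero_grad=True)[-1])   # 3.0
print(run(100, zero_grad=False)[-1])  # 300.0
```

With a real loss the gradient changes from step to step, so the growth is not exactly linear, but stale gradients from earlier steps still contaminate every update.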