Phase 4 — Quizzes

Readable mirror of data/quizzes/phase-04-calculus-optimization.yaml. Includes the canonical derivation ∂L/∂z = p - y and the rationale for warmup in Adam.

Source of truth: data/quizzes/phase-04-calculus-optimization.yaml.


q-04-01 — Why does momentum suppress zig-zag in a thin valley?

  1. Momentum projects gradients onto the principal eigenvector of the Hessian.
  2. The exponential moving average of gradients makes cross-valley components cancel (sign alternates) while along-valley components reinforce (sign consistent).
  3. Momentum reduces the learning rate adaptively per-step.
  4. Momentum applies a sign function to the gradient.
Answer **Choice 2.** No projection or sign nonlinearity involved — just averaging that cancels alternating-sign components and accumulates same-sign ones.
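The cancellation argument can be checked numerically. The sketch below is illustrative only: it assumes a hypothetical quadratic valley `f(x, y) = 0.5 * (a*x² + b*y²)` with `b >> a`, runs plain gradient descent with a step size chosen so the steep `y` coordinate overshoots on every step, and tracks an exponential moving average (EMA) of the gradients. The function name and all constants are assumptions made for this example.

```python
def ema_of_gradients(a=1.0, b=50.0, lr=0.039, beta=0.9, steps=50):
    """Run plain GD on 0.5*(a*x^2 + b*y^2) and track an EMA of the gradient.

    Returns ((g_x, m_x), (g_y, m_y)): the last raw gradient and the EMA
    for the along-valley (x) and cross-valley (y) directions.
    """
    x, y = 1.0, 1.0
    m_x = m_y = 0.0
    g_x = g_y = 0.0
    for _ in range(steps):
        g_x, g_y = a * x, b * y              # gradient of the quadratic
        m_x = beta * m_x + (1 - beta) * g_x  # EMA, along-valley component
        m_y = beta * m_y + (1 - beta) * g_y  # EMA, cross-valley component
        x -= lr * g_x                        # x shrinks slowly, sign stays fixed
        y -= lr * g_y                        # y overshoots, sign flips each step
    return (g_x, m_x), (g_y, m_y)


(g_x, m_x), (g_y, m_y) = ema_of_gradients()
print("along-valley  |EMA| / |grad|:", abs(m_x / g_x))
print("cross-valley  |EMA| / |grad|:", abs(m_y / g_y))
```

With these constants the cross-valley ratio comes out far below 1 (the alternating signs cancel), while the along-valley ratio stays at or above 1 (same-sign gradients accumulate).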

q-04-02 — Optimizer-family statements (multi-choice)

  1. AdaGrad divides by sqrt of cumulative sum of squared gradients.
  2. RMSProp replaces the cumulative sum with an EMA.
  3. Adam = RMSProp + momentum + bias correction.
  4. AdamW couples weight decay into the gradient before EMA updates.
  5. SGD-with-momentum has zero memory state per parameter beyond θ.
Answer **Choices 1, 2, 3.** Choice 4 is the *coupled* (wrong) version; AdamW's W is *decoupled*. Choice 5 ignores the velocity vector v.
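The coupled/decoupled distinction is easiest to see in a single scalar update. The following is a minimal sketch, not the implementation of any particular library: `adam_step` and its `decoupled` flag are hypothetical names, and the decoupled branch applies the decay to the pre-update parameter value.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, decoupled=True):
    """One scalar Adam update with either coupled or decoupled weight decay."""
    if not decoupled:
        # Coupled (L2-style): decay is folded into the gradient, so it also
        # passes through the moment EMAs and the adaptive normalization.
        grad = grad + weight_decay * theta
    m = beta1 * m + (1 - beta1) * grad          # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment (RMSProp-style)
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    new_theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        # Decoupled (AdamW-style): decay acts on the weights directly and
        # bypasses the adaptive scaling.
        new_theta -= lr * weight_decay * theta
    return new_theta, m, v


# Zero loss gradient, so only the decay term moves the parameter.
theta_decoupled, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, weight_decay=0.1)
theta_coupled, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, weight_decay=0.1,
                                decoupled=False)
print(theta_decoupled, theta_coupled)
```

In this toy case the decoupled step shrinks the weight by `lr * weight_decay`, whereas the coupled step is rescaled by the second-moment normalization to roughly `lr`, independent of the decay coefficient.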

q-04-03 — Softmax cross-entropy gradient (free)

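Answer Result: `∂L/∂z_i = p_i - y_i`. Derivation: with `L = -Σ_j y_j log p_j` and `p = softmax(z)`, we have `∂L/∂p_j = -y_j/p_j` and the softmax Jacobian `∂p_j/∂z_i = p_j (δ_ij - p_i)`. By the chain rule, `∂L/∂z_i = Σ_j (-y_j/p_j) · p_j (δ_ij - p_i) = -y_i + p_i Σ_j y_j = p_i - y_i`, because `Σ_j y_j = 1`. This is the formula behind "logits minus targets."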
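The closed form can be checked against central finite differences. This is a small standard-library sketch; the logits `z` and the one-hot target `y` are arbitrary values chosen for illustration.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(z, y):
    """L = -sum_j y_j * log(softmax(z)_j)."""
    return -sum(y_j * math.log(p_j) for y_j, p_j in zip(y, softmax(z)))


def numerical_grad(z, y, h=1e-6):
    """Central-difference estimate of dL/dz_i for each logit."""
    grads = []
    for i in range(len(z)):
        z_plus, z_minus = list(z), list(z)
        z_plus[i] += h
        z_minus[i] -= h
        grads.append((cross_entropy(z_plus, y) - cross_entropy(z_minus, y)) / (2 * h))
    return grads


z = [2.0, -1.0, 0.5]   # illustrative logits
y = [0.0, 1.0, 0.0]    # one-hot target
analytic = [p_i - y_i for p_i, y_i in zip(softmax(z), y)]
numeric = numerical_grad(z, y)
print(analytic)
print(numeric)
```

The two lists should agree to within finite-difference error, and the analytic gradient sums to zero because both `p` and `y` sum to 1.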

q-04-04 — When does warmup matter? (free)

Answer Early in training, Adam's second-moment estimate (EMA of g²) has high variance from few samples, so the bias-corrected step can be large and noisy. Linearly ramping `lr` from 0 keeps the unstable early steps small, giving activations time to stabilize.
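One way to see why the early steps are not automatically small: on the very first step, bias correction gives `m_hat = g` and `v_hat = g²`, so the update has magnitude close to `lr` whatever the gradient scale. The sketch below pairs that observation with a linear warmup schedule. `warmup_lr`, `first_adam_update`, and the default of 1000 warmup steps are hypothetical choices for illustration.

```python
import math


def warmup_lr(step, base_lr=1e-3, warmup_steps=1000):
    """Linear ramp from 0 to base_lr over warmup_steps, then constant."""
    return base_lr * min(1.0, step / warmup_steps)


def first_adam_update(grad, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of the first bias-corrected Adam update (t = 1, zero-initialized state)."""
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1)   # equals grad at t = 1
    v_hat = v / (1 - beta2)   # equals grad**2 at t = 1
    return lr * m_hat / (math.sqrt(v_hat) + eps)


# Without warmup, a tiny and a huge gradient both move the parameter by ~lr.
print(first_adam_update(1e-3, lr=1e-3), first_adam_update(100.0, lr=1e-3))

# With warmup, the same first step is scaled down by the ramped learning rate.
print(first_adam_update(100.0, lr=warmup_lr(1)))
```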

q-04-05 — Forward-mode vs reverse-mode AD

  1. When m >> n; forward-mode does O(n) sweeps per output.
  2. When n >> m; reverse-mode does O(m) sweeps per input gradient, vs forward-mode's O(n).
  3. Reverse-mode is always faster.
  4. Identical complexity.
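Answer **Choice 2.** For f: R^n → R^m, reverse-mode needs one sweep per output (m in total) to recover all derivatives, while forward-mode needs one sweep per input (n in total). Loss functions are R^n → R (m=1, n in the millions), so reverse-mode is the natural choice: one backward pass instead of n forward sweeps.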
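The sweep counts can be made concrete with a toy scalar-valued function. This sketch assumes `f(x) = Σ x_i²` and uses a minimal dual-number class for forward-mode. The reverse pass is hand-derived for this particular `f` rather than produced by a general tape, so treat it as an illustration of the cost difference, not as an AD library.

```python
class Dual:
    """Minimal dual number (value, tangent) supporting + and *."""

    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan

    def __add__(self, other):
        return Dual(self.val + other.val, self.tan + other.tan)

    def __mul__(self, other):
        # Product rule on the tangent part.
        return Dual(self.val * other.val,
                    self.tan * other.val + self.val * other.tan)


def f(xs):
    """f: R^n -> R, f(x) = sum of squares (works on any type with + and *)."""
    total = xs[0] * xs[0]
    for x in xs[1:]:
        total = total + x * x
    return total


def grad_forward(xs):
    """Forward-mode: one sweep per input, each seeding a single basis tangent."""
    grad, sweeps = [], 0
    for i in range(len(xs)):
        duals = [Dual(x, 1.0 if j == i else 0.0) for j, x in enumerate(xs)]
        grad.append(f(duals).tan)
        sweeps += 1
    return grad, sweeps


def grad_reverse(xs):
    """Reverse-mode, hand-written for this f: one backward sweep from the output."""
    out_bar = 1.0                              # seed d f / d f = 1
    grad = [out_bar * 2.0 * x for x in xs]     # d f / d x_i = 2 * x_i
    return grad, 1


xs = [1.0, 2.0, 3.0, 4.0]
g_fwd, sweeps_fwd = grad_forward(xs)
g_rev, sweeps_rev = grad_reverse(xs)
print(g_fwd, sweeps_fwd)
print(g_rev, sweeps_rev)
```

Both routes return the same gradient, but forward-mode needed `n = 4` sweeps where the reverse pass needed one, which is the gap that grows with the number of parameters.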