Phase 4 — Quizzes
Readable mirror of data/quizzes/phase-04-calculus-optimization.yaml. It includes the canonical derivation `∂L/∂z = p - y` and the rationale for warmup in Adam.
Source of truth: data/quizzes/phase-04-calculus-optimization.yaml.
q-04-01 — Why does momentum suppress zig-zag in a thin valley?
- Momentum projects gradients onto the principal eigenvector of the Hessian.
- The exponential moving average of gradients makes cross-valley components cancel (sign alternates) while along-valley components reinforce (sign consistent).
- Momentum reduces the learning rate adaptively per-step.
- Momentum applies a sign function to the gradient.
Answer
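The cancellation argument behind this question can be checked numerically. The sketch below is illustrative only: it assumes a made-up gradient sequence whose along-valley component is constant and whose cross-valley component flips sign every step, and it uses the EMA form `v = β·v + (1 - β)·g`. The function name `ema_of_gradients` and all constants are hypothetical, not taken from the quiz source.

```python
def ema_of_gradients(grads, beta=0.9):
    """Return the exponential moving average after consuming all gradients."""
    v = [0.0, 0.0]
    for g in grads:
        v = [beta * vi + (1.0 - beta) * gi for vi, gi in zip(v, g)]
    return v


# Hypothetical gradients: (along-valley, cross-valley).
# The along-valley part keeps its sign; the cross-valley part alternates.
raw_cross = 10.0
grads = [(1.0, raw_cross * (-1) ** t) for t in range(200)]

along, cross = ema_of_gradients(grads)
print(f"along-valley EMA: {along:.3f}")
print(f"cross-valley EMA magnitude: {abs(cross):.3f} (raw magnitude {raw_cross})")
```

Under these assumptions, with β = 0.9 the alternating component settles at `(1 - β)/(1 + β)` times its raw size (roughly 5%), while the constant component passes through at essentially full size.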
**Choice 2.** No projection or sign nonlinearity is involved: the averaging cancels alternating-sign components and accumulates same-sign ones.
q-04-02 — Optimizer-family statements (multi-choice)
- AdaGrad divides by sqrt of cumulative sum of squared gradients.
- RMSProp replaces the cumulative sum with an EMA.
- Adam = RMSProp + momentum + bias correction.
- AdamW couples weight decay into the gradient before EMA updates.
- SGD-with-momentum has zero memory state per parameter beyond θ.
Answer
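The coupled versus decoupled distinction that this question turns on can be made concrete with a scalar sketch. It assumes the usual Adam update with bias correction. The function `adam_step` and its `decoupled` flag are hypothetical names for illustration, not any library's API. The true gradient is set to zero so that only the weight-decay term moves the parameter.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, decoupled=True):
    """One scalar Adam step; returns the new (theta, m, v)."""
    if not decoupled:
        # Coupled (L2-style): decay is folded into the gradient,
        # so it also flows through both moment estimates.
        grad = grad + weight_decay * theta
    m = beta1 * m + (1.0 - beta1) * grad          # momentum (first moment)
    v = beta2 * v + (1.0 - beta2) * grad * grad   # RMSProp-style second moment
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    update = m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        # Decoupled (AdamW-style): decay bypasses the moment estimates.
        update = update + weight_decay * theta
    return theta - lr * update, m, v


theta_coupled, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1,
                                weight_decay=0.01, decoupled=False)
theta_decoupled, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1,
                                  weight_decay=0.01, decoupled=True)
print(theta_coupled, theta_decoupled)
```

In this toy setting the coupled step moves the parameter by about `lr`, because the decay term is rescaled by the second-moment normalization, while the decoupled step shrinks it by `lr * weight_decay * theta`. The returned `m` and `v` are also the per-parameter state an optimizer has to store beyond θ.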
**Choices 1, 2, 3.** Choice 4 describes the *coupled* (wrong) version; the weight decay in AdamW is *decoupled* from the gradient-based update. Choice 5 ignores the velocity vector `v`, which momentum stores per parameter.
q-04-03 — Softmax cross-entropy gradient (free)
Answer
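The result for this question, `∂L/∂z_i = p_i - y_i`, can be sanity-checked against central finite differences. This is a minimal sketch assuming `L = -Σ_j y_j log p_j` with `p = softmax(z)` and a one-hot target. The helper names and the example logits are hypothetical.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    shift = max(z)
    exps = [math.exp(zi - shift) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(z, y):
    """L = -sum_j y_j * log p_j with p = softmax(z)."""
    return -sum(yj * math.log(pj) for yj, pj in zip(y, softmax(z)))


def numeric_grad(z, y, h=1e-6):
    """Central finite-difference estimate of dL/dz."""
    grad = []
    for i in range(len(z)):
        z_plus, z_minus = list(z), list(z)
        z_plus[i] += h
        z_minus[i] -= h
        grad.append((cross_entropy(z_plus, y) - cross_entropy(z_minus, y)) / (2 * h))
    return grad


z = [2.0, -1.0, 0.5]   # example logits (arbitrary)
y = [0.0, 1.0, 0.0]    # one-hot target, so sum(y) == 1

analytic = [pi - yi for pi, yi in zip(softmax(z), y)]
numeric = numeric_grad(z, y)
max_err = max(abs(a - n) for a, n in zip(analytic, numeric))
print("max |analytic - numeric| =", max_err)
```

The gradient components also sum to zero, since both `p` and `y` sum to 1.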
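Result: `∂L/∂z_i = p_i - y_i`. Derivation: with `L = -Σ_j y_j log p_j` and `p = softmax(z)`, we have `∂L/∂p_j = -y_j/p_j` and `∂p_j/∂z_i = p_j (δ_ij - p_i)`; then `∂L/∂z_i = Σ_j (-y_j/p_j) · p_j (δ_ij - p_i) = -y_i + p_i Σ_j y_j = p_i - y_i` because `Σ_j y_j = 1`. This is the formula behind the "probabilities minus targets" shorthand (sometimes loosely called "logits minus targets").
q-04-04 — When does warmup matter? (free)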
Answer
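One way to see why the earliest Adam steps are delicate: at `t = 1` the bias-corrected update is `lr · g / (|g| + eps)`, so its magnitude is about `lr` no matter how small or noisy the single gradient sample is. The sketch below shows this alongside a linear warmup schedule. It assumes standard Adam defaults. `linear_warmup_lr` and `first_adam_update` are hypothetical helper names, and the warmup length is arbitrary.

```python
import math


def linear_warmup_lr(step, base_lr, warmup_steps):
    """Ramp linearly from 0 to base_lr over warmup_steps, then hold base_lr."""
    if warmup_steps <= 0:
        return base_lr
    return base_lr * min(1.0, step / warmup_steps)


def first_adam_update(grad, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of the very first Adam update (t = 1) for a scalar gradient."""
    m = (1.0 - beta1) * grad
    v = (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1)   # bias correction at t = 1
    v_hat = v / (1.0 - beta2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)


base_lr = 1e-3
for g in (1e-3, 1.0, 100.0):
    # Without warmup the first step is about base_lr for every gradient scale.
    print(g, first_adam_update(g, base_lr))

# With warmup, step 1 of a 1000-step ramp uses a 1000x smaller learning rate.
print(first_adam_update(1.0, linear_warmup_lr(1, base_lr, 1000)))
```

Scaling `lr` down during the first steps bounds these updates while the second-moment EMA accumulates more samples.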
Early in training, Adam's second-moment estimate (EMA of g²) has high variance from few samples, so the bias-corrected step can be large and noisy. Linearly ramping `lr` from 0 keeps the unstable early steps small, giving activations time to stabilize.
q-04-05 — Forward-mode vs reverse-mode AD
- When m >> n; forward-mode does O(n) sweeps per output.
- When n >> m; reverse-mode does O(m) sweeps per input gradient, vs forward-mode's O(n).
- Reverse-mode is always faster.
- Identical complexity.
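The sweep counts these choices refer to can be illustrated with a toy implementation. This sketch assumes the convention `f: R^n → R^m` with n inputs and m outputs, which the options suggest but do not state, and uses a scalar-output function (m = 1). The classes `Dual` and `Var` and the helper functions are hypothetical, minimal stand-ins for a real autodiff system.

```python
class Dual:
    """Forward-mode value carrying one directional derivative."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)


class Var:
    """Reverse-mode node recording (parent, local partial) pairs."""
    def __init__(self, val, parents=()):
        self.val, self.parents, self.grad = val, parents, 0.0

    def __add__(self, other):
        return Var(self.val + other.val, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.val * other.val,
                   ((self, other.val), (other, self.val)))


def backward(out):
    """One reverse sweep: propagate d(out)/d(node) to every node."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(out)
    out.grad = 1.0
    for node in reversed(order):  # outputs first, inputs last
        for parent, local in node.parents:
            parent.grad += local * node.grad


def f(xs):
    """Sum of squares: n inputs, one output."""
    total = xs[0] * xs[0]
    for x in xs[1:]:
        total = total + x * x
    return total


def forward_mode_gradient(values):
    """One forward sweep per input; returns (gradient, sweep count)."""
    grads = []
    for i in range(len(values)):
        seeded = [Dual(x, 1.0 if j == i else 0.0) for j, x in enumerate(values)]
        grads.append(f(seeded).dot)
    return grads, len(values)


def reverse_mode_gradient(values):
    """One backward sweep per output (here one); returns (gradient, sweep count)."""
    xs = [Var(x) for x in values]
    backward(f(xs))
    return [x.grad for x in xs], 1


fwd_grad, fwd_sweeps = forward_mode_gradient([1.0, 2.0, 3.0])
rev_grad, rev_sweeps = reverse_mode_gradient([1.0, 2.0, 3.0])
print(fwd_grad, fwd_sweeps)
print(rev_grad, rev_sweeps)
```

Both modes produce the same gradient here. What differs is the number of sweeps: one per input for forward mode, one per output for reverse mode.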