
Phase 4 — Quizzes

Readable mirror of data/quizzes/phase-04-calculus-optimization.yaml. Includes the canonical derivation ∂L/∂z = p - y and the rationale for warmup in Adam.

Source of truth: data/quizzes/phase-04-calculus-optimization.yaml.

q-04-01 — Why does momentum suppress zig-zag in a thin valley?

1. Momentum projects gradients onto the principal eigenvector of the Hessian.
2. The exponential moving average of gradients makes cross-valley components cancel (sign alternates) while along-valley components reinforce (sign consistent).
3. Momentum reduces the learning rate adaptively per-step.
4. Momentum applies a sign function to the gradient.

Answer

**Choice 2.** No projection, sign nonlinearity, or learning-rate adaptation is involved: the moving average simply cancels gradient components whose sign alternates and accumulates those whose sign stays the same.
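
The cancellation argument can be checked on synthetic gradient sequences. The sketch below is an illustration rather than a full optimizer: it assumes an EMA of the form `v = beta * v + (1 - beta) * g`, a cross-valley gradient that flips sign every step, and a small along-valley gradient with constant sign. The function name and the numbers are hypothetical.

```python
def ema_of_gradients(grads, beta=0.9):
    """Return the final exponential moving average v of a gradient sequence."""
    v = 0.0
    for g in grads:
        v = beta * v + (1.0 - beta) * g
    return v

steps = 100
cross_valley = [(-1.0) ** k for k in range(steps)]  # sign alternates every step
along_valley = [0.1] * steps                        # small, but sign is consistent

v_cross = ema_of_gradients(cross_valley)
v_along = ema_of_gradients(along_valley)

print(f"cross-valley: raw |g| = 1.0, EMA |v| = {abs(v_cross):.3f}")
print(f"along-valley: raw |g| = 0.1, EMA |v| = {abs(v_along):.3f}")
```

Under these assumptions the alternating component settles near `(1 - beta) / (1 + beta)` times its raw magnitude, while the consistent component is passed through almost unchanged, so the averaged direction tilts toward the valley floor.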

q-04-02 — Optimizer-family statements (multi-choice)

1. AdaGrad divides by sqrt of cumulative sum of squared gradients.
2. RMSProp replaces the cumulative sum with an EMA.
3. Adam = RMSProp + momentum + bias correction.
4. AdamW couples weight decay into the gradient before EMA updates.
5. SGD-with-momentum has zero memory state per parameter beyond θ.

Answer

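**Choices 1, 2, 3.** Choice 4 describes the *coupled* (wrong) version, where the decay term is added to the gradient and passes through the EMAs; AdamW *decouples* weight decay and applies it to the parameters directly. Choice 5 ignores the velocity vector `v`, which is one extra value of state per parameter.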
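
To make the family relationships concrete, here is a minimal single-parameter sketch of each update rule. It is written for illustration only: the function names, the `state` dictionary, and the default hyperparameters are assumptions, and real library implementations differ in details such as where `eps` is added.

```python
import math

def sgd_momentum_step(theta, g, state, lr=0.1, mu=0.9):
    state["v"] = mu * state.get("v", 0.0) + g            # velocity: extra per-parameter state
    return theta - lr * state["v"]

def adagrad_step(theta, g, state, lr=0.1, eps=1e-8):
    state["G"] = state.get("G", 0.0) + g * g             # cumulative sum of g^2
    return theta - lr * g / (math.sqrt(state["G"]) + eps)

def rmsprop_step(theta, g, state, lr=0.1, rho=0.9, eps=1e-8):
    state["s"] = rho * state.get("s", 0.0) + (1 - rho) * g * g   # EMA of g^2
    return theta - lr * g / (math.sqrt(state["s"]) + eps)

def adamw_step(theta, g, state, lr=0.1, b1=0.9, b2=0.999,
               eps=1e-8, weight_decay=0.0):
    t = state["t"] = state.get("t", 0) + 1
    state["m"] = b1 * state.get("m", 0.0) + (1 - b1) * g          # momentum (first moment)
    state["v"] = b2 * state.get("v", 0.0) + (1 - b2) * g * g      # RMSProp-style second moment
    m_hat = state["m"] / (1 - b1 ** t)                            # bias correction
    v_hat = state["v"] / (1 - b2 ** t)
    adam_update = lr * m_hat / (math.sqrt(v_hat) + eps)
    # Decoupled decay: applied to theta directly, never mixed into g or the EMAs.
    return theta - adam_update - lr * weight_decay * theta

# One step from theta = 1.0 with gradient g = 2.0 for each rule.
for step_fn in (sgd_momentum_step, adagrad_step, rmsprop_step, adamw_step):
    print(step_fn.__name__, step_fn(1.0, 2.0, {}))
```

With `weight_decay=0.0`, `adamw_step` reduces to plain Adam. The coupled variant in choice 4 would instead compute `g + weight_decay * theta` before updating `m` and `v`.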

q-04-03 — Softmax cross-entropy gradient (free)

Answer

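Result: `∂L/∂z_i = p_i - y_i`, where `p = softmax(z)` and `L = -Σ_j y_j log p_j`. Derivation: `∂L/∂p_j = -y_j/p_j` and `∂p_j/∂z_i = p_j (δ_ij - p_i)`; then `∂L/∂z_i = Σ_j (-y_j/p_j) · p_j (δ_ij - p_i) = -y_i + p_i Σ_j y_j = p_i - y_i` because `Σ_j y_j = 1`. This is the formula behind "probabilities minus targets": the gradient with respect to the logits is the softmax output minus the target distribution.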
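
The closed form can be sanity-checked against central finite differences. The sketch below uses only the standard library; the logits `z`, the one-hot target `y`, and the helper names are arbitrary choices made for this example.

```python
import math

def softmax(z):
    m = max(z)                                  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, y):
    p = softmax(z)
    return -sum(yj * math.log(pj) for yj, pj in zip(y, p))

def numeric_grad(z, y, h=1e-6):
    """Central-difference estimate of dL/dz_i for each logit."""
    grad = []
    for i in range(len(z)):
        z_plus = list(z)
        z_plus[i] += h
        z_minus = list(z)
        z_minus[i] -= h
        grad.append((cross_entropy(z_plus, y) - cross_entropy(z_minus, y)) / (2 * h))
    return grad

z = [2.0, -1.0, 0.5]     # example logits
y = [0.0, 1.0, 0.0]      # one-hot target (sums to 1)

analytic = [p_i - y_i for p_i, y_i in zip(softmax(z), y)]
numeric = numeric_grad(z, y)
max_err = max(abs(a - n) for a, n in zip(analytic, numeric))
print("analytic:", analytic)
print("numeric: ", numeric)
print("max abs difference:", max_err)
```

The two gradients should agree up to finite-difference error, and the analytic components sum to zero because both `p` and `y` sum to 1.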

q-04-04 — When does warmup matter? (free)

Answer

Early in training, Adam's second-moment estimate (the EMA of g²) is built from only a few samples and therefore has high variance, so the bias-corrected step can be large and noisy. Linearly ramping `lr` up from 0 keeps these unstable early steps small while the moment estimates and activations stabilize.
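
A small sketch of both halves of this answer, under stated assumptions: `linear_warmup_lr` is a hypothetical schedule (linear ramp, then constant), and `adam_first_update` evaluates the standard Adam formulas at step 1 from zero-initialized moments. At that step the bias-corrected ratio is roughly `g / |g|`, so the update magnitude is about `lr` no matter how small the gradient is.

```python
import math

def linear_warmup_lr(step, base_lr=1e-3, warmup_steps=1000):
    """Learning rate at `step`: ramps linearly from 0, then stays at base_lr."""
    return base_lr * min(1.0, step / warmup_steps)

def adam_first_update(g, lr, b1=0.9, b2=0.999, eps=1e-8):
    """Size of Adam's very first update, starting from m = v = 0."""
    m_hat = ((1 - b1) * g) / (1 - b1)          # bias-corrected first moment, equals g
    v_hat = ((1 - b2) * g * g) / (1 - b2)      # bias-corrected second moment, equals g^2
    return lr * m_hat / (math.sqrt(v_hat) + eps)

# The first step is about lr in magnitude for large and small gradients alike.
for g in (10.0, 1e-4):
    print(f"g = {g:g}: first update = {adam_first_update(g, lr=1e-3):.6f}")

# With warmup, the learning rate used for those early steps is much smaller.
for step in (1, 10, 100, 1000, 5000):
    print(f"step {step}: lr = {linear_warmup_lr(step):.2e}")
```

The choice of 1000 warmup steps and a constant rate afterwards is only for illustration; in practice warmup is often followed by a decay schedule.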

q-04-05 — Forward-mode vs reverse-mode AD

1. When m >> n; forward-mode does O(n) sweeps per output.
2. When n >> m; reverse-mode needs O(m) sweeps for the full Jacobian, vs forward-mode's O(n).
3. Reverse-mode is always faster.
4. Identical complexity.

Answer

**Choice 2.** Here n is the number of inputs and m the number of outputs of f: R^n → R^m. Loss functions are R^n → R (m=1, n in the millions), so reverse-mode is the natural choice: one backward pass instead of n forward sweeps.
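
The sweep counts can be seen in a toy implementation. The sketch below is a teaching aid, not a real AD library: `Dual`, `Var`, `backward`, and the example function `f` are all hypothetical, and only `+` and `*` are supported. Forward mode carries one directional derivative per pass, so a full gradient of a scalar function of n inputs takes n passes; reverse mode records the computation once and propagates from the single output.

```python
class Dual:
    """Forward-mode value: carries f and one directional derivative."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, o):
        return Dual(self.val + o.val, self.dot + o.dot)
    def __mul__(self, o):
        return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)

class Var:
    """Reverse-mode value: records parents and local partial derivatives."""
    def __init__(self, val, parents=()):
        self.val, self.parents, self.grad = val, parents, 0.0
    def __add__(self, o):
        return Var(self.val + o.val, ((self, 1.0), (o, 1.0)))
    def __mul__(self, o):
        return Var(self.val * o.val, ((self, o.val), (o, self.val)))

def backward(out):
    """One reverse sweep: visit nodes in reverse topological order."""
    order, seen = [], set()
    def visit(v):
        if id(v) not in seen:
            seen.add(id(v))
            for parent, _ in v.parents:
                visit(parent)
            order.append(v)
    visit(out)
    out.grad = 1.0
    for v in reversed(order):
        for parent, local in v.parents:
            parent.grad += local * v.grad

def f(xs):
    """Scalar function R^n -> R: sum of x_i squared (works for Dual and Var)."""
    total = xs[0] * xs[0]
    for x in xs[1:]:
        total = total + x * x
    return total

x = [1.0, 2.0, 3.0, 4.0]

# Forward mode: one sweep per input direction.
forward_grad, forward_sweeps = [], 0
for i in range(len(x)):
    inputs = [Dual(v, 1.0 if j == i else 0.0) for j, v in enumerate(x)]
    forward_grad.append(f(inputs).dot)
    forward_sweeps += 1

# Reverse mode: one recorded evaluation plus one backward sweep.
variables = [Var(v) for v in x]
backward(f(variables))
reverse_grad, reverse_sweeps = [v.grad for v in variables], 1

print("forward:", forward_grad, "sweeps:", forward_sweeps)
print("reverse:", reverse_grad, "sweeps:", reverse_sweeps)
```

Both modes produce the same gradient here; only the number of passes differs, and that gap grows linearly with n when m = 1.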