English · Español
03 — Optimizers, derived from "what memory do we have?"¶
🇪🇸 Cada optimizador se construye añadiendo una pieza de memoria al paso anterior: SGD = solo el gradiente actual; Momentum = media móvil del gradiente; Adam = media móvil de g + media móvil de g²; AdamW = Adam + weight decay desacoplado.
This page derives the optimizer hierarchy as a sequence of "add one ingredient at a time," not as a memorised table.
Plain gradient descent (GD)¶
Given a differentiable loss L(θ), the gradient descent update is:
η is the learning rate.
Why it works (intuition): the gradient points in the direction of steepest ascent; subtracting it moves toward steepest descent. For small enough η, each step decreases L.
Why it's slow (in practice): for an anisotropic loss (long thin valley), the gradient points mostly across the valley (toward the steep walls), not along it (toward the minimum). Each step zigzags. Phase 3's condition-number argument explains the geometry; the next optimizers fix the symptom.
Stochastic gradient descent (SGD)¶
In ML, computing the full gradient ∇L(θ) requires the full dataset — expensive. Instead, we use a stochastic estimate from a mini-batch:
The update equation looks identical to GD:
But g_t is noisy — it's an unbiased estimate of ∇L(θ_t) with variance that scales as 1 / |B|. The noise actually helps escape saddle points (Phase 16 will revisit). The dominant cost-per-step is now O(|B|), not O(N).
Momentum¶
The intuition: if past gradients agreed on a direction, that direction is probably right. Average them.
Define a velocity vector v_t as an exponential moving average of gradients:
β ∈ [0, 1) is the momentum coefficient (commonly 0.9). Then:
Derivation. Solving the recurrence:
A weighted sum of past gradients, with recent ones weighted higher. At β = 0.9, the half-life is log(0.5) / log(0.9) ≈ 6.6 steps — so v_t is dominated by the last ~10 gradients.
Why it helps: in a long thin valley, the gradient across the valley alternates sign step-to-step (zigzag) → cancels in the average. The gradient along the valley consistently points the same way → reinforces. Net effect: momentum suppresses zigzag, amplifies the consistent direction.
Failure mode: if you blow past a minimum (overshoot), momentum carries you further. Hence the β < 1 decay — eventually momentum dies out.
Nesterov momentum¶
A variant where the gradient is evaluated at the predicted future position:
(then θ_{t+1} = θ_t - η v_t). The "look-ahead" gives slightly better convergence in convex problems; in practice, the difference vs plain momentum is small for neural networks. Mention once; default to plain momentum.
AdaGrad — per-parameter scaling¶
The intuition: parameters that have historically seen big gradients should have small learning rates (they're already moving fast); parameters with small history should have big learning rates.
Define a per-parameter "history of squared gradient":
(element-wise). Then:
Properties:
- Parameters with consistent gradients accumulate large
s→ smaller effective LR. - Parameters with rare gradients (e.g., sparse features) keep large effective LR.
s_tmonotonically increases, so effective LR monotonically decreases. Eventually too small to make progress. This is the AdaGrad decay problem; RMSProp fixes it.
Used in early sparse-feature ML (text classification with bag-of-words). Rarely the right choice for deep learning.
RMSProp — exponential moving average of g²¶
Replace the cumulative sum with an EMA:
β_2 ≈ 0.999. Update:
s_t is now an EMA of squared gradients — no monotonic increase, no decay-to-zero LR problem.
Properties: same per-parameter scaling intuition as AdaGrad, but s_t "forgets" old gradients. Works much better in practice.
Adam — Momentum + RMSProp + bias correction¶
Adam combines the two ideas:
-
First moment (EMA of gradient — like Momentum): $\(m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t\)$
-
Second moment (EMA of squared gradient — like RMSProp): $\(v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2\)$
-
Bias correction. At early steps (
t = 1, 2, 3), the EMAs are biased toward zero because they started at zero. Correct:
$\(\hat{m}_t = m_t / (1 - \beta_1^t), \qquad \hat{v}_t = v_t / (1 - \beta_2^t)\)$
- Update:
$\(\theta_{t+1} = \theta_t - \eta \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)\)$
Default hyperparameters: β_1 = 0.9, β_2 = 0.999, ε = 1e-8. Memorize.
Why bias correction matters¶
At t = 1: m_1 = (1 - β_1) g_1 = 0.1 g_1. Without correction, the effective update is η · 0.1 g_1 / sqrt(0.001 g_1²) — η · 0.1 / sqrt(0.001) ≈ η · 3.16. The first step is 3× larger than intended.
With correction: m̂_1 = m_1 / (1 - 0.9) = g_1. v̂_1 = v_1 / 0.001 = g_1². The first update becomes η · g_1 / sqrt(g_1²) = η · sign(g_1). Unit-step size, as intended.
As t grows, β^t → 0, so 1 - β^t → 1, so the correction has no effect. It only matters at the start.
Why Adam is "the default"¶
Two practical reasons:
-
Auto-tuned per-parameter LR.
1/sqrt(v̂)approximates a diagonal preconditioner — each parameter gets a step size scaled by its recent gradient magnitude. The loss surface becomes "approximately round" in Adam's coordinates. -
Robust to hyperparameter choice. Adam with
lr=1e-3works in a much wider range of settings than SGD with any singlelr. For research code where tuning is expensive, Adam is the safe choice.
For frontier training (large LLM pre-training), variants of AdamW (next) plus careful LR schedules dominate.
AdamW — decoupled weight decay¶
L2 weight decay (a.k.a. weight regularisation) is the loss term (λ/2) ||θ||². Its gradient is λθ.
The "Adam with L2" approach adds λθ to the gradient g_t before computing m_t, v_t. This is wrong in a subtle way: the decay term flows through the EMAs and the per-parameter scaling, so heavily-decayed parameters get their decay attenuated by the 1/sqrt(v̂) divisor.
AdamW decouples the decay: apply it directly to θ, not to the gradient.
Two effects:
- The
λθ_tterm is applied uniformly to all parameters, independent ofv̂. - The Adam update and the decay update are added (or, equivalently, the decay is implicit in
θ_t ← (1 - ηλ) θ_tbefore the Adam step).
Empirically, AdamW gives better generalisation than Adam-with-L2. The Loshchilov-Hutter paper (2017) made this the standard.
Default λ: 0.01 for transformers, 0.1 for some vision models. Tune per problem.
Why 1/sqrt(v̂) instead of 1/v̂?¶
If you used 1/v̂ directly, the update would have units of 1/g, which doesn't make dimensional sense — the update should have units of θ, which means same units as g · learning_rate. Dividing by sqrt(g²) = |g| gives a unit-step update modulated by m̂, which is correct.
(Adagrad and RMSProp also use sqrt(s); same dimensional argument.)
Worked: stepping through Adam for two iterations¶
Setup: θ_0 = 1.0, η = 0.1, β_1 = 0.9, β_2 = 0.999, ε = 1e-8. Loss L = θ²; gradient g_t = 2 θ_t.
Step 1. θ_0 = 1.0, g_1 = 2.0.
m_1 = 0.9 · 0 + 0.1 · 2.0 = 0.2v_1 = 0.999 · 0 + 0.001 · 4.0 = 0.004m̂_1 = 0.2 / (1 - 0.9) = 2.0v̂_1 = 0.004 / (1 - 0.999) = 4.0θ_1 = 1.0 - 0.1 · 2.0 / (sqrt(4.0) + 1e-8) = 1.0 - 0.1 · 2.0 / 2.0 = 1.0 - 0.1 = 0.9
Step 2. θ_1 = 0.9, g_2 = 1.8.
m_2 = 0.9 · 0.2 + 0.1 · 1.8 = 0.18 + 0.18 = 0.36v_2 = 0.999 · 0.004 + 0.001 · 3.24 = 0.003996 + 0.00324 = 0.007236m̂_2 = 0.36 / (1 - 0.81) = 1.8947...v̂_2 = 0.007236 / (1 - 0.998001) ≈ 3.620...θ_2 = 0.9 - 0.1 · 1.8947 / (sqrt(3.620) + ε) ≈ 0.9 - 0.1 · 1.8947 / 1.9026 ≈ 0.9 - 0.0996 ≈ 0.8004
Compare to SGD with η = 0.1: θ_2_sgd = 0.9 - 0.1 · 1.8 = 0.72. SGD moves faster here because the gradient is large and not anisotropic. Adam's adaptive scaling shines only when gradients are anisotropic — which is exactly the case for real neural networks.
You'll redo this exercise in lab 02 on Rosenbrock, which is anisotropic.
Drill problems¶
Solutions in solutions/03-optimizers-derived-ref.md at phase open.
- Derive Momentum from "I want to take an exponential moving average of past gradients."
- Show that AdaGrad's effective LR for a parameter monotonically decreases. Show RMSProp's does not.
- For Adam, show that bias correction recovers the unbiased estimate of
E[g]andE[g²]ifgis i.i.d. (Hint: take expectations ofm_tandv_t.) - Derive AdamW's "shrink-then-step" interpretation: show that
θ_t ← (1 - ηλ) θ_t(decay) followed by Adam's normal step is equivalent to the unified equation given above. - Why does Adam-with-L2 fail to apply equal weight decay across parameters with different
v̂? - Suppose
β_1 = 0(no first-moment EMA). Adam reduces to what previously-defined optimizer?
One-paragraph recap¶
Each optimizer adds one piece of memory to the one before: SGD has only the current gradient; Momentum adds a running mean of past gradients; AdaGrad adds a running sum of squared gradients for per-parameter LR; RMSProp replaces the sum with an EMA (fixing AdaGrad's decay-to-zero problem); Adam combines Momentum and RMSProp with bias correction for early-step stability; AdamW decouples weight decay from the gradient. Each addition is motivated by a failure mode of the previous version, and each can be derived from "what running average do I need to compute?" The Phase 4 lab implements all four (SGD, Momentum, Adam, AdamW) as pure functions over a state dict and races them on the Rosenbrock function to see the geometric behaviour.
Next: theory/04-lr-schedules-and-warmup.md.