English · Español

03 — Optimizers, derived from "what memory do we have?"¶

🇪🇸 Cada optimizador se construye añadiendo una pieza de memoria al paso anterior: SGD = solo el gradiente actual; Momentum = media móvil del gradiente; Adam = media móvil de g + media móvil de g²; AdamW = Adam + weight decay desacoplado.

This page derives the optimizer hierarchy as a sequence of "add one ingredient at a time," not as a memorised table.

Plain gradient descent (GD)¶

Given a differentiable loss L(θ), the gradient descent update is:

\[ \theta_{t+1} = \theta_t - \eta \nabla L(\theta_t) \]

η is the learning rate.

Why it works (intuition): the gradient points in the direction of steepest ascent; subtracting it moves toward steepest descent. For small enough η, each step decreases L.

Why it's slow (in practice): for an anisotropic loss (long thin valley), the gradient points mostly across the valley (toward the steep walls), not along it (toward the minimum). Each step zigzags. Phase 3's condition-number argument explains the geometry; the next optimizers fix the symptom.

Stochastic gradient descent (SGD)¶

In ML, computing the full gradient ∇L(θ) requires the full dataset — expensive. Instead, we use a stochastic estimate from a mini-batch:

\[ g_t = \frac{1}{|B|} \sum_{i \in B} \nabla \ell(x_i, \theta_t) \]

The update equation looks identical to GD:

\[ \theta_{t+1} = \theta_t - \eta g_t \]

But g_t is noisy — it's an unbiased estimate of ∇L(θ_t) with variance that scales as 1 / |B|. The noise actually helps escape saddle points (Phase 16 will revisit). The dominant cost-per-step is now O(|B|), not O(N).

Momentum¶

The intuition: if past gradients agreed on a direction, that direction is probably right. Average them.

Define a velocity vector v_t as an exponential moving average of gradients:

\[ v_t = \beta v_{t-1} + g_t \]

β ∈ [0, 1) is the momentum coefficient (commonly 0.9). Then:

\[ \theta_{t+1} = \theta_t - \eta v_t \]

Derivation. Solving the recurrence:

\[ v_t = \sum_{k=0}^{t} \beta^{t-k} g_k \]

A weighted sum of past gradients, with recent ones weighted higher. At β = 0.9, the half-life is log(0.5) / log(0.9) ≈ 6.6 steps — so v_t is dominated by the last ~10 gradients.

Why it helps: in a long thin valley, the gradient across the valley alternates sign step-to-step (zigzag) → cancels in the average. The gradient along the valley consistently points the same way → reinforces. Net effect: momentum suppresses zigzag, amplifies the consistent direction.

Failure mode: if you blow past a minimum (overshoot), momentum carries you further. Hence the β < 1 decay — eventually momentum dies out.

Nesterov momentum¶

A variant where the gradient is evaluated at the predicted future position:

\[ v_t = \beta v_{t-1} + \nabla L(\theta_t - \eta \beta v_{t-1}) \]

(then θ_{t+1} = θ_t - η v_t). The "look-ahead" gives slightly better convergence in convex problems; in practice, the difference vs plain momentum is small for neural networks. Mention once; default to plain momentum.

AdaGrad — per-parameter scaling¶

The intuition: parameters that have historically seen big gradients should have small learning rates (they're already moving fast); parameters with small history should have big learning rates.

Define a per-parameter "history of squared gradient":

\[ s_t = s_{t-1} + g_t^2 \]

(element-wise). Then:

\[ \theta_{t+1} = \theta_t - \eta \cdot g_t / (\sqrt{s_t} + \epsilon) \]

Properties:

Parameters with consistent gradients accumulate large s → smaller effective LR.
Parameters with rare gradients (e.g., sparse features) keep large effective LR.
s_t monotonically increases, so effective LR monotonically decreases. Eventually too small to make progress. This is the AdaGrad decay problem; RMSProp fixes it.

Used in early sparse-feature ML (text classification with bag-of-words). Rarely the right choice for deep learning.

RMSProp — exponential moving average of `g²`¶

Replace the cumulative sum with an EMA:

\[ s_t = \beta_2 s_{t-1} + (1 - \beta_2) g_t^2 \]

β_2 ≈ 0.999. Update:

\[ \theta_{t+1} = \theta_t - \eta \cdot g_t / (\sqrt{s_t} + \epsilon) \]

s_t is now an EMA of squared gradients — no monotonic increase, no decay-to-zero LR problem.

Properties: same per-parameter scaling intuition as AdaGrad, but s_t "forgets" old gradients. Works much better in practice.

Adam — Momentum + RMSProp + bias correction¶

Adam combines the two ideas:

First moment (EMA of gradient — like Momentum): $$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$
Second moment (EMA of squared gradient — like RMSProp): $$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$
Bias correction. At early steps (t = 1, 2, 3), the EMAs are biased toward zero because they started at zero. Correct:

$$\hat{m}_t = m_t / (1 - \beta_1^t), \qquad \hat{v}_t = v_t / (1 - \beta_2^t)$$

Update:

$$\theta_{t+1} = \theta_t - \eta \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$$

Default hyperparameters: β_1 = 0.9, β_2 = 0.999, ε = 1e-8. Memorize.

Why bias correction matters¶

At t = 1: m_1 = (1 - β_1) g_1 = 0.1 g_1. Without correction, the effective update is η · 0.1 g_1 / sqrt(0.001 g_1²) — η · 0.1 / sqrt(0.001) ≈ η · 3.16. The first step is 3× larger than intended.

With correction: m̂_1 = m_1 / (1 - 0.9) = g_1. v̂_1 = v_1 / 0.001 = g_1². The first update becomes η · g_1 / sqrt(g_1²) = η · sign(g_1). Unit-step size, as intended.

As t grows, β^t → 0, so 1 - β^t → 1, so the correction has no effect. It only matters at the start.

Why Adam is "the default"¶

Two practical reasons:

Auto-tuned per-parameter LR. 1/sqrt(v̂) approximates a diagonal preconditioner — each parameter gets a step size scaled by its recent gradient magnitude. The loss surface becomes "approximately round" in Adam's coordinates.
Robust to hyperparameter choice. Adam with lr=1e-3 works in a much wider range of settings than SGD with any single lr. For research code where tuning is expensive, Adam is the safe choice.

For frontier training (large LLM pre-training), variants of AdamW (next) plus careful LR schedules dominate.

AdamW — decoupled weight decay¶

L2 weight decay (a.k.a. weight regularisation) is the loss term (λ/2) ||θ||². Its gradient is λθ.

The "Adam with L2" approach adds λθ to the gradient g_t before computing m_t, v_t. This is wrong in a subtle way: the decay term flows through the EMAs and the per-parameter scaling, so heavily-decayed parameters get their decay attenuated by the 1/sqrt(v̂) divisor.

AdamW decouples the decay: apply it directly to θ, not to the gradient.

\[ \theta_{t+1} = \theta_t - \eta \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon) - \eta \lambda \theta_t \]

Two effects:

The λθ_t term is applied uniformly to all parameters, independent of v̂.
The Adam update and the decay update are added (or, equivalently, the decay is implicit in θ_t ← (1 - ηλ) θ_t before the Adam step).

Empirically, AdamW gives better generalisation than Adam-with-L2. The Loshchilov-Hutter paper (2017) made this the standard.

Default λ: 0.01 for transformers, 0.1 for some vision models. Tune per problem.

Why `1/sqrt(v̂)` instead of `1/v̂`?¶

If you used 1/v̂ directly, the update would have units of 1/g, which doesn't make dimensional sense — the update should have units of θ, which means same units as g · learning_rate. Dividing by sqrt(g²) = |g| gives a unit-step update modulated by m̂, which is correct.

(Adagrad and RMSProp also use sqrt(s); same dimensional argument.)

Worked: stepping through Adam for two iterations¶

Setup: θ_0 = 1.0, η = 0.1, β_1 = 0.9, β_2 = 0.999, ε = 1e-8. Loss L = θ²; gradient g_t = 2 θ_t.

Step 1. θ_0 = 1.0, g_1 = 2.0.

m_1 = 0.9 · 0 + 0.1 · 2.0 = 0.2
v_1 = 0.999 · 0 + 0.001 · 4.0 = 0.004
m̂_1 = 0.2 / (1 - 0.9) = 2.0
v̂_1 = 0.004 / (1 - 0.999) = 4.0
θ_1 = 1.0 - 0.1 · 2.0 / (sqrt(4.0) + 1e-8) = 1.0 - 0.1 · 2.0 / 2.0 = 1.0 - 0.1 = 0.9

Step 2. θ_1 = 0.9, g_2 = 1.8.

m_2 = 0.9 · 0.2 + 0.1 · 1.8 = 0.18 + 0.18 = 0.36
v_2 = 0.999 · 0.004 + 0.001 · 3.24 = 0.003996 + 0.00324 = 0.007236
m̂_2 = 0.36 / (1 - 0.81) = 1.8947...
v̂_2 = 0.007236 / (1 - 0.998001) ≈ 3.620...
θ_2 = 0.9 - 0.1 · 1.8947 / (sqrt(3.620) + ε) ≈ 0.9 - 0.1 · 1.8947 / 1.9026 ≈ 0.9 - 0.0996 ≈ 0.8004

Compare to SGD with η = 0.1: θ_2_sgd = 0.9 - 0.1 · 1.8 = 0.72. SGD moves faster here because the gradient is large and not anisotropic. Adam's adaptive scaling shines only when gradients are anisotropic — which is exactly the case for real neural networks.

You'll redo this exercise in lab 02 on Rosenbrock, which is anisotropic.

Drill problems¶

Solutions in solutions/03-optimizers-derived-ref.md at phase open.

Derive Momentum from "I want to take an exponential moving average of past gradients."
Show that AdaGrad's effective LR for a parameter monotonically decreases. Show RMSProp's does not.
For Adam, show that bias correction recovers the unbiased estimate of E[g] and E[g²] if g is i.i.d. (Hint: take expectations of m_t and v_t.)
Derive AdamW's "shrink-then-step" interpretation: show that θ_t ← (1 - ηλ) θ_t (decay) followed by Adam's normal step is equivalent to the unified equation given above.
Why does Adam-with-L2 fail to apply equal weight decay across parameters with different v̂?
Suppose β_1 = 0 (no first-moment EMA). Adam reduces to what previously-defined optimizer?

One-paragraph recap¶

Each optimizer adds one piece of memory to the one before: SGD has only the current gradient; Momentum adds a running mean of past gradients; AdaGrad adds a running sum of squared gradients for per-parameter LR; RMSProp replaces the sum with an EMA (fixing AdaGrad's decay-to-zero problem); Adam combines Momentum and RMSProp with bias correction for early-step stability; AdamW decouples weight decay from the gradient. Each addition is motivated by a failure mode of the previous version, and each can be derived from "what running average do I need to compute?" The Phase 4 lab implements all four (SGD, Momentum, Adam, AdamW) as pure functions over a state dict and races them on the Rosenbrock function to see the geometric behaviour.

Next: theory/04-lr-schedules-and-warmup.md.