
02 — AdamW + warmup + cosine decay + gradient clipping

Four pieces that look like details and decide whether your loss curve is a smooth slope or a sawtooth. Here we derive each one from Phase 4, assemble them, and see why the modern recipe works in this order and not another.

Phase 4 derived the optimizer math from scratch. Phase 9 implemented SGD and Adam in minitorch/optim.py. This file is the implementation reference: it restates the equations in the exact form Borja will type into src/minitrain/loop.py, in the exact order they apply, with each variable named the way the code names them.

AdamW — the equations as you write them

For each parameter tensor $\theta$ with gradient $g_t$ at step $t$:

\[ m_t \leftarrow \beta_1 \, m_{t-1} + (1 - \beta_1) \, g_t \]

\[ v_t \leftarrow \beta_2 \, v_{t-1} + (1 - \beta_2) \, g_t^2 \]

\[ \hat m_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat v_t = \frac{v_t}{1 - \beta_2^t} \]

\[ \theta_t \leftarrow \theta_{t-1} - \eta_t \left( \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} + \lambda \, \theta_{t-1} \right) \]

Default hyperparameters (recommended for Phase 18 unless your runs say otherwise):

$\beta_1 = 0.9$ — first-moment decay
$\beta_2 = 0.95$ — second-moment decay (modern choice; original Adam used 0.999, modern LLM training uses 0.95)
$\epsilon = 10^{-8}$
$\lambda = 0.1$ — weight decay coefficient
$\eta_{\max} = 3 \times 10^{-4}$ — peak LR (after warmup)
$\eta_{\min} = 3 \times 10^{-5}$ — floor LR at end of cosine decay

Three things that look like details but aren't

1. $\beta_2 = 0.95$ vs $0.999$. With a 600-form corpus and ~103k params, the second-moment estimate $v_t$ converges fast. $\beta_2 = 0.999$ takes ~1000 steps to "see" any given parameter's gradient magnitude. $\beta_2 = 0.95$ takes ~20. We don't have 1000 steps to spare — the entire training run is on the order of a few thousand steps.

2. Bias correction $\hat m_t = m_t / (1 - \beta_1^t)$ is not optional. At step $t=1$, $m_1 = (1 - \beta_1) g_1 = 0.1 g_1$. If the update uses $m_t$ in place of $\hat m_t$, the first step moves $\theta$ by 10× less than it should, and the shortfall factor $1 - \beta_1^t$ only passes 0.9 at step 22. Linear warmup hides this somewhat, but the optimizer must still bias-correct internally. Common bug: implementing the update with $m_t$ instead of $\hat m_t$ and being surprised that "warmup is too aggressive". It's not the schedule; it's the optimizer's own moment estimate that hasn't warmed up.

3. $\lambda \theta_{t-1}$ is decoupled weight decay. The "AdamW" in the name (vs vanilla Adam) is the decoupling: the weight decay term is added to the update, not to the gradient. Coupling weight decay into the gradient (g_t += λ θ_{t-1}) makes AdamW collapse to Adam-with-L2-reg and breaks the geometry — see Loshchilov & Hutter (2019). Phase 4's theory/04-optimizers.md already derived this; if it's hazy, re-read that file.
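The three points above can be checked numerically. The sketch below is illustrative only: the helper names (`ema_horizon`, `uncorrected_fraction`, `first_step`) are hypothetical and not part of minitorch or minitrain, and the weight-decay comparison assumes a single scalar weight with zero loss gradient so that only the decay term acts.

```python
import math

BETA1, BETA2 = 0.9, 0.95
LR, WEIGHT_DECAY, EPS = 3e-4, 0.1, 1e-8


def ema_horizon(beta):
    """Rough number of steps an EMA with decay `beta` averages over: 1 / (1 - beta)."""
    return 1.0 / (1.0 - beta)


def uncorrected_fraction(beta, t):
    """Fraction of the true moment present in the raw EMA at step t."""
    return 1.0 - beta ** t


def first_step(theta, loss_grad, coupled):
    """One step at t = 1 on a scalar weight, starting from zero moments.

    coupled=True folds weight decay into the gradient (Adam with L2);
    coupled=False adds it to the update (AdamW).
    """
    g = loss_grad + WEIGHT_DECAY * theta if coupled else loss_grad
    m_hat = (1 - BETA1) * g / (1 - BETA1)      # bias-corrected first moment
    v_hat = (1 - BETA2) * g * g / (1 - BETA2)  # bias-corrected second moment
    update = m_hat / (math.sqrt(v_hat) + EPS)
    if not coupled:
        update += WEIGHT_DECAY * theta
    return theta - LR * update


print("horizon for beta2 = 0.999:", ema_horizon(0.999))
print("horizon for beta2 = 0.95: ", ema_horizon(0.95))
print("fraction of m present at t = 1:", uncorrected_fraction(BETA1, 1))
print("decoupled decay, theta after one step:", first_step(1.0, 0.0, coupled=False))
print("coupled decay,   theta after one step:", first_step(1.0, 0.0, coupled=True))
```

Under these assumptions the coupled variant shrinks the weight by about $\eta$ regardless of $\lambda$, because the decay term is divided by $\sqrt{\hat v_t}$ along with the gradient, while the decoupled variant shrinks it by $\eta \lambda \theta$.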

Cosine schedule with linear warmup

Two regimes:

Warmup for the first $W$ steps: $$\eta_t = \eta_{\max} \cdot \frac{t}{W}, \quad t \in [0, W)$$
Cosine decay for $t \in [W, T]$: $$\eta_t = \eta_{\min} + \tfrac{1}{2} (\eta_{\max} - \eta_{\min}) \left( 1 + \cos\frac{\pi (t - W)}{T - W} \right)$$

Where $T$ is total training steps.

Defaults:

$W = 100$ (about 5% of total steps)
$T = 2000$ (about 50 epochs over the 240-form train set with batch size ~6)
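The two regimes can be written as one small function. This is a minimal sketch using the defaults above; the name `lr_at` is hypothetical, and the clamp that holds the rate at $\eta_{\min}$ for steps past $T$ is an added safeguard, not part of the formula.

```python
import math


def lr_at(t, lr_max=3e-4, lr_min=3e-5, warmup=100, total=2000):
    """Learning rate at step t: linear warmup, then cosine decay to lr_min."""
    if t < warmup:
        return lr_max * t / warmup
    # Clamp so steps past `total` stay at lr_min instead of climbing back up the cosine.
    progress = min(1.0, (t - warmup) / (total - warmup))
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


for step in (0, 50, 100, 1050, 2000):
    print(step, f"{lr_at(step):.2e}")
```

Printing a handful of steps like this is a cheap way to confirm the schedule peaks exactly at step $W$ and lands on $\eta_{\min}$ at step $T$ before wiring it into a training loop.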

Why warmup is non-optional for transformers

At step 0, the model is randomly initialized. Loss is high (about $\ln V$ where $V$ is vocab size; for $V = 512$, that's ~6.2). The gradient is large and poorly conditioned: the loss surface is far from its local quadratic approximation, so a normal-sized LR step overshoots wildly. Without warmup:

Step 1: after bias correction $\hat m_1 / \sqrt{\hat v_1} = \pm 1$ elementwise, so every weight is pushed by about $\eta_{\max}$ in a direction set by a single noisy batch.
Step 2: gradients explode because the model is now far from anywhere reasonable.
NaN by step 50.

This failure mode is bug #2 of the three Phase-19 engineered breaks. You'll see it on the dashboard. Warmup eliminates it by ramping $\eta$ linearly from 0 to $\eta_{\max}$ over $W$ steps, giving the optimizer time to estimate $v_t$ (the per-parameter scale) before taking full-size steps.
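One way to see why the learning rate alone sets the size of the first update: at $t = 1$ the bias-corrected moments reduce to $\hat m_1 = g_1$ and $\hat v_1 = g_1^2$, so the normalized update is approximately the sign of the gradient. A minimal numerical check, using made-up gradient values and ignoring weight decay:

```python
import numpy as np

beta1, beta2, eps = 0.9, 0.95, 1e-8

# Made-up gradients spanning several orders of magnitude.
g1 = np.array([1e-3, -0.5, 20.0, -3000.0])

# Moments start at zero, so the first EMA update is just (1 - beta) * g.
m1 = (1 - beta1) * g1
v1 = (1 - beta2) * g1 ** 2

# Bias correction at t = 1.
m_hat = m1 / (1 - beta1 ** 1)
v_hat = v1 / (1 - beta2 ** 1)

normalized_update = m_hat / (np.sqrt(v_hat) + eps)
print(normalized_update)  # each entry is close to +1 or -1
```

So with no warmup, the very first step moves every weight by about $\eta_{\max}$, however large or small its gradient was.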

Why cosine specifically

Three alternative schedules:

Constant (no decay): can match cosine for short runs but loses 1-3% PPL on long runs (the LR is "too high" near the end, preventing fine convergence).
Linear decay: matches cosine within 1% but the LR drops too fast near the end.
Step decay: discontinuities in the LR cause loss spikes at the transitions.

Cosine combines smooth decay (no spikes) with a slow tail (small LR for many late steps, allowing fine convergence). It's not magic — it's a reasonable curve shape. Phase 4 plotted all of them.
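The "slow tail" can be made concrete by measuring how much of the decay phase each curve spends near the floor. The sketch below is one possible measure, not the comparison Phase 4 used: the function names and the 10% threshold are arbitrary choices for illustration.

```python
import math


def cosine_fraction(p):
    """Fraction of (lr_max - lr_min) remaining at decay progress p in [0, 1]."""
    return 0.5 * (1.0 + math.cos(math.pi * p))


def linear_fraction(p):
    """Same quantity for a straight-line decay."""
    return 1.0 - p


def tail_share(fraction_fn, threshold=0.1, n=10_000):
    """Share of the decay phase spent with less than `threshold` of the LR range left."""
    below = sum(1 for i in range(n + 1) if fraction_fn(i / n) < threshold)
    return below / (n + 1)


print(f"cosine tail share: {tail_share(cosine_fraction):.3f}")
print(f"linear tail share: {tail_share(linear_fraction):.3f}")
```

By this measure cosine spends roughly twice as much of the decay phase at a small learning rate as linear decay does.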

Gradient clipping

After computing gradients, before the optimizer step, clip the global L2 norm:

\[ \|g\|_2 = \sqrt{\sum_{\text{all params}} \|g_\theta\|_F^2} \]

If $\|g\|_2 > c$ (where $c$ is the clip threshold, default $c = 1.0$):

\[ g \leftarrow g \cdot \frac{c}{\|g\|_2} \]

This rescales all gradient tensors uniformly. Per-tensor clipping is wrong: it changes the direction of the update across parameters, not just the magnitude. Global-norm clipping preserves direction.
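A small sketch of the difference, assuming gradients are stored as a dict of NumPy arrays keyed by parameter name. The helper names and the two-tensor example are hypothetical; `clip_per_tensor` is included only for contrast.

```python
import numpy as np


def clip_global_norm(grads, c=1.0):
    """Scale every gradient by one shared factor so the global L2 norm is at most c.

    Returns the clipped gradients and the pre-clip global norm (useful for logging).
    """
    g_norm = float(np.sqrt(sum(float((g * g).sum()) for g in grads.values())))
    factor = min(1.0, c / (g_norm + 1e-12))
    return {name: g * factor for name, g in grads.items()}, g_norm


def clip_per_tensor(grads, c=1.0):
    """Clip each tensor's norm separately (shown only for contrast)."""
    out = {}
    for name, g in grads.items():
        n = float(np.linalg.norm(g))
        out[name] = g * min(1.0, c / (n + 1e-12))
    return out


def norm_ratio(d):
    """Relative size of the two tensors' gradients."""
    return float(np.linalg.norm(d["w_big"]) / np.linalg.norm(d["w_small"]))


# Hypothetical two-tensor model: one large gradient, one small.
grads = {"w_big": np.array([3.0, 4.0]), "w_small": np.array([0.3, 0.4])}

clipped, norm_before = clip_global_norm(grads)
per_tensor = clip_per_tensor(grads)

print("global norm before clipping:", norm_before)
print("ratio before:", norm_ratio(grads))
print("ratio after global clip:", norm_ratio(clipped))
print("ratio after per-tensor clip:", norm_ratio(per_tensor))
```

Global clipping keeps the ratio between the two tensors unchanged, while per-tensor clipping shrinks only the large one and so tilts the overall update toward the small one.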

Why clip?

Two reasons:

Defends against rare outlier batches. Most batches have $\|g\|_2 < 1$. Occasionally a batch with a very confidently-wrong prediction produces $\|g\|_2 \approx 50$. That single step destabilizes the optimizer (the moments now think the typical gradient is 50× larger than it is, and future steps are starved for size). Clipping prevents one bad batch from poisoning the moment estimates.
Cheap insurance. $c = 1.0$ is rarely exceeded in healthy training. When it is, you want to know — log $\|g\|_2$ every step and watch for spikes. Phase 19's dashboard plots this.

The clip threshold $c$ is a hyperparameter, but $c = 1.0$ is almost always fine. Setting $c < 0.1$ silently throttles training; $c > 10$ doesn't actually clip.
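The "poisoned moments" effect is easy to reproduce on a toy stream. The sketch below tracks a single scalar parameter, so the global norm is just $|g|$; the gradient stream is made up, and `update_sizes` is a hypothetical helper rather than anything in minitrain.

```python
import math

BETA1, BETA2, EPS = 0.9, 0.95, 1e-8


def update_sizes(grads, clip=None):
    """Return |m_hat / (sqrt(v_hat) + eps)| per step for a scalar gradient stream.

    With one scalar parameter, global-norm clipping reduces to capping |g| at `clip`.
    """
    m = v = 0.0
    sizes = []
    for t, g in enumerate(grads, start=1):
        if clip is not None and abs(g) > clip:
            g = g * clip / abs(g)
        m = BETA1 * m + (1 - BETA1) * g
        v = BETA2 * v + (1 - BETA2) * g * g
        m_hat = m / (1 - BETA1 ** t)
        v_hat = v / (1 - BETA2 ** t)
        sizes.append(abs(m_hat / (math.sqrt(v_hat) + EPS)))
    return sizes


# Made-up stream: 200 ordinary steps with g = 1, one outlier of 50, then 100 ordinary steps.
stream = [1.0] * 200 + [50.0] + [1.0] * 100

unclipped = update_sizes(stream)
clipped = update_sizes(stream, clip=1.0)

# Index 200 is the outlier step, so [201:] covers the steps that follow it.
print(f"smallest update after the outlier, no clipping: {min(unclipped[201:]):.2f}")
print(f"smallest update after the outlier, clip = 1.0:  {min(clipped[201:]):.2f}")
```

In this toy stream the unclipped optimizer's normalized update drops to roughly a quarter of its usual size for dozens of steps after the outlier, because $v_t$ forgets the spike more slowly than $m_t$ does. With clipping the outlier never enters the moments and the update size stays flat.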

Putting it together: the optimizer step

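import math
import numpy as np

# Method of the optimizer class. Assumes __init__ set self.t = 0, zero-filled
# self.m / self.v dicts keyed like params, and the hyperparameters used below.
def step(self, params, grads):
    self.t += 1

    # Global L2 norm over every gradient tensor.
    g_norm_sq = sum((g * g).sum() for g in grads.values())
    g_norm = np.sqrt(g_norm_sq)

    # 1. clip: one shared factor, so the update direction is preserved
    clip_factor = min(1.0, self.clip / (g_norm + 1e-12))

    # 2. learning rate: linear warmup, then cosine decay
    if self.t < self.warmup:
        lr = self.lr_max * self.t / self.warmup
    else:
        # Clamp so steps past self.total hold lr_min instead of rising again.
        progress = min(1.0, (self.t - self.warmup) / (self.total - self.warmup))
        lr = self.lr_min + 0.5 * (self.lr_max - self.lr_min) * (1 + math.cos(math.pi * progress))

    # 3. AdamW update per parameter
    for name, p in params.items():
        g = grads[name] * clip_factor  # moments only ever see the clipped gradient
        self.m[name] = self.beta1 * self.m[name] + (1 - self.beta1) * g
        self.v[name] = self.beta2 * self.v[name] + (1 - self.beta2) * (g * g)
        m_hat = self.m[name] / (1 - self.beta1 ** self.t)
        v_hat = self.v[name] / (1 - self.beta2 ** self.t)
        # Decoupled weight decay: added to the update, not to g.
        # In-place so the caller's array is modified.
        p -= lr * (m_hat / (np.sqrt(v_hat) + self.eps) + self.weight_decay * p)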

The order is: norm → clip → schedule → moment update → bias correct → step. Get the order wrong and you'll see one of:

Clipping after the moment update: the moments still see the unclipped gradient, so one outlier batch still distorts $m_t$ and $v_t$ for the steps that follow.
Bias correction skipped or applied to $m$ but not $v$: asymmetric warmup that biases the early updates.
Weight decay applied to gradients instead of update: AdamW collapses to Adam-with-L2.

Drill problems

AdamW with $\beta_1 = 0.9$, $\beta_2 = 0.95$. At step $t = 10$, the bias correction factors are $(1 - 0.9^{10}) \approx 0.65$ and $(1 - 0.95^{10}) \approx 0.40$. What fraction of the "true" first moment is in $m_{10}$? What about $v_{10}$? Why are these so different?
The warmup is $W = 100$ and $\eta_{\max} = 3 \times 10^{-4}$. At step 25, what is $\eta_{25}$?
The full training is $T = 2000$ steps with $W = 100$. At step 1500, what is $\eta_{1500}$? (Cosine progresses by $(1500 - 100) / (2000 - 100) = 0.737$ through the decay; $\cos(\pi \cdot 0.737) \approx -0.68$.)
The global gradient norm at step 50 is 12.0, clip threshold is 1.0. The gradient tensor for layer 3's MLP has Frobenius norm 4.0 before clipping. What's its norm after?

If all four are crisp, move on.

One-paragraph recap

AdamW + linear warmup + cosine decay + global-norm clipping is the modern recipe. AdamW differs from Adam by decoupling weight decay into the update, not the gradient. Warmup linearly ramps $\eta$ from 0 to $\eta_{\max}$ over the first $W$ steps so the optimizer can estimate $v_t$ before taking full-size updates. Cosine decay smoothly drops $\eta$ to $\eta_{\min}$ over the remaining steps, allowing fine convergence at the end. Global L2-norm clipping with $c = 1.0$ prevents a single bad batch from poisoning the moment estimates. The implementation order is norm → clip → schedule → moment update → bias correct → step, and getting it wrong silently breaks one of the four pieces.

What this section does NOT cover

EMA (exponential moving average of weights). Stubbed in Phase 18, real implementation in Phase 26+.
Layerwise LR / parameter-group LRs. Phase 28 (LoRA) uses these.
Lookahead / Lion / other modern optimizers. Outside scope.
Loss-scale-tied scheduling for fp16. Phase 26.

Next: theory/03-mixed-precision-preview.md.