04 — Learning-rate schedules and why warmup actually helps

The learning rate is not a number; it is a function of the step. Constant, step, cosine, warmup. And the key question: why does warmup help? Answer: because of the conditioning of the Hessian early in training.

The five schedules you must know

Let η_t denote the learning rate at step t, with η_0 the base / peak LR and T_warm, T_total defined by the run.

Constant

\[\eta_t = \eta_0\]

The dumbest schedule. Used as a sanity baseline. Almost never optimal for transformer training.

Step decay

\[\eta_t = \eta_0 \cdot \gamma^{\lfloor t / T_\text{step} \rfloor}\]

Multiply by γ < 1 every T_step steps. Common: γ = 0.1 every T_step = 1000. Discrete drops; visible "shelves" in the loss curve.

Cosine decay (without restarts)

\[\eta_t = \eta_\text{min} + \tfrac{1}{2}(\eta_0 - \eta_\text{min})\left(1 + \cos\left(\frac{\pi \cdot t}{T_\text{total}}\right)\right)\]

Smooth decay from η_0 to η_min over T_total steps. The modern default for non-warmup training; popularised by SGDR (Loshchilov & Hutter 2017).
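The three formulas so far translate almost directly into code. The following is a minimal sketch, not a reference implementation: the function names and default arguments are illustrative, and holding the LR at η_min for t > T_total is an assumption, since the formula above does not define that range.

```python
import math


def constant_lr(t: int, eta0: float) -> float:
    """eta_t = eta0 for every step t."""
    return eta0


def step_decay_lr(t: int, eta0: float, gamma: float = 0.1, t_step: int = 1000) -> float:
    """eta_t = eta0 * gamma ** floor(t / t_step)."""
    return eta0 * gamma ** (t // t_step)


def cosine_lr(t: int, eta0: float, eta_min: float, t_total: int) -> float:
    """Half-period cosine from eta0 (at t = 0) down to eta_min (at t = t_total)."""
    t = min(t, t_total)  # assumption: hold at eta_min once the schedule has ended
    return eta_min + 0.5 * (eta0 - eta_min) * (1.0 + math.cos(math.pi * t / t_total))
```

Each function takes the step index and returns a plain float, so any of them can be evaluated once per step and written into the optimizer's learning rate.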

Linear warmup → cosine

\[\eta_t = \begin{cases} \eta_0 \cdot \dfrac{t}{T_\text{warm}} & t < T_\text{warm} \\ \text{cosine schedule from } t = T_\text{warm} & t \ge T_\text{warm} \end{cases}\]

The transformer default. T_warm is typically 1-10% of T_total. This is the one used in Phase 18.
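A sketch of this schedule in code. It assumes that "cosine schedule from t = T_warm" means the cosine spans the remaining T_total - T_warm steps, which is a common reading but not the only one. The function name and the `eta_min` parameter are illustrative.

```python
import math


def warmup_cosine_lr(t: int, eta0: float, eta_min: float, t_warm: int, t_total: int) -> float:
    """Linear ramp from 0 to eta0 over t_warm steps, then cosine decay to eta_min."""
    if t < t_warm:
        # Warmup branch of the piecewise formula: eta0 * t / t_warm.
        return eta0 * t / t_warm
    # Assumption: the cosine runs over the steps left after warmup.
    progress = (t - t_warm) / max(1, t_total - t_warm)
    progress = min(progress, 1.0)  # hold at eta_min past t_total
    return eta_min + 0.5 * (eta0 - eta_min) * (1.0 + math.cos(math.pi * progress))
```

Taken literally, the formula gives an LR of exactly 0 at t = 0, so the very first update is a no-op. Some implementations ramp with (t + 1) / T_warm to avoid that; this sketch stays with the formula as written.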

WSD (Warmup-Stable-Decay)

\[\eta_t = \begin{cases} \eta_0 \cdot \dfrac{t}{T_\text{warm}} & t < T_\text{warm} \\ \eta_0 & T_\text{warm} \le t < T_\text{stable} \\ \text{cosine or linear to } 0 & t \ge T_\text{stable} \end{cases}\]

Warmup, then a long constant phase, then a sharp decay. Popularised by MiniCPM (2024). Because the stable phase does not depend on T_total, you can publish a checkpoint at the end of it without committing to a total step count, then resume and decay later. Mentioned briefly here; in production we use linear-warmup → cosine.
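A minimal sketch of WSD, assuming the linear variant of the decay and assuming the decay ends at a step `t_total`. The formula above does not name the end of the decay, so that parameter is an addition made for illustration, as is the function name.

```python
def wsd_lr(t: int, eta0: float, t_warm: int, t_stable: int, t_total: int) -> float:
    """Warmup-Stable-Decay: linear ramp, flat plateau, then linear decay to 0.

    Requires t_warm <= t_stable < t_total. Here t_stable is the step at which
    the plateau ends, matching T_stable in the formula.
    """
    if t < t_warm:
        return eta0 * t / t_warm
    if t < t_stable:
        return eta0
    # Assumption: linear decay that reaches 0 at t_total (cosine is the other option).
    remaining = (t_total - t) / (t_total - t_stable)
    return eta0 * max(0.0, remaining)
```

For example, with `t_warm=500`, `t_stable=9000`, `t_total=10000`, the LR sits at η_0 for 8,500 steps and only the last 1,000 steps decay.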

The headline question: why warmup?

If η_0 is set well (small enough for stable training, large enough to be efficient), why ramp up? Why not just start at η_0?

The answer has three layers, peeled off in order of subtlety.

Layer 1 — Random initialization is noisy

At step 0, the weights are random. The gradients on the first few batches have a poor signal-to-noise ratio: they are gradients of the loss at random weights, and they vary a lot from batch to batch. Taking a large step in a direction dominated by noise is bad.

Warmup says: "Use small steps until the signal-to-noise ratio of the gradient improves."

This is the intuitive answer. It's not wrong, but it's incomplete.

Layer 2 — The Hessian's condition number is bad early

The Hessian H = ∇² loss measures the curvature of the loss surface. The ratio of its largest to smallest eigenvalue κ = λ_max / λ_min is the condition number.

A well-conditioned loss surface (κ small) tolerates large steps. A poorly conditioned one (κ large) does not: plain gradient descent is stable only when η < 2 / λ_max, so an LR large enough to make progress along the flat λ_min direction overshoots and diverges along the sharp λ_max direction.

Early in training, the Hessian is dominated by initialization-induced curvature — sharp valleys in directions where the random weights happen to align. The condition number is enormous. By a few hundred steps, the weights have moved into a flatter region; κ drops dramatically.

Warmup gives the surface time to flatten before taking big steps. This is the Hessian-conditioning view.
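The stability limit behind this view can be checked on a toy problem. The sketch below runs plain gradient descent on the quadratic loss 0.5 · (λ_max · x² + λ_min · y²), where each step multiplies x by (1 - η · λ_max) and y by (1 - η · λ_min). The eigenvalues (100 and 1) and the function name are illustrative assumptions, not measurements from a real network.

```python
def run_gd(eta: float, lam_max: float = 100.0, lam_min: float = 1.0, steps: int = 50):
    """Gradient descent on 0.5 * (lam_max * x**2 + lam_min * y**2), starting at (1, 1)."""
    x, y = 1.0, 1.0
    for _ in range(steps):
        x -= eta * lam_max * x  # sharp direction: factor (1 - eta * lam_max) per step
        y -= eta * lam_min * y  # flat direction: factor (1 - eta * lam_min) per step
    return x, y


# Stability threshold for the sharp direction is 2 / lam_max = 0.02.
x_ok, y_ok = run_gd(eta=0.019)    # just below the threshold: x shrinks toward 0
x_bad, y_bad = run_gd(eta=0.021)  # just above the threshold: |x| grows every step
print(abs(x_ok), abs(x_bad), y_ok)
```

Note that y barely moves in either run: the LR is capped by the sharp direction, so the flat direction makes slow progress. The larger κ is, the worse this trade-off gets, which is why a large LR is risky while the surface is still badly conditioned.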

Layer 3 — Adam's variance estimate needs to stabilize

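Adam divides each parameter's update by sqrt(v̂_t) + ε, where v_t is the EMA of squared gradients and m_t is the EMA of gradients. Both start at zero: m_0 = v_0 = 0. The bias corrections m̂_t = m_t / (1 - β_1^t) and v̂_t = v_t / (1 - β_2^t) fix the expected magnitude, but for small t both estimates still rest on only a handful of gradient samples. At t = 1 they reduce to m̂_1 = g_1 and v̂_1 = g_1², so the update is approximately η · sign(g_1): every parameter moves by the full η, whatever the size of its gradient.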

A large LR + a single noisy m̂_t direction = potentially catastrophic update.

Warmup gives Adam's running averages time to fill with enough samples that the direction is reliable.

This view is specific to adaptive optimizers. It doesn't apply to vanilla SGD.
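To see how little the first Adam step knows, here is a sketch of the standard bias-corrected update for a single scalar parameter at t = 1. The function name is illustrative and the hyperparameter defaults are assumptions (the usual Adam defaults, plus the LR from this file).

```python
import math


def adam_first_update(grad: float, lr: float = 3e-4, beta1: float = 0.9,
                      beta2: float = 0.999, eps: float = 1e-8) -> float:
    """Size of Adam's update at t = 1, starting from m_0 = v_0 = 0."""
    m = (1.0 - beta1) * grad         # m_1
    v = (1.0 - beta2) * grad * grad  # v_1
    m_hat = m / (1.0 - beta1 ** 1)   # bias correction gives back grad
    v_hat = v / (1.0 - beta2 ** 1)   # bias correction gives back grad ** 2
    return lr * m_hat / (math.sqrt(v_hat) + eps)


# Gradients six orders of magnitude apart produce nearly the same step size.
for g in (1e-3, 1.0, 1e3):
    print(g, adam_first_update(g))
```

With one sample, the normaliser cancels the gradient's magnitude entirely, so the step size is set by the LR alone. Keeping the LR small for the first steps bounds the damage while the running averages fill in.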

For transformer training (Adam-based), Layers 2 and 3 reinforce each other. Both matter.

Picking T_warm and η_0

Rule of thumb from the transformer literature:

  • T_warm: 1-10% of total steps. For Phase 18 with ~10,000 training steps, use T_warm = 500.
  • η_0 (peak LR): set by the largest LR that doesn't explode in a short test run. Then back off ~2×. For Mini-GPT at our scale, η_0 ≈ 3e-4 is a reasonable starting point; tune later.

These are heuristics. Phase 19 (training dynamics) provides the empirical methods for tuning them.
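The η_0 heuristic ("largest LR that doesn't explode, then back off ~2×") can be sketched as a small sweep. Everything here is illustrative: `short_run_diverges` is a hypothetical stand-in that runs gradient descent on a one-dimensional quadratic, and in practice it would be replaced by a short run of the real training loop.

```python
import math


def short_run_diverges(lr: float, steps: int = 100, curvature: float = 50.0) -> bool:
    """Toy stand-in for a short test run: GD on 0.5 * curvature * w**2."""
    w = 1.0
    for _ in range(steps):
        w -= lr * curvature * w
        if not math.isfinite(w) or abs(w) > 1e6:
            return True  # treat blow-up as divergence
    return False


def pick_peak_lr(candidates, backoff: float = 2.0) -> float:
    """Largest candidate LR that survives the short run, divided by `backoff`."""
    stable = [lr for lr in candidates if not short_run_diverges(lr)]
    if not stable:
        raise ValueError("every candidate LR diverged; try smaller values")
    return max(stable) / backoff


# Log-spaced candidates; on this toy problem the stability limit is 2 / 50 = 0.04.
peak = pick_peak_lr([1e-3, 3e-3, 1e-2, 3e-2, 1e-1])
print(peak)
```

On a real model the divergence check would look at the training loss (NaN, or a sharp rise) rather than a single weight, and the candidate grid would be centred on a value like the 3e-4 suggested above.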

A useful intuition: LR × batch-size scaling

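Empirically, for many architectures: doubling the batch size lets you ~double the LR (the "linear scaling rule," Goyal et al. 2017). The reasoning: k small-batch SGD steps at LR η move the weights about as far as one step at LR kη on a batch k times larger, provided the gradient changes little over those k steps. A noise argument gives a more conservative answer: the standard deviation of the gradient estimate shrinks as 1/√B, which suggests scaling the LR by only √B. The linear rule is the simpler approximation, and it works for moderate batch sizes.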

For Phase 18 with batch size 16, you probably won't need this. For Phase 35 (distributed), you will.
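As a quick sketch, both scalings are one-liners. The function name and the `rule` argument are illustrative; which rule fits a given optimizer and batch-size range is an empirical question, so treat the output as a starting point for tuning.

```python
def scale_lr(base_lr: float, base_batch: int, new_batch: int, rule: str = "linear") -> float:
    """Rescale an LR tuned at base_batch for use at new_batch."""
    ratio = new_batch / base_batch
    if rule == "linear":
        return base_lr * ratio         # linear scaling rule
    if rule == "sqrt":
        return base_lr * ratio ** 0.5  # square-root scaling from the noise argument
    raise ValueError(f"unknown rule: {rule!r}")


# Going from batch size 16 to 64 (a 4x increase):
print(scale_lr(3e-4, 16, 64, rule="linear"))
print(scale_lr(3e-4, 16, 64, rule="sqrt"))
```

The two rules disagree by a factor of √(ratio), so the gap widens as the batch grows. That is one reason the linear rule is only trusted for moderate batch sizes.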

What about restarts and "cosine with restarts"?

SGDR (Loshchilov & Hutter 2017) proposed restarting the cosine schedule periodically — drop back to η_0 and decay again. Helps escape sharp minima in some settings.

For modern transformer training: mostly not used. Single cosine cycle with warmup is the dominant pattern. We won't implement restarts.

Plotting the schedules

Lab 03 in this phase has you plot all five on the same axes. Expected shapes:

  • Constant: flat horizontal line.
  • Step: descending staircase.
  • Cosine: smooth half-period of cos from peak to floor.
  • Linear warmup → cosine: ramp up, then cosine.
  • WSD: ramp, flat plateau, then sharp drop.

When you draw these, the warmup-then-cosine and WSD shapes should look visibly different from the others: they are the only two that start near zero. That shape is the point. They keep the LR small while the curvature is high early on, and spend most of the step budget near the peak LR once the surface has flattened.

Why warmup is not "stop training in chaos"

A subtle misreading: "warmup means we don't really start training until step T_warm." Wrong. We are training during warmup; the small steps still move the loss. What we're avoiding is catastrophic divergence from a large step on a poorly-conditioned surface.

The loss curve during warmup should still go down — slower than during the peak phase, but down. If your loss is flat during warmup, you have a bug, not a schedule.

What this file does NOT cover

  • Convergence proofs for any of these schedules. Out of scope.
  • Schedule-free optimizers (Defazio et al. 2024). Frontier — not used here.
  • One-cycle policy (Smith). Mentioned in passing; not used.

Next: ../lab/00-derive-softmax-gradient.md