Lab 03: Implement and plot five learning-rate schedules
Five curves on the same axes: constant, step, cosine, warmup→cosine, WSD. You draw them, annotate them, and choose one for Phase 18.
Objective
Implement five LR schedule functions in `src/minigrad/schedules.py` and plot them on a single chart. Then pick, by eye, the one you'll use in Phase 18's training loop and justify the choice in 3-5 sentences.
Setup
- `matplotlib`, `numpy`.
- The theory file `04-lr-schedules-and-warmup.md` for the formulas.
Tasks
Part A: Implement the schedules
```python
import numpy as np


def constant(t, eta_0=1.0):
    # Same learning rate at every step.
    return eta_0


def step_decay(t, eta_0=1.0, step=1000, gamma=0.1):
    # Multiply by gamma once every `step` steps (floor division gives the staircase).
    return eta_0 * (gamma ** (t // step))


def cosine(t, eta_0=1.0, eta_min=0.0, T_total=10000):
    # Half-cosine from eta_0 at t = 0 down to eta_min at t = T_total.
    if t >= T_total:
        return eta_min
    return eta_min + 0.5 * (eta_0 - eta_min) * (1 + np.cos(np.pi * t / T_total))


def warmup_cosine(t, eta_0=1.0, eta_min=0.0, T_warm=500, T_total=10000):
    # Linear ramp from 0 to eta_0, then cosine over the remaining T_total - T_warm steps.
    if t < T_warm:
        return eta_0 * (t / T_warm)
    return cosine(t - T_warm, eta_0, eta_min, T_total - T_warm)


def wsd(t, eta_0=1.0, eta_min=0.0, T_warm=500, T_stable=7000, T_total=10000):
    # Warmup-stable-decay; T_stable is the step at which the stable phase ends.
    if t < T_warm:
        return eta_0 * (t / T_warm)
    if t < T_stable:
        return eta_0
    # Linear decay from eta_0 to eta_min over [T_stable, T_total]
    if t >= T_total:
        return eta_min
    frac = (t - T_stable) / (T_total - T_stable)
    return eta_0 * (1 - frac) + eta_min * frac
```
All take t and return η_t. Pure functions — no internal state.
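Because nothing is stored between calls, a schedule can be evaluated at any step in any order. A minimal sketch of what that buys you, assuming the schedule is used as a multiplier and scaled by a peak learning rate; `lr_at` and `peak_lr=3e-4` are hypothetical names and values chosen for illustration, not part of the lab's required API:

```python
def wsd(t, eta_0=1.0, eta_min=0.0, T_warm=500, T_stable=7000, T_total=10000):
    # Part A definition, repeated so the sketch runs on its own.
    if t < T_warm:
        return eta_0 * (t / T_warm)
    if t < T_stable:
        return eta_0
    if t >= T_total:
        return eta_min
    frac = (t - T_stable) / (T_total - T_stable)
    return eta_0 * (1 - frac) + eta_min * frac


def lr_at(t, peak_lr=3e-4):
    # Hypothetical helper: use the schedule as a multiplier in [0, 1]
    # and scale it by an assumed peak learning rate.
    return peak_lr * wsd(t)


# A run interrupted at step 8000 and resumed later sees exactly the
# learning rates an uninterrupted run would, because the schedule keeps
# no state between calls.
uninterrupted = [lr_at(t) for t in range(7990, 8010)]
resumed = [lr_at(t) for t in range(7990, 8000)] + [lr_at(t) for t in range(8000, 8010)]
assert uninterrupted == resumed
```

A stateful scheduler would need its counter saved in the checkpoint; here the step index alone is enough.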
Part B: Plot
```python
import matplotlib.pyplot as plt
import numpy as np

# Assumes the package under src/ is importable as `minigrad`.
from minigrad.schedules import constant, cosine, step_decay, warmup_cosine, wsd

T_total = 10000
ts = np.arange(T_total + 1)

# Evaluate every schedule at every step (eta_0 = 1.0 by default).
schedules = {
    "constant": [constant(t) for t in ts],
    "step (γ=0.1 every 1000 steps)": [step_decay(t) for t in ts],
    "cosine": [cosine(t, T_total=T_total) for t in ts],
    "warmup→cosine (T_warm=500)": [warmup_cosine(t, T_warm=500, T_total=T_total) for t in ts],
    "WSD (T_warm=500, T_stable=7000)": [wsd(t, T_warm=500, T_stable=7000, T_total=T_total) for t in ts],
}

plt.figure(figsize=(10, 6))
for name, vals in schedules.items():
    plt.plot(ts, vals, label=name, linewidth=2)
plt.xlabel("Step")
plt.ylabel("Learning rate (relative to η_0)")
plt.legend(loc="lower left")
plt.title("LR schedules (η_0 = 1.0, T_total = 10,000 steps)")
plt.grid(True, alpha=0.3)
plt.savefig("lr_schedules.png", dpi=150)
```
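The list comprehensions above call each schedule once per step, which is fine for 10,000 steps. For longer runs, one option is to evaluate a schedule on the whole step array at once. The sketch below does this for warmup→cosine with `np.where`; `warmup_cosine_vec` is a hypothetical name, and the scalar functions from Part A remain the reference implementation.

```python
import numpy as np


def warmup_cosine_vec(ts, eta_0=1.0, eta_min=0.0, T_warm=500, T_total=10000):
    # Hypothetical vectorized variant: takes an array of steps at once.
    ts = np.asarray(ts, dtype=float)
    warm = eta_0 * ts / T_warm
    # Clip so steps past T_total stay at eta_min instead of rising again.
    progress = np.clip((ts - T_warm) / (T_total - T_warm), 0.0, 1.0)
    decay = eta_min + 0.5 * (eta_0 - eta_min) * (1 + np.cos(np.pi * progress))
    # Pick the warmup branch before T_warm and the cosine branch after.
    return np.where(ts < T_warm, warm, decay)


ts = np.arange(10001)
vals = warmup_cosine_vec(ts)
```

The result can be passed to `plt.plot(ts, vals)` exactly like the list-based values.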
Part C: Annotate
For each schedule, add 2-3 sentences to experiments/04-lr-schedules/INTERPRETATION.md:
- Constant: When you'd use it. (Mostly: as a baseline.)
- Step decay: Why "shelves" appear in the loss curve. When the manual schedule beats automatic.
- Cosine: Smooth and parameter-free (given `T_total`). The default for non-warmup training.
- Warmup→Cosine: The transformer default. Hessian-conditioning argument (per theory 04).
- WSD: When you want to publish a checkpoint without committing to `T_total` in advance.
Part D: Pick one for Phase 18
In 3-5 sentences, state which schedule you'll use in Phase 18 (the Mini-GPT training loop). Justify by:
- The optimizer you're using (Adam, so warmup helps for the variance-estimate reason).
- The model size (small, but the argument still applies).
- The training duration (~10K steps, long enough for cosine decay to matter).
The expected answer: warmup→cosine with T_warm = 500. Confirm or argue otherwise.
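To make the expected pick concrete, the sketch below evaluates warmup→cosine with `T_warm = 500` and `T_total = 10000` at a few steps. The function inlines the Part A logic so the block runs on its own; the sample steps are illustrative choices.

```python
import math


def warmup_cosine(t, eta_0=1.0, eta_min=0.0, T_warm=500, T_total=10000):
    # Part A logic, inlined: linear ramp, then cosine over T_total - T_warm steps.
    if t < T_warm:
        return eta_0 * (t / T_warm)
    if t >= T_total:
        return eta_min
    progress = (t - T_warm) / (T_total - T_warm)
    return eta_min + 0.5 * (eta_0 - eta_min) * (1 + math.cos(math.pi * progress))


# Mid-warmup, end of warmup, midpoint of the decay, and the final step.
samples = {t: warmup_cosine(t) for t in (0, 250, 500, 5250, 10000)}
for t, lr in samples.items():
    print(f"step {t:>5}: {lr:.3f}")
# step     0: 0.000
# step   250: 0.500
# step   500: 1.000
# step  5250: 0.500
# step 10000: 0.000
```

With these defaults the warmup occupies only the first 5% of training, and the learning rate is back at half its peak halfway through the remaining 9,500 steps.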
Part E: Verify properties
Add unit tests in tests/test_schedules.py:
```python
# Assumes the package under src/ is importable as `minigrad`
# (for example via an editable install).
from minigrad.schedules import constant, cosine, step_decay, warmup_cosine, wsd


def test_constant_is_constant():
    assert constant(0) == constant(5000) == constant(99999)


def test_warmup_starts_at_zero():
    assert warmup_cosine(0, T_warm=500) == 0.0


def test_warmup_ends_at_peak():
    # At t = T_warm, the warmup phase ends; cosine starts at its peak
    assert abs(warmup_cosine(500, T_warm=500, T_total=10000) - 1.0) < 1e-9


def test_cosine_ends_at_eta_min():
    assert abs(cosine(10000, T_total=10000) - 0.0) < 1e-9


def test_step_decay_drops():
    assert step_decay(999) == 1.0
    assert step_decay(1000) == 0.1
    assert step_decay(1999) == 0.1
    # 0.1 ** 2 is not exactly 0.01 in floating point, so compare with a tolerance
    assert abs(step_decay(2000) - 0.01) < 1e-12


def test_wsd_three_phases():
    s = wsd(t=100, T_warm=500, T_stable=7000)  # warmup
    assert 0 < s < 1
    assert wsd(3000, T_warm=500, T_stable=7000) == 1.0  # stable
    assert wsd(10000, T_warm=500, T_stable=7000, T_total=10000) == 0.0  # decayed
```
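The tests above check individual points. One way to extend them is to check properties over the whole curve: no sudden jumps, no increase after warmup, and values that stay in range. The sketch below does this for warmup→cosine; the function is inlined so the block runs without the package, and the test names are suggestions rather than required deliverables.

```python
import math


def warmup_cosine(t, eta_0=1.0, eta_min=0.0, T_warm=500, T_total=10000):
    # Part A logic, inlined so this block runs without the package installed.
    if t < T_warm:
        return eta_0 * (t / T_warm)
    if t >= T_total:
        return eta_min
    progress = (t - T_warm) / (T_total - T_warm)
    return eta_min + 0.5 * (eta_0 - eta_min) * (1 + math.cos(math.pi * progress))


def test_warmup_cosine_has_no_jumps():
    # The largest step-to-step change should be the warmup slope, 1 / T_warm.
    vals = [warmup_cosine(t) for t in range(10001)]
    max_jump = max(abs(b - a) for a, b in zip(vals, vals[1:]))
    assert max_jump <= 1 / 500 + 1e-12


def test_warmup_cosine_never_increases_after_warmup():
    vals = [warmup_cosine(t) for t in range(500, 10001)]
    assert all(b <= a for a, b in zip(vals, vals[1:]))


def test_warmup_cosine_stays_in_range():
    assert all(0.0 <= warmup_cosine(t) <= 1.0 for t in range(10001))
```

The jump test would also catch the `T_total` mix-up in the inner cosine only indirectly, so it complements the endpoint checks rather than replacing them.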
Deliverable
- `src/minigrad/schedules.py`: the five functions.
- `tests/test_schedules.py`: the unit tests above (extended as needed).
- `experiments/04-lr-schedules/lr_schedules.png`: the chart.
- `experiments/04-lr-schedules/INTERPRETATION.md`: annotations and the Phase-18 pick.
- `manifest.json`.
Acceptance
- Plot renders cleanly with all five curves on the same axes.
- Each curve is visually distinct.
- The warmup→cosine and WSD curves visibly ramp up from `0` over `T_warm` steps.
- All unit tests pass.
- The "Phase 18 pick" is justified, not just stated.
Pitfalls
- `T_total` in cosine vs warmup→cosine. In `warmup_cosine`, the cosine runs over `T_total - T_warm` steps, because the first `T_warm` steps are warmup. If you pass `T_total` to the inner cosine instead, the decay only finishes at step `T_total + T_warm`, so the curve has not reached `eta_min` by step `T_total`.
- Integer division. `t // step` is what step decay wants. `t / step` would give a continuous exponential decay, not the staircase shape.
- Off-by-one at `t = T_warm`. Whether the boundary belongs to "warmup" or "cosine" is a convention choice. The Part A code assigns it to the cosine phase (warmup is `t < T_warm`). Be consistent.
- Plotting on a log-LR axis. Some references show LR on a log y-axis. For a side-by-side qualitative comparison, a linear axis is clearer.
- Forgetting WSD's three phases. WSD has three: warmup, stable, decay. Skipping the stable phase reduces it to warmup followed immediately by decay, which is essentially the warmup→cosine shape (with a linear rather than cosine decay in the Part A version).
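The first two pitfalls are easy to see numerically. The sketch below compares a correct and a deliberately wrong warmup→cosine at the final step, then contrasts floor and true division in step decay. `warmup_cosine_ok` and `warmup_cosine_buggy` are hypothetical names used only for this comparison.

```python
import math


def cosine(t, eta_0=1.0, eta_min=0.0, T_total=10000):
    if t >= T_total:
        return eta_min
    return eta_min + 0.5 * (eta_0 - eta_min) * (1 + math.cos(math.pi * t / T_total))


def warmup_cosine_ok(t, T_warm=500, T_total=10000):
    if t < T_warm:
        return t / T_warm
    # Correct: the cosine covers only the steps left after warmup.
    return cosine(t - T_warm, T_total=T_total - T_warm)


def warmup_cosine_buggy(t, T_warm=500, T_total=10000):
    if t < T_warm:
        return t / T_warm
    # Pitfall: the inner cosine is given T_total, so it only finishes
    # at step T_total + T_warm.
    return cosine(t - T_warm, T_total=T_total)


print(warmup_cosine_ok(10000))               # 0.0
print(round(warmup_cosine_buggy(10000), 4))  # 0.0062

# Floor vs true division in step decay (gamma = 0.1, step = 1000), at t = 500:
print(0.1 ** (500 // 1000))           # 1.0, still on the first shelf
print(round(0.1 ** (500 / 1000), 3))  # 0.316, already decaying
```

The leftover 0.0062 looks small on a linear plot, which is why an endpoint test catches this bug more reliably than the chart does.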
Stretch
- Add a one-cycle schedule (Smith): ramps up, peaks somewhere in the middle, ramps down. Plot.
- Add linear-decay-with-warmup (used by some BERT-style training). Plot.
- Demonstrate the effect: train a tiny model (or use a synthetic 2D function) with `warmup→cosine` vs `constant`. Show the loss curves diverge.
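As a possible starting point for the first two stretch items, here is a sketch of both shapes in the same pure-function style as Part A. All names and defaults (`linear_warmup_decay`, `one_cycle_triangular`, `eta_start`, `T_peak`) are assumptions made for illustration. The one-cycle function is a simplified triangular shape, not a faithful reproduction of the published policy.

```python
def linear_warmup_decay(t, eta_0=1.0, eta_min=0.0, T_warm=500, T_total=10000):
    # Linear ramp to eta_0, then a straight line down to eta_min at T_total.
    if t < T_warm:
        return eta_0 * (t / T_warm)
    if t >= T_total:
        return eta_min
    frac = (t - T_warm) / (T_total - T_warm)
    return eta_0 * (1 - frac) + eta_min * frac


def one_cycle_triangular(t, eta_max=1.0, eta_start=0.1, T_peak=4500, T_total=10000):
    # Simplified triangular cycle (assumed shape and defaults): linear rise
    # from eta_start to eta_max at T_peak, then linear fall back to
    # eta_start at T_total.
    if t >= T_total:
        return eta_start
    if t < T_peak:
        frac = t / T_peak
    else:
        frac = (T_total - t) / (T_total - T_peak)
    return eta_start + (eta_max - eta_start) * frac
```

Both take `t` and return the learning rate, so they can be added to the `schedules` dictionary in Part B without other changes.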
End of Phase 4 labs. Time to write PHASE_04_REPORT.md and prep for Phase 5.
Next: Phase 05 — Probability & Information Theory.