Lab 03: Implement and plot five learning-rate schedules

Five curves on the same axes: constant, step, cosine, warmup→cosine, WSD. You plot them, annotate them, and pick one for Phase 18.

Objective

Implement five LR schedule functions in src/minigrad/schedules.py and plot them on a single chart. Compare the curves by eye, pick the one you'll use in Phase 18's training loop, and justify the choice in 3-5 sentences.

Setup

matplotlib, numpy.
The theory file 04-lr-schedules-and-warmup.md for the formulas.

Tasks

Part A: Implement the schedules

import numpy as np

def constant(t, eta_0=1.0):
    return eta_0

def step_decay(t, eta_0=1.0, step=1000, gamma=0.1):
    return eta_0 * (gamma ** (t // step))

def cosine(t, eta_0=1.0, eta_min=0.0, T_total=10000):
    if t >= T_total:
        return eta_min
    return eta_min + 0.5 * (eta_0 - eta_min) * (1 + np.cos(np.pi * t / T_total))

def warmup_cosine(t, eta_0=1.0, eta_min=0.0, T_warm=500, T_total=10000):
    if t < T_warm:
        return eta_0 * (t / T_warm)
    return cosine(t - T_warm, eta_0, eta_min, T_total - T_warm)

def wsd(t, eta_0=1.0, eta_min=0.0, T_warm=500, T_stable=7000, T_total=10000):
    # T_stable is the absolute step at which the stable phase ends, not its length
    if t < T_warm:
        return eta_0 * (t / T_warm)      # linear warmup from 0 to eta_0
    if t < T_stable:
        return eta_0                     # stable plateau
    if t >= T_total:
        return eta_min                   # training budget exhausted
    # Linear decay from eta_0 to eta_min over [T_stable, T_total]
    frac = (t - T_stable) / (T_total - T_stable)
    return eta_0 * (1 - frac) + eta_min * frac

All take t and return η_t. Pure functions — no internal state.
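
Because the schedules are pure functions of t, an array version is easy to derive when you want to evaluate every step at once. The sketch below is one possible NumPy variant of warmup→cosine. The name warmup_cosine_vec is hypothetical and not part of the required deliverable, and the code assumes 0 < T_warm < T_total.

```python
import numpy as np

def warmup_cosine_vec(ts, eta_0=1.0, eta_min=0.0, T_warm=500, T_total=10000):
    """Hypothetical array version of warmup_cosine (assumes 0 < T_warm < T_total)."""
    ts = np.asarray(ts, dtype=float)
    # Linear ramp from 0 to eta_0 over the first T_warm steps.
    warm = eta_0 * ts / T_warm
    # Cosine progress in [0, 1]; clipping holds the value at eta_min once t >= T_total.
    progress = np.clip((ts - T_warm) / (T_total - T_warm), 0.0, 1.0)
    decay = eta_min + 0.5 * (eta_0 - eta_min) * (1.0 + np.cos(np.pi * progress))
    # Pick the ramp before T_warm and the cosine branch afterwards.
    return np.where(ts < T_warm, warm, decay)

lrs = warmup_cosine_vec(np.arange(10001))
print(lrs[0], lrs[250], lrs[500], lrs[10000])
```

The scalar functions above stay the reference implementation; a variant like this only saves the list comprehension when plotting.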

Part B: Plot

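import matplotlib.pyplot as plt
import numpy as np

# Assumes src/ is on the import path (e.g. an editable install of minigrad)
from minigrad.schedules import constant, step_decay, cosine, warmup_cosine, wsd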

T_total = 10000
ts = np.arange(T_total + 1)

schedules = {
    "constant": [constant(t) for t in ts],
    "step (γ=0.1 every 1000 steps)": [step_decay(t) for t in ts],
    "cosine": [cosine(t, T_total=T_total) for t in ts],
    "warmup→cosine (T_warm=500)": [warmup_cosine(t, T_warm=500, T_total=T_total) for t in ts],
    "WSD (T_warm=500, T_stable=7000)": [wsd(t, T_warm=500, T_stable=7000, T_total=T_total) for t in ts],
}

plt.figure(figsize=(10, 6))
for name, vals in schedules.items():
    plt.plot(ts, vals, label=name, linewidth=2)
plt.xlabel("Step")
plt.ylabel("Learning rate (relative to η_0)")
plt.legend(loc="lower left")
plt.title("LR schedules (η_0 = 1.0, T_total = 10,000 steps)")
plt.grid(True, alpha=0.3)
plt.savefig("lr_schedules.png", dpi=150)

Part C: Annotate

For each schedule, add 2-3 sentences to experiments/04-lr-schedules/INTERPRETATION.md:

Constant: When you'd use it. (Mostly: as a baseline.)
Step decay: Why "shelves" appear in the loss curve. When the manual schedule beats automatic.
Cosine: Smooth and parameter-free (given T_total). The default for non-warmup training.
Warmup→Cosine: The transformer default. Hessian-conditioning argument (per theory 04).
WSD: When you want to publish a checkpoint without committing to T_total in advance.
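
The WSD point can be checked numerically. The sketch below compares two hypothetical training budgets (10K and 20K steps, chosen only for illustration) and evaluates both WSD and plain cosine at step 5000. It repeats the Part A definitions, using math instead of numpy, so it runs on its own.

```python
import math

def cosine(t, eta_0=1.0, eta_min=0.0, T_total=10000):
    if t >= T_total:
        return eta_min
    return eta_min + 0.5 * (eta_0 - eta_min) * (1 + math.cos(math.pi * t / T_total))

def wsd(t, eta_0=1.0, eta_min=0.0, T_warm=500, T_stable=7000, T_total=10000):
    if t < T_warm:
        return eta_0 * (t / T_warm)
    if t < T_stable:
        return eta_0
    if t >= T_total:
        return eta_min
    frac = (t - T_stable) / (T_total - T_stable)
    return eta_0 * (1 - frac) + eta_min * frac

t = 5000  # a step inside the stable phase of both hypothetical runs

# Same step, two different budgets.
wsd_short = wsd(t, T_stable=7000, T_total=10000)
wsd_long = wsd(t, T_stable=17000, T_total=20000)
cos_short = cosine(t, T_total=10000)
cos_long = cosine(t, T_total=20000)

print(wsd_short == wsd_long)   # WSD: the LR at step 5000 does not depend on the budget
print(cos_short, cos_long)     # cosine: the LR at step 5000 changes with the budget
```

Under WSD a checkpoint taken during the stable phase is valid for either budget, while under cosine the whole trajectory up to that step depends on T_total.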

Part D: Pick one for Phase 18

In 3-5 sentences, state which schedule you'll use in Phase 18 (the Mini-GPT training loop). Justify by:

The optimizer you're using (Adam, so warmup helps for the variance-estimate reason).
The model size (small, but the argument still applies).
The training duration (~10K steps, long enough for cosine decay to matter).

The expected answer: warmup→cosine with T_warm = 500. Confirm or argue otherwise.
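
One way to make the Adam argument concrete is to look at the very first update. The sketch below uses the textbook Adam update for a scalar gradient, with commonly used default hyperparameters (lr=1e-3, β1=0.9, β2=0.999, ε=1e-8) as assumptions. The helper name adam_first_update is hypothetical, and whether this matches the exact argument in the theory file is also an assumption.

```python
import math

def adam_first_update(grad, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of Adam's first parameter update for a scalar gradient (illustrative helper)."""
    m = (1 - beta1) * grad            # first-moment estimate after one step
    v = (1 - beta2) * grad ** 2       # second-moment estimate after one step
    m_hat = m / (1 - beta1)           # bias correction at step 1
    v_hat = v / (1 - beta2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

# Gradients spanning six orders of magnitude.
updates = [adam_first_update(g) for g in (1e-3, 1.0, 1e3)]
print(updates)
```

At step 1 the bias-corrected ratio is essentially sign(g), so the update has size close to lr whatever the gradient magnitude, even though the second-moment estimate rests on a single sample. Warmup shrinks these early steps while that estimate settles.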

Part E: Verify properties

Add unit tests in tests/test_schedules.py:

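# Assumes src/ is on the import path (e.g. an editable install of minigrad)
from minigrad.schedules import constant, step_decay, cosine, warmup_cosine, wsd

def test_constant_is_constant():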
    assert constant(0) == constant(5000) == constant(99999)

def test_warmup_starts_at_zero():
    assert warmup_cosine(0, T_warm=500) == 0.0

def test_warmup_ends_at_peak():
    # At t = T_warm, the warmup phase ends; cosine starts at its peak
    assert abs(warmup_cosine(500, T_warm=500, T_total=10000) - 1.0) < 1e-9

def test_cosine_ends_at_eta_min():
    assert abs(cosine(10000, T_total=10000) - 0.0) < 1e-9

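def test_step_decay_drops():
    # Compare with a tolerance: 0.1 ** 2 is not exactly 0.01 in floating point
    assert step_decay(999) == 1.0
    assert abs(step_decay(1000) - 0.1) < 1e-9
    assert abs(step_decay(1999) - 0.1) < 1e-9
    assert abs(step_decay(2000) - 0.01) < 1e-9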

def test_wsd_three_phases():
    s = wsd(t=100, T_warm=500, T_stable=7000)   # warmup
    assert 0 < s < 1
    assert wsd(3000, T_warm=500, T_stable=7000) == 1.0   # stable
    assert wsd(10000, T_warm=500, T_stable=7000, T_total=10000) == 0.0  # decayed

Deliverable

src/minigrad/schedules.py — the five functions.
tests/test_schedules.py — the unit tests above (extended as needed).
experiments/04-lr-schedules/lr_schedules.png — the chart.
experiments/04-lr-schedules/INTERPRETATION.md — annotations and the Phase-18 pick.
manifest.json.

Acceptance

Plot renders cleanly with all five curves on the same axes.
Each curve is visually distinct.
The warmup→cosine and WSD curves visibly ramp up from 0 over T_warm steps.
All unit tests pass.
The "Phase 18 pick" is justified, not just stated.

Pitfalls

T_total in cosine vs warmup→cosine. In warmup_cosine, the cosine runs over T_total - T_warm steps (since the first T_warm steps are warmup). If you pass T_total directly to the inner cosine, the decay stretches to step T_total + T_warm, so the curve is still above eta_min when training ends at step T_total.
Integer division. t // step is what step-decay wants. t / step would give a continuous decay — not the staircase shape.
Off-by-one at t = T_warm. Whether the boundary belongs to "warmup" or "cosine" is a convention choice. Be consistent.
Plotting log-LR axis. Some references show LR on a log y-axis. For a side-by-side qualitative comparison, linear is clearer.
Forgetting WSD's three phases. WSD has three: warmup, stable, decay. Skipping the stable phase reduces it to warmup followed by a linear decay, which loses the flat plateau that distinguishes WSD from warmup→cosine.
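
The first pitfall is easy to reproduce. The sketch below puts the Part A warmup_cosine next to a deliberately wrong variant (warmup_cosine_buggy, a hypothetical name) that passes T_total to the inner cosine, and evaluates both at the final step. It uses math instead of numpy to stay self-contained.

```python
import math

def cosine(t, eta_0=1.0, eta_min=0.0, T_total=10000):
    if t >= T_total:
        return eta_min
    return eta_min + 0.5 * (eta_0 - eta_min) * (1 + math.cos(math.pi * t / T_total))

def warmup_cosine(t, eta_0=1.0, eta_min=0.0, T_warm=500, T_total=10000):
    if t < T_warm:
        return eta_0 * (t / T_warm)
    # Correct: the cosine spans the T_total - T_warm steps left after warmup.
    return cosine(t - T_warm, eta_0, eta_min, T_total - T_warm)

def warmup_cosine_buggy(t, eta_0=1.0, eta_min=0.0, T_warm=500, T_total=10000):
    if t < T_warm:
        return eta_0 * (t / T_warm)
    # Deliberately wrong: the cosine now needs T_total steps after warmup.
    return cosine(t - T_warm, eta_0, eta_min, T_total)

print(warmup_cosine(10000))        # reaches eta_min at step T_total
print(warmup_cosine_buggy(10000))  # still above eta_min at step T_total
print(warmup_cosine_buggy(10500))  # only reaches eta_min at T_total + T_warm
```

A unit test that checks the value at t = T_total catches this mistake directly.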

Stretch

Add a one-cycle schedule (Smith): ramps up, peaks somewhere in the middle, ramps down. Plot.
Add linear-decay-with-warmup (used by some BERT-style training). Plot.
Demonstrate the effect: train a tiny model (or use a synthetic 2D function) with warmup→cosine vs constant. Show the loss curves diverge.
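
As a possible starting point for the last stretch item, the sketch below runs noisy gradient descent on a set of independent 1-D quadratics, once with a constant LR and once with warmup→cosine. Everything about the setup is an assumption chosen for illustration: the toy objective, the noise level, the dimension, the peak LR of 0.1, and the shortened 2000-step budget. The helper run is hypothetical.

```python
import numpy as np

def warmup_cosine(t, eta_0=1.0, eta_min=0.0, T_warm=500, T_total=10000):
    if t < T_warm:
        return eta_0 * (t / T_warm)
    if t >= T_total:
        return eta_min
    progress = (t - T_warm) / (T_total - T_warm)
    return eta_min + 0.5 * (eta_0 - eta_min) * (1 + np.cos(np.pi * progress))

def run(schedule, steps=2000, dim=1000, noise=1.0, seed=0):
    """Noisy gradient descent on f(x_i) = 0.5 * x_i**2 for each coordinate (toy setup)."""
    rng = np.random.default_rng(seed)
    x = np.full(dim, 5.0)
    losses = []
    for t in range(steps):
        grad = x + noise * rng.standard_normal(dim)   # true gradient plus Gaussian noise
        x = x - schedule(t) * grad
        losses.append(0.5 * np.mean(x ** 2))          # loss averaged over coordinates
    return np.array(losses)

peak = 0.1
const_loss = run(lambda t: peak)
wc_loss = run(lambda t: warmup_cosine(t, eta_0=peak, T_warm=100, T_total=2000))

print(const_loss[-1], wc_loss[-1])
```

In this toy setting the constant-LR run should settle at a noise floor that scales with the LR, while the decaying schedule should end lower. Plotting both loss arrays on a log y-axis makes the separation visible.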

End of Phase 4 labs. Time to write PHASE_04_REPORT.md and prep for Phase 5.

Next: Phase 05 — Probability & Information Theory.