Lab 02 — Race SGD, Momentum, Adam, AdamW on Rosenbrock¶
Four optimizers on the same track: the Rosenbrock function, a narrow, curved valley. You animate the trajectories and see why Adam wins in ill-conditioned valleys.
Objective¶
Implement four optimizers as pure-function *_step routines in src/minigrad/optim.py. Run each on the Rosenbrock function f(x, y) = (1 - x)² + 100(y - x²)² from a common start, log the trajectory, and produce a single animated contour plot with all four trajectories overlaid.
Setup¶
- Phase 03 (linear algebra) and the theory files of Phase 04.
- The Rosenbrock function: non-convex, with a single global minimum at (1, 1) that sits inside a narrow, curved valley along y = x². It is the classic test for optimizer behaviour in poorly conditioned regions.
- matplotlib, numpy.
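To put a number on "poorly conditioned", the sketch below evaluates the Hessian of f at the minimum and takes the ratio of its largest to smallest eigenvalue. The helper name `rosenbrock_hessian` is introduced here for illustration only; it is not part of the lab's required API.

```python
import numpy as np

def rosenbrock_hessian(p):
    """Analytic Hessian of f(x, y) = (1 - x)^2 + 100 (y - x^2)^2 (illustrative helper)."""
    x, y = p
    fxx = 2 - 400 * (y - x**2) + 800 * x**2
    fxy = -400 * x
    fyy = 200.0
    return np.array([[fxx, fxy], [fxy, fyy]])

H = rosenbrock_hessian((1.0, 1.0))
eigvals = np.linalg.eigvalsh(H)   # ascending order for a symmetric matrix
cond = eigvals[-1] / eigvals[0]   # steepest curvature divided by flattest curvature
print(f"eigenvalues at (1, 1): {eigvals}")
print(f"condition number: {cond:.0f}")
```

The ratio comes out at roughly 2500: curvature across the valley is thousands of times larger than curvature along it, which is what forces a single global learning rate to be small.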
The Rosenbrock function¶
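import numpy as np

def rosenbrock(p):
    """f(x, y) = (1 - x)^2 + 100 (y - x^2)^2."""
    x, y = p
    return (1 - x)**2 + 100 * (y - x**2)**2

def rosenbrock_grad(p):
    """Analytic gradient [df/dx, df/dy]."""
    x, y = p
    dx = -2 * (1 - x) + 100 * 2 * (y - x**2) * (-2 * x)
    dy = 100 * 2 * (y - x**2)
    return np.array([dx, dy])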
Minimum at (1, 1) where f = 0. Start every optimizer at (-1.5, 2.0).
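Before racing anything, it may be worth confirming the analytic gradient against a finite-difference estimate. The sketch below does that at the shared start point; `numerical_grad` and the step size `h` are assumptions introduced for this check, not part of the lab code.

```python
import numpy as np

def rosenbrock(p):
    x, y = p
    return (1 - x)**2 + 100 * (y - x**2)**2

def rosenbrock_grad(p):
    x, y = p
    dx = -2 * (1 - x) + 100 * 2 * (y - x**2) * (-2 * x)
    dy = 100 * 2 * (y - x**2)
    return np.array([dx, dy])

def numerical_grad(f, p, h=1e-6):
    """Central-difference gradient estimate (hypothetical helper)."""
    p = np.asarray(p, dtype=float)
    g = np.zeros_like(p)
    for i in range(p.size):
        step = np.zeros_like(p)
        step[i] = h
        g[i] = (f(p + step) - f(p - step)) / (2 * h)
    return g

start = np.array([-1.5, 2.0])
analytic = rosenbrock_grad(start)
numeric = numerical_grad(rosenbrock, start)
max_abs_error = np.max(np.abs(analytic - numeric))
print("analytic:", analytic)
print("numeric: ", numeric)
print("max abs error:", max_abs_error)
```

A large mismatch here would point to a sign or factor error in `rosenbrock_grad`, which would otherwise show up later as an optimizer that mysteriously diverges.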
Tasks¶
Part A — Implement the four optimizers in src/minigrad/optim.py¶
Pure-function style with explicit state dicts.
import numpy as np

def sgd_step(params, grads, state, lr):
    """state is unused for vanilla SGD; kept for API uniformity."""
    return params - lr * grads, state

def momentum_step(params, grads, state, lr, beta=0.9):
    # v accumulates gradients without a (1 - beta) damping factor,
    # so under a constant gradient the effective step approaches lr / (1 - beta).
    v = state.get("v", np.zeros_like(params))
    v = beta * v + grads
    new_params = params - lr * v
    return new_params, {"v": v}

def adam_step(params, grads, state, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    m = state.get("m", np.zeros_like(params))
    v = state.get("v", np.zeros_like(params))
    t = state.get("t", 0) + 1
    m = beta1 * m + (1 - beta1) * grads       # first-moment estimate
    v = beta2 * v + (1 - beta2) * grads**2    # second-moment estimate
    m_hat = m / (1 - beta1**t)                # bias correction
    v_hat = v / (1 - beta2**t)
    new_params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return new_params, {"m": m, "v": v, "t": t}

def adamw_step(params, grads, state, lr, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.0):
    # Adam with decoupled weight decay: the decay acts on the weights directly
    # and never passes through the m / v moment estimates.
    new_params, new_state = adam_step(params, grads, state, lr, beta1, beta2, eps)
    new_params = new_params - lr * weight_decay * params
    return new_params, new_state
Part B — Race them on Rosenbrock¶
def run(opt_fn, lr, n_steps=2000, start=(-1.5, 2.0), **opt_kwargs):
    p = np.array(start, dtype=float)
    state = {}
    traj = [p.copy()]
    for _ in range(n_steps):
        g = rosenbrock_grad(p)
        p, state = opt_fn(p, g, state, lr, **opt_kwargs)
        traj.append(p.copy())
    return np.array(traj)
Run with per-optimizer-tuned learning rates (since they have different stable LR ranges):
trajectories = {
    "SGD": run(sgd_step, lr=2e-3),
    "Momentum": run(momentum_step, lr=2e-3, beta=0.9),
    "Adam": run(adam_step, lr=5e-2),
    "AdamW": run(adamw_step, lr=5e-2, weight_decay=0.0),
}
(AdamW with weight_decay=0 is identical to Adam — that's the sanity check.)
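The sanity check can be made explicit by running both optimizers and comparing the trajectories element-wise. The following is a minimal, self-contained sketch that repeats the Part A definitions so it runs on its own; the variable names `adam_traj`, `adamw_traj`, and `max_diff` are illustrative.

```python
import numpy as np

def rosenbrock_grad(p):
    x, y = p
    return np.array([-2 * (1 - x) - 400 * x * (y - x**2), 200 * (y - x**2)])

def adam_step(params, grads, state, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    m = state.get("m", np.zeros_like(params))
    v = state.get("v", np.zeros_like(params))
    t = state.get("t", 0) + 1
    m = beta1 * m + (1 - beta1) * grads
    v = beta2 * v + (1 - beta2) * grads**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    return params - lr * m_hat / (np.sqrt(v_hat) + eps), {"m": m, "v": v, "t": t}

def adamw_step(params, grads, state, lr, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.0):
    new_params, new_state = adam_step(params, grads, state, lr, beta1, beta2, eps)
    return new_params - lr * weight_decay * params, new_state

def run(opt_fn, lr, n_steps=2000, start=(-1.5, 2.0), **opt_kwargs):
    p = np.array(start, dtype=float)
    state, traj = {}, [p.copy()]
    for _ in range(n_steps):
        p, state = opt_fn(p, rosenbrock_grad(p), state, lr, **opt_kwargs)
        traj.append(p.copy())
    return np.array(traj)

adam_traj = run(adam_step, lr=5e-2)
adamw_traj = run(adamw_step, lr=5e-2, weight_decay=0.0)

# With weight_decay=0 the extra term is exactly zero at every step,
# so the two trajectories should agree to the last bit.
max_diff = np.max(np.abs(adam_traj - adamw_traj))
print("max element-wise difference:", max_diff)
```

If `max_diff` is not essentially zero, the decay term is probably leaking into the moment estimates or the state is not being threaded through correctly.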
Part C — Animate¶
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

# Contour
x_grid = np.linspace(-2, 2, 400)
y_grid = np.linspace(-1, 3, 400)
X, Y = np.meshgrid(x_grid, y_grid)
Z = (1 - X)**2 + 100 * (Y - X**2)**2

fig, ax = plt.subplots(figsize=(8, 6))
ax.contour(X, Y, Z, levels=np.logspace(-1, 3, 20), cmap="viridis", alpha=0.5)
ax.plot(1, 1, "r*", markersize=15, label="optimum")
lines = {name: ax.plot([], [], "-", label=name)[0] for name in trajectories}
ax.legend()
ax.set_xlim(-2, 2)
ax.set_ylim(-1, 3)

def init():
    for line in lines.values():
        line.set_data([], [])
    return list(lines.values())

def update(frame):
    for name, line in lines.items():
        traj = trajectories[name]
        # frame + 1 so frame 0 shows the start point and the last frame reaches the final iterate
        line.set_data(traj[:frame + 1, 0], traj[:frame + 1, 1])
    return list(lines.values())

anim = FuncAnimation(fig, update, init_func=init, frames=range(0, 2001, 20), interval=50, blit=True)
anim.save("rosenbrock-race.mp4", writer="ffmpeg")  # or .gif via PillowWriter
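Since ffmpeg is not always available, one option is to choose the output format at run time. The sketch below uses only the standard library; the helper name `pick_animation_output` is hypothetical, and it assumes ffmpeg is looked up on `PATH` and that the Pillow package is installed for the GIF fallback.

```python
import shutil

def pick_animation_output(stem="rosenbrock-race"):
    """Return (filename, writer_name) for FuncAnimation.save (hypothetical helper).

    Assumes ffmpeg, if installed, is discoverable on PATH.
    """
    if shutil.which("ffmpeg") is not None:
        return f"{stem}.mp4", "ffmpeg"
    # "pillow" is the registered name of matplotlib's PillowWriter
    return f"{stem}.gif", "pillow"

filename, writer = pick_animation_output()
print(f"saving to {filename} with writer={writer!r}")
# anim.save(filename, writer=writer)  # anim is the FuncAnimation built in Part C
```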
Part D — Interpret¶
Write a 200-word interpretation in experiments/04-optimizers-rosenbrock/INTERPRETATION.md. Address:
- SGD's path. Why does it oscillate in the steep direction?
- Momentum's path. Smoother, but does it overshoot? Where?
- Adam's path. Why does it navigate the narrow valley so much better? (Hint: per-parameter normalization by sqrt(v̂).)
- AdamW with weight_decay=0. Verify it tracks Adam exactly. Why is this a useful sanity check?
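The hint about normalization by sqrt(v̂) can be made concrete with the very first step from (-1.5, 2.0). The sketch below computes the displacement SGD and Adam would each take, using the Part B learning rates; it writes out the Adam formulas at t = 1 by hand and is meant as an illustration, not as a replacement for `adam_step`.

```python
import numpy as np

def rosenbrock_grad(p):
    x, y = p
    return np.array([-2 * (1 - x) - 400 * x * (y - x**2), 200 * (y - x**2)])

start = np.array([-1.5, 2.0])
g = rosenbrock_grad(start)   # gradient components differ by about 3x here

# SGD: displacement is proportional to the raw gradient,
# so the steeper coordinate dominates the step.
sgd_delta = -2e-3 * g

# Adam at t = 1: bias correction gives m_hat = g and v_hat = g**2,
# so each coordinate moves by about lr regardless of gradient scale.
lr, beta1, beta2, eps = 5e-2, 0.9, 0.999, 1e-8
m_hat = (1 - beta1) * g / (1 - beta1**1)
v_hat = (1 - beta2) * g**2 / (1 - beta2**1)
adam_delta = -lr * m_hat / (np.sqrt(v_hat) + eps)

print("gradient:  ", g)
print("SGD step:  ", sgd_delta)
print("Adam step: ", adam_delta)
```

SGD's step inherits the gradient's imbalance between coordinates, whereas Adam's first step has nearly equal magnitude in x and y. That per-coordinate rescaling is the mechanism the interpretation should discuss.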
Part E — A second sweep with weight_decay > 0¶
Run AdamW with weight_decay = 0.01 and weight_decay = 0.1. Add to the plot. Expected: heavier decay pulls the trajectory toward the origin, slowing convergence to (1, 1). Annotate.
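One way to see why decay keeps the iterates away from (1, 1): the Rosenbrock gradient vanishes at the minimum, so plain Adam stays put there, while the decoupled decay term still shrinks the weights by a factor of (1 - lr * weight_decay). A minimal sketch, repeating the Part A step functions so it runs on its own and using lr = 5e-2 with weight_decay = 0.1:

```python
import numpy as np

def rosenbrock_grad(p):
    x, y = p
    return np.array([-2 * (1 - x) - 400 * x * (y - x**2), 200 * (y - x**2)])

def adam_step(params, grads, state, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    m = state.get("m", np.zeros_like(params))
    v = state.get("v", np.zeros_like(params))
    t = state.get("t", 0) + 1
    m = beta1 * m + (1 - beta1) * grads
    v = beta2 * v + (1 - beta2) * grads**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    return params - lr * m_hat / (np.sqrt(v_hat) + eps), {"m": m, "v": v, "t": t}

def adamw_step(params, grads, state, lr, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.0):
    new_params, new_state = adam_step(params, grads, state, lr, beta1, beta2, eps)
    return new_params - lr * weight_decay * params, new_state

optimum = np.array([1.0, 1.0])
g = rosenbrock_grad(optimum)   # zero gradient at the minimum

adam_next, _ = adam_step(optimum, g, {}, lr=5e-2)
adamw_next, _ = adamw_step(optimum, g, {}, lr=5e-2, weight_decay=0.1)

print("Adam from (1, 1): ", adam_next)    # stays at the optimum
print("AdamW from (1, 1):", adamw_next)   # pulled toward the origin
```

So with weight_decay > 0 the point (1, 1) is no longer a resting point of the update, which is worth keeping in mind when annotating the plot.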
Deliverable¶
experiments/04-optimizers-rosenbrock/:
- rosenbrock-race.mp4 (or .gif) — animation.
- final_distances.json — for each optimizer, the distance from final position to (1, 1).
- INTERPRETATION.md — 200-word write-up.
- manifest.json — versions, seeds.
Plus a static fallback PNG of the final trajectories (in case the animation doesn't render in CI).
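A possible shape for final_distances.json is a flat mapping from optimizer name to Euclidean distance. The lab does not fix a schema, so the layout below is an assumption, and the helper names `final_distances` and `write_final_distances` are illustrative. The `demo` trajectories are placeholders for exercising the code, not experimental results.

```python
import json
import numpy as np

OPTIMUM = np.array([1.0, 1.0])

def final_distances(trajectories):
    """Map each optimizer name to the distance from its last iterate to (1, 1)."""
    return {
        name: float(np.linalg.norm(np.asarray(traj)[-1] - OPTIMUM))
        for name, traj in trajectories.items()
    }

def write_final_distances(trajectories, path):
    """Write the distances as JSON (assumed schema: {"name": distance, ...})."""
    with open(path, "w") as fh:
        json.dump(final_distances(trajectories), fh, indent=2)

# Placeholder trajectories purely to exercise the helper; not results.
demo = {
    "ends_at_optimum": np.array([[-1.5, 2.0], [1.0, 1.0]]),
    "ends_at_origin": np.array([[-1.5, 2.0], [0.0, 0.0]]),
}
distances = final_distances(demo)
print(json.dumps(distances, indent=2))
```

In the lab itself, the `trajectories` dict from Part B would be passed to `write_final_distances` together with the path under experiments/04-optimizers-rosenbrock/.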
Acceptance¶
- All four optimizers' trajectories rendered.
- Adam's final distance to (1, 1) is < 0.1.
- SGD's final distance is > Adam's (it will be; that's the lesson).
- AdamW with weight_decay=0 matches Adam exactly (element-wise difference < 1e-12).
- INTERPRETATION.md correctly attributes Adam's advantage to per-parameter scaling, not to "Adam is just better."
Pitfalls¶
- Using the same LR for all four optimizers. SGD's safe LR is more than 10× smaller than Adam's on Rosenbrock (2e-3 vs. 5e-2 in Part B). If you use Adam's LR for SGD, SGD diverges. Tune per-optimizer.
- Forgetting t += 1 in Adam. Bias correction depends on t; if t = 0, you divide by zero.
- Using AdamW = Adam + L2. That's the wrong AdamW. Decoupled means the decay is applied to the weights directly, not added to the gradient.
- Animation rendering failures. ffmpeg may not be installed. Fall back to PillowWriter for GIF.
- Treating "Adam wins" as a universal lesson. Adam wins here because of the narrow valley. On other landscapes (well-conditioned, low noise), SGD-with-momentum can win. The lesson is about why, not which.
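The difference between the two weight-decay variants shows up already on the first step. In the sketch below, `adam_l2_step` is a hypothetical name for the incorrect variant that folds the penalty into the gradient; it is included only for contrast. The step functions repeat Part A so the block runs on its own.

```python
import numpy as np

def adam_step(params, grads, state, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    m = state.get("m", np.zeros_like(params))
    v = state.get("v", np.zeros_like(params))
    t = state.get("t", 0) + 1
    m = beta1 * m + (1 - beta1) * grads
    v = beta2 * v + (1 - beta2) * grads**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    return params - lr * m_hat / (np.sqrt(v_hat) + eps), {"m": m, "v": v, "t": t}

def adamw_step(params, grads, state, lr, weight_decay=0.0, **kw):
    # Decoupled: decay is applied to the weights, outside the Adam normalization.
    new_params, new_state = adam_step(params, grads, state, lr, **kw)
    return new_params - lr * weight_decay * params, new_state

def adam_l2_step(params, grads, state, lr, weight_decay=0.0, **kw):
    # Coupled (the pitfall): the penalty gradient is rescaled by sqrt(v_hat) too.
    return adam_step(params, grads + weight_decay * params, state, lr, **kw)

p = np.array([-1.5, 2.0])
g = np.array([-155.0, -50.0])   # Rosenbrock gradient at the start point

decoupled, _ = adamw_step(p, g, {}, lr=5e-2, weight_decay=0.1)
coupled, _ = adam_l2_step(p, g, {}, lr=5e-2, weight_decay=0.1)
plain, _ = adam_step(p, g, {}, lr=5e-2)

print("AdamW (decoupled):", decoupled)
print("Adam + L2:        ", coupled)
print("plain Adam:       ", plain)
```

On this step the coupled variant lands almost exactly where plain Adam does, because the penalty is normalized away along with the gradient, while the decoupled variant is visibly shifted toward the origin.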
Stretch¶
- Add Nesterov momentum as a fifth track. Compare to plain momentum.
- Plot loss-vs-step in a second panel. Adam should have a smoother descent curve.
- Initialize at a different point (e.g., (2, 2)) and re-race. Some patterns hold; some don't.
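For the Nesterov stretch goal, a step function in the same pure-function style might look like the sketch below. It uses a common reformulation that applies the look-ahead to the velocity instead of evaluating the gradient at a shifted point; textbook presentations differ, so treat the exact variant, and the name `nesterov_step`, as assumptions.

```python
import numpy as np

def nesterov_step(params, grads, state, lr, beta=0.9):
    """Nesterov momentum in velocity look-ahead form (assumed variant)."""
    v = state.get("v", np.zeros_like(params))
    v = beta * v + grads
    # Step along the current gradient plus the momentum-scaled updated velocity.
    new_params = params - lr * (grads + beta * v)
    return new_params, {"v": v}

p0 = np.array([-1.5, 2.0])
g0 = np.array([-155.0, -50.0])   # Rosenbrock gradient at the start point
p1, state = nesterov_step(p0, g0, {}, lr=2e-3)
print("first Nesterov step lands at:", p1)
```

Because it has the same signature as `momentum_step`, it can be passed to `run` unchanged. A stable learning rate for it is not given in the lab and would need to be tuned.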
Next: 03-lr-schedules.md