Lab 02 — Race SGD, Momentum, Adam, AdamW on Rosenbrock

Four optimizers on the same track: the Rosenbrock function, a narrow, curved valley. You animate the trajectories and see why Adam wins in ill-conditioned valleys.

Objective

Implement four optimizers as pure-function *_step routines in src/minigrad/optim.py. Run each on the Rosenbrock function f(x, y) = (1 - x)² + 100(y - x²)² from a common start, log the trajectory, and produce a single animated contour plot with all four trajectories overlaid.

Setup

  • Phase 03 (linear algebra) and the theory files of Phase 04.
  • The Rosenbrock function: non-convex, with a single global minimum at (1, 1) that sits inside a narrow, curved valley. It is a classic test of optimizer behaviour in poorly conditioned regions.
  • matplotlib, numpy.
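
To put a number on "poorly conditioned", one can evaluate the Hessian of f at the minimum. The sketch below uses the analytic second derivatives of f(x, y) = (1 - x)² + 100(y - x²)²; the helper name rosenbrock_hessian is an illustrative choice, not part of minigrad.

```python
import numpy as np

def rosenbrock_hessian(p):
    """Analytic Hessian of f(x, y) = (1 - x)**2 + 100 * (y - x**2)**2."""
    x, y = p
    dxx = 2 - 400 * (y - x**2) + 800 * x**2
    dxy = -400 * x
    dyy = 200.0
    return np.array([[dxx, dxy], [dxy, dyy]])

H = rosenbrock_hessian((1.0, 1.0))
eigvals = np.linalg.eigvalsh(H)      # ascending order for a symmetric matrix
cond = eigvals[-1] / eigvals[0]      # steepest curvature / flattest curvature
gd_lr_limit = 2.0 / eigvals[-1]      # local stability bound for plain gradient descent

print(f"eigenvalues:      {eigvals}")
print(f"condition number: {cond:.0f}")
print(f"2 / lambda_max:   {gd_lr_limit:.2e}")
```

At (1, 1) the curvature across the valley is roughly 2500 times the curvature along it, and the local bound 2 / lambda_max comes out close to 2e-3. That is worth keeping in mind when choosing SGD's learning rate in Part B.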

The Rosenbrock function

import numpy as np

def rosenbrock(p):
    x, y = p
    return (1 - x)**2 + 100 * (y - x**2)**2

def rosenbrock_grad(p):
    # Analytic gradient: (df/dx, df/dy)
    x, y = p
    dx = -2 * (1 - x) + 100 * 2 * (y - x**2) * (-2 * x)
    dy = 100 * 2 * (y - x**2)
    return np.array([dx, dy])

Minimum at (1, 1) where f = 0. Start every optimizer at (-1.5, 2.0).
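
Before racing anything, it can help to confirm the analytic gradient against a finite-difference estimate. The following is a minimal sketch; numerical_grad and its step size h are illustrative assumptions, not part of the lab's required API.

```python
import numpy as np

def rosenbrock(p):
    x, y = p
    return (1 - x)**2 + 100 * (y - x**2)**2

def rosenbrock_grad(p):
    x, y = p
    dx = -2 * (1 - x) + 100 * 2 * (y - x**2) * (-2 * x)
    dy = 100 * 2 * (y - x**2)
    return np.array([dx, dy])

def numerical_grad(f, p, h=1e-6):
    """Central-difference approximation of the gradient (hypothetical helper)."""
    p = np.asarray(p, dtype=float)
    g = np.zeros_like(p)
    for i in range(p.size):
        step = np.zeros_like(p)
        step[i] = h
        g[i] = (f(p + step) - f(p - step)) / (2 * h)
    return g

start = np.array([-1.5, 2.0])
analytic = rosenbrock_grad(start)
numeric = numerical_grad(rosenbrock, start)
max_abs_err = np.max(np.abs(analytic - numeric))

print("analytic:", analytic)
print("numeric: ", numeric)
print("max abs error:", max_abs_err)
```

If the two disagree by more than a small tolerance, the sign or factor error is almost always in the dx term.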

Tasks

Part A — Implement the four optimizers in src/minigrad/optim.py

Pure-function style with explicit state dicts.

import numpy as np

def sgd_step(params, grads, state, lr):
    """state is unused for vanilla SGD; kept for API uniformity."""
    return params - lr * grads, state

def momentum_step(params, grads, state, lr, beta=0.9):
    v = state.get("v", np.zeros_like(params))
    v = beta * v + grads
    new_params = params - lr * v
    return new_params, {"v": v}

def adam_step(params, grads, state, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    m = state.get("m", np.zeros_like(params))
    v = state.get("v", np.zeros_like(params))
    t = state.get("t", 0) + 1
    m = beta1 * m + (1 - beta1) * grads
    v = beta2 * v + (1 - beta2) * grads**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    new_params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return new_params, {"m": m, "v": v, "t": t}

def adamw_step(params, grads, state, lr, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.0):
    # Adam-with-decoupled-weight-decay: same as adam_step but with weight_decay applied directly
    new_params, new_state = adam_step(params, grads, state, lr, beta1, beta2, eps)
    new_params = new_params - lr * weight_decay * params
    return new_params, new_state

Part B — Race them on Rosenbrock

def run(opt_fn, lr, n_steps=2000, start=(-1.5, 2.0), **opt_kwargs):
    p = np.array(start, dtype=float)
    state = {}
    traj = [p.copy()]
    for _ in range(n_steps):
        g = rosenbrock_grad(p)
        p, state = opt_fn(p, g, state, lr, **opt_kwargs)
        traj.append(p.copy())
    return np.array(traj)

Run with per-optimizer-tuned learning rates (since they have different stable LR ranges):

trajectories = {
    "SGD":      run(sgd_step,      lr=2e-3),
    "Momentum": run(momentum_step, lr=2e-3, beta=0.9),
    "Adam":     run(adam_step,     lr=5e-2),
    "AdamW":    run(adamw_step,    lr=5e-2, weight_decay=0.0),
}

(AdamW with weight_decay=0 is identical to Adam — that's the sanity check.)
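
The sanity check can be run as a short script. This sketch repeats the Part A and Part B definitions in compact form so it runs on its own; in the lab you would import them from src/minigrad/optim.py instead. The name max_diff is illustrative.

```python
import numpy as np

def rosenbrock_grad(p):
    x, y = p
    return np.array([-2 * (1 - x) - 400 * x * (y - x**2), 200 * (y - x**2)])

def adam_step(params, grads, state, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    m = state.get("m", np.zeros_like(params))
    v = state.get("v", np.zeros_like(params))
    t = state.get("t", 0) + 1
    m = beta1 * m + (1 - beta1) * grads
    v = beta2 * v + (1 - beta2) * grads**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    return params - lr * m_hat / (np.sqrt(v_hat) + eps), {"m": m, "v": v, "t": t}

def adamw_step(params, grads, state, lr, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.0):
    new_params, new_state = adam_step(params, grads, state, lr, beta1, beta2, eps)
    return new_params - lr * weight_decay * params, new_state

def run(opt_fn, lr, n_steps=2000, start=(-1.5, 2.0), **opt_kwargs):
    p, state = np.array(start, dtype=float), {}
    traj = [p.copy()]
    for _ in range(n_steps):
        p, state = opt_fn(p, rosenbrock_grad(p), state, lr, **opt_kwargs)
        traj.append(p.copy())
    return np.array(traj)

adam_traj = run(adam_step, lr=5e-2)
adamw_traj = run(adamw_step, lr=5e-2, weight_decay=0.0)

# Largest element-wise gap between the two trajectories over all steps.
max_diff = np.max(np.abs(adam_traj - adamw_traj))
print("max element-wise difference:", max_diff)
```

A nonzero gap here usually means the decay term was folded into the gradient or the Adam state was not passed through unchanged.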

Part C — Animate

import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

# Contour
x_grid = np.linspace(-2, 2, 400)
y_grid = np.linspace(-1, 3, 400)
X, Y = np.meshgrid(x_grid, y_grid)
Z = (1 - X)**2 + 100 * (Y - X**2)**2

fig, ax = plt.subplots(figsize=(8, 6))
ax.contour(X, Y, Z, levels=np.logspace(-1, 3, 20), cmap="viridis", alpha=0.5)
ax.plot(1, 1, "r*", markersize=15, label="optimum")

lines = {name: ax.plot([], [], "-", label=name)[0] for name in trajectories}
ax.legend()
ax.set_xlim(-2, 2)
ax.set_ylim(-1, 3)

def init():
    for line in lines.values():
        line.set_data([], [])
    return list(lines.values())

def update(frame):
    for name, line in lines.items():
        traj = trajectories[name]
        # frame + 1 so the point at index `frame` is drawn, including the final one
        line.set_data(traj[:frame + 1, 0], traj[:frame + 1, 1])
    return list(lines.values())

anim = FuncAnimation(fig, update, init_func=init, frames=range(0, 2001, 20), interval=50, blit=True)
anim.save("rosenbrock-race.mp4", writer="ffmpeg")  # or .gif via PillowWriter
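
If ffmpeg might be missing (for example in CI), the output format can be chosen at run time. This is a sketch under the assumption that checking for the ffmpeg binary on PATH is a good enough test; pick_output is a hypothetical helper, and "pillow" is the registered name of matplotlib's PillowWriter.

```python
import shutil

def pick_output(basename="rosenbrock-race"):
    """Return (filename, writer_name): mp4 via ffmpeg if available, else gif via pillow."""
    if shutil.which("ffmpeg") is not None:
        return f"{basename}.mp4", "ffmpeg"
    return f"{basename}.gif", "pillow"

filename, writer = pick_output()
print(filename, writer)

# In the lab script, with `anim` being the FuncAnimation built above:
# anim.save(filename, writer=writer)
```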

Part D — Interpret

Write a 200-word interpretation in experiments/04-optimizers-rosenbrock/INTERPRETATION.md. Address:

  1. SGD's path. Why does it oscillate in the steep direction?
  2. Momentum's path. Smoother, but does it overshoot? Where?
  3. Adam's path. Why does it navigate the narrow valley so much better? (Hint: per-parameter normalization by sqrt(v̂).)
  4. AdamW with weight_decay=0. Verify it tracks Adam exactly. Why is this a useful sanity check?
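
The hint in item 3 can be made concrete by computing the very first update by hand. The sketch below uses the gradient at the start point (-1.5, 2.0) and the Part B learning rates; variable names are illustrative.

```python
import numpy as np

lr, beta1, beta2, eps = 5e-2, 0.9, 0.999, 1e-8
g = np.array([-155.0, -50.0])    # Rosenbrock gradient at the start (-1.5, 2.0)

# First Adam step (t = 1), starting from m = v = 0.
m = (1 - beta1) * g
v = (1 - beta2) * g**2
m_hat = m / (1 - beta1**1)       # bias correction recovers g
v_hat = v / (1 - beta2**1)       # bias correction recovers g**2
adam_update = lr * m_hat / (np.sqrt(v_hat) + eps)

sgd_update = 2e-3 * g            # plain SGD at the Part B learning rate

print("Adam update:", adam_update)   # each coordinate has magnitude close to lr
print("SGD update: ", sgd_update)    # each coordinate scales with the raw gradient
```

On this step Adam moves both coordinates by about the same amount even though the x-gradient is roughly three times the y-gradient, while SGD's step inherits that imbalance directly.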

Part E — A second sweep with weight_decay > 0

Run AdamW with weight_decay = 0.01 and weight_decay = 0.1. Add to the plot. Expected: heavier decay pulls the trajectory toward the origin, slowing convergence to (1, 1). Annotate.
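
One way to organize the sweep is a dict keyed by the decay value. This is a self-contained sketch: adamw_step is written here with the Adam update inlined (an equivalent restatement for brevity, not the Part A code), and the names sweep and final are illustrative.

```python
import numpy as np

def rosenbrock_grad(p):
    x, y = p
    return np.array([-2 * (1 - x) - 400 * x * (y - x**2), 200 * (y - x**2)])

def adamw_step(params, grads, state, lr, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.0):
    m = state.get("m", np.zeros_like(params))
    v = state.get("v", np.zeros_like(params))
    t = state.get("t", 0) + 1
    m = beta1 * m + (1 - beta1) * grads
    v = beta2 * v + (1 - beta2) * grads**2
    adam_update = lr * (m / (1 - beta1**t)) / (np.sqrt(v / (1 - beta2**t)) + eps)
    decay = lr * weight_decay * params       # decoupled: acts on the weights, not the gradient
    return params - adam_update - decay, {"m": m, "v": v, "t": t}

def run(opt_fn, lr, n_steps=2000, start=(-1.5, 2.0), **opt_kwargs):
    p, state = np.array(start, dtype=float), {}
    traj = [p.copy()]
    for _ in range(n_steps):
        p, state = opt_fn(p, rosenbrock_grad(p), state, lr, **opt_kwargs)
        traj.append(p.copy())
    return np.array(traj)

optimum = np.array([1.0, 1.0])
sweep = {wd: run(adamw_step, lr=5e-2, weight_decay=wd) for wd in (0.0, 0.01, 0.1)}
final = {wd: float(np.linalg.norm(traj[-1] - optimum)) for wd, traj in sweep.items()}

for wd, dist in final.items():
    print(f"weight_decay={wd}: final distance to (1, 1) = {dist:.4f}")
```

Each trajectory can then be drawn on the existing axes with ax.plot(traj[:, 0], traj[:, 1], label=f"AdamW wd={wd}"), which also takes care of labelling the curves.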

Deliverable

experiments/04-optimizers-rosenbrock/:

  • rosenbrock-race.mp4 (or .gif): animation.
  • final_distances.json: for each optimizer, the distance from final position to (1, 1).
  • INTERPRETATION.md: 200-word write-up.
  • manifest.json: versions, seeds.

Plus a static fallback PNG of the final trajectories (in case the animation doesn't render in CI).
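
The two JSON files could be produced along these lines. This is only a sketch: write_deliverables, the manifest keys, and the placeholder trajectories are assumptions made for illustration, and the demo writes to a temporary directory so it does not touch the real experiments/ folder.

```python
import json
import platform
import tempfile
from pathlib import Path

import numpy as np

def write_deliverables(out_dir, trajectories, optimum=(1.0, 1.0)):
    """Write final_distances.json and manifest.json; return the distances dict."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = np.asarray(optimum, dtype=float)
    distances = {
        name: float(np.linalg.norm(np.asarray(traj)[-1] - target))
        for name, traj in trajectories.items()
    }
    manifest = {
        "python": platform.python_version(),
        "numpy": np.__version__,
        "seed": None,  # the race itself is deterministic; record a seed if you add noise
    }
    (out_dir / "final_distances.json").write_text(json.dumps(distances, indent=2))
    (out_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return distances

# Placeholder trajectories so the sketch runs on its own; in the lab, pass the
# `trajectories` dict built in Part B instead.
demo = {
    "A": np.array([[-1.5, 2.0], [1.0, 1.0]]),
    "B": np.array([[-1.5, 2.0], [1.0, 4.0]]),
}

with tempfile.TemporaryDirectory() as tmp:
    target_dir = Path(tmp) / "04-optimizers-rosenbrock"
    distances = write_deliverables(target_dir, demo)
    saved = json.loads((target_dir / "final_distances.json").read_text())
    manifest_exists = (target_dir / "manifest.json").exists()

print(distances)
```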

Acceptance

  • All four optimizers' trajectories rendered.
  • Adam's final distance to (1, 1) is < 0.1.
  • SGD's final distance is > Adam's (it will be — that's the lesson).
  • AdamW with weight_decay=0 matches Adam exactly (element-wise difference < 1e-12).
  • INTERPRETATION.md correctly attributes Adam's advantage to per-parameter scaling, not to "Adam is just better."

Pitfalls

  • Using the same LR for all four optimizers. SGD's usable LR on Rosenbrock is more than an order of magnitude smaller than Adam's (2e-3 vs 5e-2 in Part B). If you use Adam's LR for SGD, SGD diverges. Tune per-optimizer.
  • Forgetting to increment t in Adam. Bias correction divides by 1 - beta1**t and 1 - beta2**t; if t stays at 0, both denominators are zero and you divide by zero.
  • Using AdamW = Adam + L2. That's the wrong AdamW. Decoupled means applied to weights directly, not added to the gradient.
  • Animation rendering failures. ffmpeg may not be installed. Fall back to PillowWriter for GIF.
  • Treating "Adam wins" as a universal lesson. Adam wins here because of the narrow valley. On other landscapes (well-conditioned, low noise), SGD-with-momentum can win. The lesson is about why, not which.

Stretch

  • Add Nesterov momentum as a fifth track. Compare to plain momentum.
  • Plot loss-vs-step in a second panel. Adam should have a smoother descent curve.
  • Initialize at a different point (e.g., (2, 2)) and re-race. Some patterns hold; some don't.
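
For the Nesterov track, a step function in the same pure-function style might look like the sketch below. It assumes the common formulation in which the look-ahead is applied to the update (v = beta * v + g, step = g + beta * v); the lab text does not prescribe a specific variant, so treat the formula and the name nesterov_step as assumptions.

```python
import numpy as np

def nesterov_step(params, grads, state, lr, beta=0.9):
    """Nesterov momentum with the same signature as momentum_step (assumed variant)."""
    v = state.get("v", np.zeros_like(params))
    v = beta * v + grads
    new_params = params - lr * (grads + beta * v)   # gradient plus look-ahead on the velocity
    return new_params, {"v": v}

# Smoke test on f(p) = 0.5 * ||p||^2, whose gradient is p itself.
p, state = np.array([1.0, -2.0]), {}
for _ in range(200):
    p, state = nesterov_step(p, p, state, lr=0.1)

print("distance to the minimum after 200 steps:", np.linalg.norm(p))
```

Because the signature matches the other *_step functions, it can be passed to run unchanged, e.g. run(nesterov_step, lr=2e-3, beta=0.9) as a starting point for tuning.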

Next: 03-lr-schedules.md