Lab 02: SGD (with momentum) and Adam
Goal: implement two optimizers from scratch, SGD with optional momentum and Adam with bias correction, and cross-check Adam's trajectory against torch.optim.Adam on a toy quadratic. ~70 LOC by Borja.
Estimated time: 120–150 minutes.
Prereqs: Lab 00 + Lab 01 closed. Theory 03 read.
The math was derived in Phase 4; here we only embody it in step() and zero_grad(). The one technical point is Adam's bias correction, which is exactly the sum of the geometric series 1 + β + β² + .... The cross-check against torch.optim.Adam tells you whether your formula is off by one, an error that is very easy to make in the counter t.
What you produce
- src/minimodel/optim.py: Optimizer base class, SGD, Adam.
- tests/test_optimizers.py: convergence tests and the PyTorch cross-check.
Math reference (paste into your journal first)
SGD (no momentum): θ ← θ - η · g.
SGD with momentum (dampened form, which is what we implement):
v ← β · v + (1 - β) · g
θ ← θ - η · v
PyTorch computes v ← β · v + (1 - dampening) · g, so the dampened form above corresponds to dampening = β. With dampening=0 (PyTorch's default), the update is v ← β · v + g; θ ← θ - η · v. Pick the dampened form ((1-β)·g) so the moving-average interpretation is clean; document the divergence from PyTorch's default.
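The difference between the dampened form ((1-β)·g) and the non-dampened form (+ g) can be seen on a constant gradient. This is a minimal sketch, not part of the lab code; the helper name `velocity_after` is hypothetical. With the dampened form the velocity settles at g (a moving average), while the non-dampened form settles at g / (1 - β), which acts like a larger effective learning rate.

```python
def velocity_after(steps: int, beta: float, g: float, dampened: bool) -> float:
    """Run the velocity recursion from v = 0 on a constant gradient g."""
    v = 0.0
    for _ in range(steps):
        # Dampened: exponential moving average of g.
        # Non-dampened: decayed running sum of g.
        v = beta * v + ((1.0 - beta) * g if dampened else g)
    return v


beta, g = 0.9, 2.0
v_dampened = velocity_after(500, beta, g, dampened=True)
v_plain = velocity_after(500, beta, g, dampened=False)

print(v_dampened)  # approaches g
print(v_plain)     # approaches g / (1 - beta)
```

One consequence: a learning rate tuned for one form does not transfer to the other without rescaling by roughly (1 - β).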
Adam:
t ← t + 1 # 1-indexed; the FIRST update sets t=1
m ← β₁ · m + (1 - β₁) · g
v ← β₂ · v + (1 - β₂) · g²
m̂ ← m / (1 - β₁^t) # bias correction
v̂ ← v / (1 - β₂^t)
θ ← θ - η · m̂ / (√v̂ + ε)
Defaults: β₁ = 0.9, β₂ = 0.999, ε = 1e-8, lr = 1e-3.
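Where the bias-correction factor comes from, as a sketch. It assumes m starts at zero and that the expected gradient E[g] is roughly constant across the first t steps; both are simplifying assumptions, not statements about the lab's data. Unrolling the recursion for m and taking expectations leaves a finite geometric series:

```latex
m_t = (1-\beta_1)\sum_{i=1}^{t} \beta_1^{\,t-i}\, g_i
\quad\Longrightarrow\quad
\mathbb{E}[m_t] \approx \mathbb{E}[g]\,(1-\beta_1)\sum_{k=0}^{t-1} \beta_1^{k}
= \mathbb{E}[g]\,\bigl(1-\beta_1^{t}\bigr)
```

Dividing m by (1 - β₁^t) therefore removes the shrinkage toward zero; the same argument with g² and β₂ gives the factor for v.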
AdamW is NOT this lab. The weight-decay-vs-L2 distinction is mentioned here so you know about it; Phase 10 implements it after deriving why L2 inside the gradient interacts badly with the Adam preconditioning.
TODOs
Block A: Optimizer base class
In src/minimodel/optim.py:
from typing import Iterable

import numpy as np

from minimodel.nn.module import Parameter


class Optimizer:
    """Base class. Materializes params as a list; per-parameter state keyed by id(p)."""

    def __init__(self, params: Iterable[Parameter], lr: float) -> None:
        # TODO: self.params = list(params). NOTE: iterables exhaust.
        # TODO: self.lr = lr.
        # TODO: self.state: dict[int, dict[str, np.ndarray]] = {id(p): {} for p in self.params}
        # NOTE: Adam also stores the int counter "t" here, so the value type
        # may need widening to keep mypy --strict clean.
        raise NotImplementedError

    def step(self) -> None:
        raise NotImplementedError

    def zero_grad(self) -> None:
        # TODO: for p in self.params: p.grad = None.
        raise NotImplementedError
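The "iterables exhaust" note matters because model parameters are often handed over as a generator. The sketch below uses a hypothetical `fake_parameters` generator as a stand-in (strings instead of real Parameter objects) to show what happens on a second pass, and why materializing with list(...) once avoids it.

```python
from typing import Iterable, Iterator


def fake_parameters() -> Iterator[str]:
    """Hypothetical stand-in for a model's parameter generator."""
    yield "weight"
    yield "bias"


def count_twice(params: Iterable[str]) -> tuple[int, int]:
    """Iterate over params twice and report how many items each pass saw."""
    first = sum(1 for _ in params)
    second = sum(1 for _ in params)
    return first, second


lazy_counts = count_twice(fake_parameters())        # generator: second pass is empty
list_counts = count_twice(list(fake_parameters()))  # list: both passes see everything

print(lazy_counts)  # (2, 0)
print(list_counts)  # (2, 2)
```

An optimizer that kept the raw generator would build its state on the first pass and then silently update nothing in step().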
Block B: SGD
class SGD(Optimizer):
    def __init__(
        self,
        params: Iterable[Parameter],
        lr: float,
        momentum: float = 0.0,
    ) -> None:
        super().__init__(params, lr)
        self.momentum = momentum
        if momentum > 0:
            # TODO: for each param, initialize state["velocity"] = np.zeros_like(p.data).
            raise NotImplementedError

    def step(self) -> None:
        # TODO: for each param p:
        #   if p.grad is None: continue
        #   g = p.grad
        #   if self.momentum > 0:
        #       v = self.state[id(p)]["velocity"]
        #       v = self.momentum * v + (1 - self.momentum) * g   # dampened form
        #       self.state[id(p)]["velocity"] = v
        #       update = v
        #   else:
        #       update = g
        #   p.data = p.data - self.lr * update
        raise NotImplementedError
- p.data = p.data - self.lr * update is an out-of-place op. Do NOT use p.data -= ...: in-place mutation can confuse autograd if any tensor view still references the old buffer.
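The out-of-place versus in-place distinction can be illustrated with plain NumPy arrays. This sketch makes no claim about how the lab's autograd holds references; the `alias` variable simply stands in for any other object that still points at the old buffer.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0])
alias = data                          # second reference to the same buffer
update = np.array([0.5, 0.5, 0.5])

# Out-of-place: allocates a new array; the original buffer is untouched.
rebound = data - 0.1 * update
out_of_place_left_alias_alone = bool(np.array_equal(alias, [1.0, 2.0, 3.0]))

# In-place: writes into the shared buffer, so every alias sees the change.
data -= 0.1 * update
in_place_changed_alias = bool(np.allclose(alias, [0.95, 1.95, 2.95]))

print(out_of_place_left_alias_alone, in_place_changed_alias)
```

Rebinding p.data to the new array keeps whatever the graph saved earlier consistent with the values it was computed from.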
Block C: Adam
class Adam(Optimizer):
    def __init__(
        self,
        params: Iterable[Parameter],
        lr: float = 1e-3,
        betas: tuple[float, float] = (0.9, 0.999),
        eps: float = 1e-8,
    ) -> None:
        super().__init__(params, lr)
        self.beta1, self.beta2 = betas
        self.eps = eps
        # TODO: for each param, initialize state["m"] = zeros_like, state["v"] = zeros_like,
        #       state["t"] = 0 (the per-parameter step counter).
        raise NotImplementedError

    def step(self) -> None:
        # TODO: for each param p:
        #   if p.grad is None: continue
        #   st = self.state[id(p)]
        #   st["t"] += 1                       # 1-indexed; FIRST step is t=1
        #   t = st["t"]
        #   g = p.grad
        #   st["m"] = self.beta1 * st["m"] + (1 - self.beta1) * g
        #   st["v"] = self.beta2 * st["v"] + (1 - self.beta2) * (g * g)
        #   m_hat = st["m"] / (1 - self.beta1 ** t)
        #   v_hat = st["v"] / (1 - self.beta2 ** t)
        #   p.data = p.data - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
        raise NotImplementedError
- Off-by-one trap. The very first call to step() must use t=1, not t=0. With t=0, 1 - β^0 = 0 and you divide by zero.
Tests
In tests/test_optimizers.py:
Block D: SGD convergence
- test_sgd_reduces_quadratic: Optimize f(θ) = (θ - 3)² from θ₀ = 0 with lr=0.1 for 100 steps. Assert |θ - 3| < 1e-3 at the end and that the loss is monotonically non-increasing.
- test_sgd_momentum_faster_than_plain: Same quadratic. Compare plain SGD (momentum=0) to SGD with momentum=0.9. After 30 steps, momentum-SGD must be closer to the optimum (strict inequality). This is a regression test on the implementation, not a deep theorem.
- test_sgd_zero_grad_resets: Manually set a parameter's grad, call zero_grad, assert grad is None.
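Before committing to the strict inequality in test_sgd_momentum_faster_than_plain, it may be worth running the comparison in plain Python. The sketch below assumes the dampened update from Block B, the quadratic f(θ) = (θ - 3)², and lr=0.1; the helper `final_error` is hypothetical and works on scalars only. Under those assumptions plain SGD appears to finish closer to the optimum after 30 steps: its error shrinks by a factor 0.8 per step, while the momentum recursion has complex roots of modulus √0.9 ≈ 0.95. If your run agrees, the test's learning rate or step count probably needs adjusting before the assertion can hold.

```python
def final_error(lr: float, momentum: float, steps: int) -> float:
    """Distance to the optimum of f(theta) = (theta - 3)^2 after `steps` updates."""
    theta, v = 0.0, 0.0
    for _ in range(steps):
        g = 2.0 * (theta - 3.0)                      # gradient of the quadratic
        if momentum > 0:
            v = momentum * v + (1.0 - momentum) * g  # dampened form, as in Block B
            update = v
        else:
            update = g
        theta = theta - lr * update
    return abs(theta - 3.0)


plain = final_error(lr=0.1, momentum=0.0, steps=30)
with_momentum = final_error(lr=0.1, momentum=0.9, steps=30)

print(f"plain SGD error:    {plain:.6f}")
print(f"momentum SGD error: {with_momentum:.6f}")
```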
Block E: Adam convergence
- test_adam_reduces_quadratic: Same f(θ) = (θ - 3)² from θ₀ = 0. With lr=0.1 and default betas, Adam reaches |θ - 3| < 1e-2 within 200 steps.
- test_adam_first_step_uses_t_eq_1: Sanity-check the bias correction: with lr=0.1, after ONE step on f(θ) = θ² with θ₀=1, the update magnitude is ≈ lr (because m̂ and √v̂ are both ≈ |g| after bias correction). Assert |θ - (1 - 0.1)| < 1e-6. With t=0 (the off-by-one bug), this test produces nan or wildly wrong values.
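The arithmetic behind test_adam_first_step_uses_t_eq_1 can be worked through on scalars. This is a hand calculation in Python, not the lab's Adam class; the second half imitates one version of the off-by-one bug (using t=0 in the correction) to show where the nan comes from.

```python
import numpy as np

lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
theta = 1.0
g = 2.0 * theta                       # gradient of theta^2 at theta = 1

t = 1                                 # first update, 1-indexed
m = (1 - beta1) * g                   # about 0.2
v = (1 - beta2) * g * g               # about 0.004
m_hat = m / (1 - beta1 ** t)          # about 2.0, i.e. g
v_hat = v / (1 - beta2 ** t)          # about 4.0, i.e. g^2
theta_after = theta - lr * m_hat / (np.sqrt(v_hat) + eps)

# Off-by-one variant: t = 0 makes both correction factors exactly zero.
with np.errstate(divide="ignore", invalid="ignore"):
    bad_m_hat = np.float64(m) / np.float64(1 - beta1 ** 0)   # inf
    bad_v_hat = np.float64(v) / np.float64(1 - beta2 ** 0)   # inf
    theta_bad = theta - lr * bad_m_hat / (np.sqrt(bad_v_hat) + eps)  # inf / inf

print(theta_after)  # close to 0.9
print(theta_bad)    # nan
```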
Block F: PyTorch cross-check
Constraint: PyTorch enters this test file only. It does NOT enter src/minimodel/. The cross-check exists to verify our formulas, not to import the framework into the library.
- test_adam_matches_torch_on_quadratic:

    import torch  # test fixture only

    np.random.seed(0)
    torch.manual_seed(0)

    # Toy quadratic: minimize ||θ - target||² over a 5-dim θ.
    target_np = np.random.randn(5)

    # OUR Adam.
    theta = Parameter(np.zeros(5))
    our_opt = Adam([theta], lr=1e-2)
    for _ in range(100):
        our_opt.zero_grad()
        # loss = sum((theta - target)²). Use minitorch ops so .backward() populates theta.grad.
        # TODO: build the loss tensor, call loss.backward(), then our_opt.step().
        ...

    # PyTorch Adam. Use float64 so the reference matches NumPy's default precision.
    theta_t = torch.zeros(5, dtype=torch.float64, requires_grad=True)
    target_t = torch.tensor(target_np)
    torch_opt = torch.optim.Adam([theta_t], lr=1e-2)
    for _ in range(100):
        torch_opt.zero_grad()
        loss_t = ((theta_t - target_t) ** 2).sum()
        loss_t.backward()
        torch_opt.step()

    assert np.allclose(theta.data, theta_t.detach().numpy(), atol=1e-5)

  Tolerance: 1e-5 over 100 steps. If the test fails by 1e-2, suspect the off-by-one in t. If it drifts by about 1e-6 only after many steps, suspect a missing eps or a wrong default β.
Block G: Edge cases
- test_optimizer_with_none_grad_skips: A parameter with grad = None must not raise; step() simply skips it.
- test_optimizer_state_isolated_per_parameter: Two parameters of identical shape must NOT share state["m"] or state["velocity"]. Verify by id(...) and by mutating one and checking the other.
- test_optimizer_does_not_track_late_added_params: Construct an Adam([p1]), then create p2 and ensure opt.state has no key for id(p2). Document: parameters added after the optimizer is built are not tracked. PyTorch has the same behavior.
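One way state can end up shared between parameters of identical shape, which is what test_optimizer_state_isolated_per_parameter guards against. The sketch uses hypothetical string keys in place of id(p) and is only an illustration of the aliasing mechanism, not a description of any particular implementation.

```python
import numpy as np

shapes = {"p1": (3,), "p2": (3,)}     # hypothetical parameters, identical shapes

# Pitfall: dict.fromkeys reuses ONE array object as the value for every key.
shared = dict.fromkeys(shapes, np.zeros(3))
shared["p1"] += 1.0                   # in-place add on the single shared buffer
shared_leaks = bool(shared["p2"][0] == 1.0)

# Safer: a comprehension allocates a fresh buffer per key.
isolated = {name: np.zeros(shape) for name, shape in shapes.items()}
isolated["p1"] += 1.0
isolated_ok = bool(isolated["p2"][0] == 0.0) and isolated["p1"] is not isolated["p2"]

print(shared_leaks, isolated_ok)
```

Checking both object identity and the effect of a mutation, as the test asks, catches this case even when the two buffers happen to hold equal values.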
Constraints
- No PyTorch in src/minimodel/. PyTorch is allowed in tests/test_optimizers.py only, as a reference oracle.
- No AdamW, no learning-rate schedulers, no gradient clipping. Those come in Phase 10 and Phase 18.
- SGD does not implement Nesterov momentum. Mention it in a comment; don't implement.
- A13 scope: optimizers are domain-agnostic; no verb-specific code in this lab.
Pitfalls
- Off-by-one in t. The first step() must use t=1. Test it (Block E).
- AdamW vs Adam + L2 weight decay. Adam with L2 decay folds λθ into the gradient before the preconditioning, which means the effective decay is not constant. AdamW applies θ ← θ - lr·λ·θ outside the preconditioning. We do not implement either weight-decay form in this lab; just know they're different. Phase 10 closes the gap.
- Shared Parameter aliasing. Two model attributes pointing at the same Parameter (Lab 00, Block F) end up in self.params twice. The optimizer updates the parameter twice per step. This is wrong for tied embeddings (Phase 17 will deduplicate). For now: document, do not fix.
- state keyed by id(p) vs by index. id keys survive when self.params is reordered (it isn't, but the invariant matters); index keys break. Stick with id.
- np.sqrt of a zero or near-zero v̂. The + eps goes outside the square root, in the denominator: lr · m̂ / (√v̂ + ε). The PyTorch reference computes it this way; matching it to 1e-5 requires matching the placement of eps.
- p.grad is set to None by zero_grad, not zeroed. PyTorch convention. Test it.
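How much the placement of eps in lr · m̂ / (√v̂ + ε) can matter, shown on scalars. The moment values below are illustrative and chosen to be tiny on purpose; they are not taken from any run of the lab.

```python
import numpy as np

m_hat, v_hat, eps = 1e-8, 1e-16, 1e-8      # illustrative tiny moments

outside = m_hat / (np.sqrt(v_hat) + eps)   # eps added after the square root
inside = m_hat / np.sqrt(v_hat + eps)      # eps added under the square root

print(outside)  # about 0.5
print(inside)   # about 1e-4
```

For gradients of ordinary size the two forms nearly coincide, which is why a misplaced eps tends to show up only as a small discrepancy against the reference.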
Stop conditions
Done when:
- SGD is ≤ 30 lines, Adam is ≤ 40 lines.
- All tests in Blocks D–G green, including the PyTorch cross-check at 1e-5.
- mypy --strict src/minimodel/optim.py clean.
- ruff check src/minimodel/optim.py clean.
- You can derive the bias-correction factor 1 - β^t from the geometric series (one line in your journal).
- You can explain in one sentence why AdamW is not the same as Adam + L2.
When to consult solutions/
After all tests pass. solutions/02-optimizers-ref.md (at phase open) walks through the alternative formulations (PyTorch's non-dampened SGD, Nesterov momentum, AdamW) and where they would slot in.
Next lab: lab/03-train-tense-mlp.md.