English · Español
03 — Optimizers: SGD, momentum, Adam¶
🇪🇸 Phase 4 derivó los optimizadores en abstracto, sobre un
dictde arrays. Phase 9 los reescribe sobreParameters — la única diferencia es que ahora el "estado" del optimizador (los momentos para Adam, las velocidades para momentum) vive en un dict indexado porid(parameter), no por nombre. La matemática es idéntica. Lo nuevo es la interfazstep()+zero_grad()que se inserta en el bucle de entrenamiento sin que tengas que pensar.
The Optimizer base class¶
class Optimizer:
"""Base class for all optimizers."""
def __init__(self, params: Iterable[Parameter], lr: float) -> None:
self.params: list[Parameter] = list(params)
self.lr = lr
# Per-parameter state, keyed by id(p). Subclasses populate this.
self.state: dict[int, dict[str, np.ndarray]] = {id(p): {} for p in self.params}
def step(self) -> None:
raise NotImplementedError
def zero_grad(self) -> None:
for p in self.params:
p.grad = None
Twenty lines. Decisions baked in:
paramsis materialized as a list once. Iterables exhaust; the list is the optimizer's record of what it owns. Document: parameters added to the model after the optimizer is created are NOT tracked.- State keyed by
id(p). Identity-based. The Parameter object's lifecycle and the optimizer's lifecycle are tied — if a Parameter is replaced (rare, but possible), the optimizer's state for the oldidbecomes stale. PyTorch handles this; we accept the limitation. zero_gradlives on the optimizer too. PyTorch convention.optim.zero_grad()andmodel.zero_grad()do the same thing (both walkparamssettinggrad = None).- No
state_dict()for the optimizer yet. Phase 18 (training loop) adds it for resumable training.
SGD: the simplest possible¶
Plain SGD:
class SGD(Optimizer):
def __init__(self, params, lr: float, momentum: float = 0.0) -> None:
super().__init__(params, lr)
self.momentum = momentum
if momentum > 0:
for p in self.params:
self.state[id(p)]["velocity"] = np.zeros_like(p.data)
def step(self) -> None:
for p in self.params:
if p.grad is None:
continue # parameter wasn't touched by this backward pass
if self.momentum > 0:
v = self.state[id(p)]["velocity"]
v *= self.momentum
v += p.grad
p.data -= self.lr * v
else:
p.data -= self.lr * p.grad
Twenty lines. Key points:
- Velocity stored in
self.state, not on the Parameter. Keeps Parameters clean (justdataandgrad). - In-place updates to
p.dataandvare fine. They happen outside the autograd graph (between forward passes). The DAG is recomputed every iteration. if p.grad is None: continue. Some parameters may not appear in the loss for a particular batch (rare in small MLPs; common in transformer mixture-of-experts setups). Phase 9 skips them; Phase 18 may warn.
Why this update rule¶
For a loss L(θ), the gradient ∇L is the direction of steepest ascent. We want to descend, so we take θ ← θ - η · ∇L with learning rate η.
Momentum adds a smoothing of the gradient over time:
Equivalent to: instead of stepping in the direction of the current gradient, step in the direction of an exponentially-weighted moving average. Smooths over noisy minibatch gradients; accelerates convergence in low-curvature directions.
Phase 4 derived all of this. Phase 9 implements it.
Adam: adaptive moment estimation¶
The full formula (Phase 4 derived; here we just code it):
m_t = β₁ · m_{t-1} + (1 - β₁) · g_t # first moment (running mean of gradients)
v_t = β₂ · v_{t-1} + (1 - β₂) · g_t² # second moment (running variance, element-wise)
m̂_t = m_t / (1 - β₁^t) # bias correction
v̂_t = v_t / (1 - β₂^t) # bias correction
θ_t = θ_{t-1} - η · m̂_t / (√v̂_t + ε)
In code:
class Adam(Optimizer):
def __init__(
self,
params,
lr: float = 1e-3,
betas: tuple[float, float] = (0.9, 0.999),
eps: float = 1e-8,
) -> None:
super().__init__(params, lr)
self.beta1, self.beta2 = betas
self.eps = eps
self.t = 0 # global step counter
for p in self.params:
self.state[id(p)]["m"] = np.zeros_like(p.data)
self.state[id(p)]["v"] = np.zeros_like(p.data)
def step(self) -> None:
self.t += 1
for p in self.params:
if p.grad is None:
continue
s = self.state[id(p)]
g = p.grad
s["m"] = self.beta1 * s["m"] + (1 - self.beta1) * g
s["v"] = self.beta2 * s["v"] + (1 - self.beta2) * g * g
m_hat = s["m"] / (1 - self.beta1 ** self.t)
v_hat = s["v"] / (1 - self.beta2 ** self.t)
p.data -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
Thirty lines. Notes:
betas = (0.9, 0.999)andeps = 1e-8are PyTorch defaults. Match them so the port drill is clean.- Global step counter
self.t. Increments once perstep()call. PyTorch keepstper parameter; identical behavior in practice since all params see the same number of steps. - Bias correction matters most early. At step
t=1,1 - β₁^1 = 0.1, som̂ = m / 0.1 = 10 · m. Without correction, the update would be too small. After ~50 steps, the correction is negligible.
Why bias correction?¶
Initialize m_0 = 0. After one step: m_1 = (1 - β₁) · g_1. This is biased toward zero — the true running mean should be g_1, but we have 0.1 · g_1. Dividing by 1 - β₁^1 = 0.1 restores the unbiased estimate.
The geometric series argument: m_t = (1 - β₁) · Σ_{k=1..t} β₁^{t-k} · g_k. Sum of weights = (1 - β₁) · (1 - β₁^t) / (1 - β₁) = 1 - β₁^t. So m_t / (1 - β₁^t) is the weighted average with weights summing to 1 — the unbiased estimate of the running mean.
This is the derivation Borja must reproduce in /quiz 09.
Why √v̂ in the denominator?¶
Adam's denominator √v̂_t is an estimate of the gradient's RMS magnitude per coordinate. Dividing the update by it gives per-coordinate adaptive step sizes: coordinates with consistently large gradients get smaller effective learning rates; coordinates with small gradients get larger ones. The intuition: every coordinate sees the same effective relative step.
The + ε prevents division by zero (Phase 2's catastrophic-cancellation lesson: never divide by something you haven't bounded).
Cross-check against PyTorch¶
Adam's correctness is hard to spot-check by eye — small numerical differences compound. The DoD requires Adam matches torch.optim.Adam to 1e-5 over 100 steps on a quadratic.
def test_adam_matches_pytorch():
rng = np.random.default_rng(0)
init = rng.standard_normal(5).astype(np.float64)
target = rng.standard_normal(5).astype(np.float64)
# Ours
p = Parameter(init.copy())
optim = Adam([p], lr=1e-2)
for _ in range(100):
optim.zero_grad()
loss = ((p - Tensor(target)) ** 2).sum()
loss.backward()
optim.step()
# PyTorch
pt = torch.tensor(init, dtype=torch.float64, requires_grad=True)
opt_t = torch.optim.Adam([pt], lr=1e-2)
target_t = torch.tensor(target, dtype=torch.float64)
for _ in range(100):
opt_t.zero_grad()
loss_t = ((pt - target_t) ** 2).sum()
loss_t.backward()
opt_t.step()
np.testing.assert_allclose(p.data, pt.detach().numpy(), atol=1e-5)
If this test passes, Adam is correct.
Common pitfalls¶
- Reusing a stale
paramslist. Add a layer to the model after creating the optimizer → the new layer's parameters don't get updated. Always create the optimizer after the model is fully constructed. Document this conspicuously. - Forgetting
optim.zero_grad(). Gradients accumulate (Phase 8 + Phase 9 are accumulator-style by default). Without zero_grad, every step uses the sum of all past gradients — loss diverges immediately. Lab 02 reproduces this bug deliberately. - Calling
optim.step()before any backward. Nop.gradexists yet. Theif p.grad is None: continueguard prevents a crash, but it's a sign of a logic error in the training loop. - Adam's
tcounter not incremented. If you forgetself.t += 1, bias correction is always fort=0, giving1 - β^0 = 0, division by zero. Catch with the cross-check test. - Mixing FP64 parameters with FP32 gradients. Possible because
Tensordoesn't enforce dtype consistency. Phase 9 is FP64 throughout for cleanliness; Phase 26 (mixed precision) revisits.
Performance notes¶
For a 469-parameter MLP, Adam's overhead is:
- 2 extra arrays per parameter (m, v): 2 · 469 · 8 bytes = 7.5 KB.
- Per step: ~5 arithmetic ops per parameter element. ~2.5K ops. Microseconds.
Negligible. For Phase 17's transformer (~10M params), the same calculation: 80 MB extra state, ~50M ops/step. Still negligible compared to the matmul. Adam is essentially free as long as the model fits in memory.
PyTorch's Adam is functionally identical to ours but written in CUDA kernels. Phase 25's Adam exercise compares them.
Pitfalls (will bite in lab)¶
- Off-by-one in
t. Increment before computing1 - β^t, not after. - Beta defaults.
(0.9, 0.999)not(0.999, 0.9). Easy to swap. - Eps placement.
(np.sqrt(v_hat) + eps), notnp.sqrt(v_hat + eps). The former is more stable (avoidssqrt(0)exactly); both are used in literature; PyTorch uses the former; match it. - Reusing the same
Adaminstance after replacing parameters. The state dict keys (ids) become stale. Don't do this; create a new optimizer. - Adam with very small
lr(1e-6) + bias correction: the effective first step is still large due to the1 / (1 - β₁) ≈ 10correction. Lab will exhibit this.
Topic anchor (§A13)¶
Phase 9's training of TenseMLP uses Adam(lr=1e-2) (default betas=(0.9, 0.999), eps=1e-8). Convergence: <30 epochs to >90% train accuracy on the 250-triple training set; >85% on the 50-triple held-out validation set. Lab 03 reports the loss curve and confusion matrix.
The MLP's parameter count (469) is small enough that you can also try SGD(lr=0.5) (high lr because few params, well-conditioned loss) and see comparable convergence — useful for the "SGD vs Adam" diagnostic plot.
What this page does NOT cover¶
- Weight decay / L2 regularization. Phase 10 (AdamW vs Adam-with-L2 — they differ in implementation).
- Learning rate scheduling. Phase 18 (warmup, cosine decay).
- Gradient clipping. Phase 18.
- Optimizer state checkpointing. Phase 18.
- Sharded optimizers (ZeRO). Phase 35 (distributed training).
One-paragraph recap¶
SGD is a 5-line update: p.data -= lr * p.grad, optionally with momentum (v ← μv + g; p ← p - lr·v). Adam is the same loop with per-parameter running estimates of the first moment (m) and second moment (v), bias-corrected by 1 - β^t, and the update normalized by √v̂. Both fit in a step() method following the Optimizer(params, lr) interface. State is stored in optim.state[id(p)]. The DoD requires Adam matches PyTorch to 1e-5 over 100 steps — this catches every common bug (wrong beta order, missing bias correction, wrong eps placement) at once.
Next: lab/00-parameter-and-module-skeleton.md.