Skip to content

English · Español

03 — Optimizers: SGD, momentum, Adam

🇪🇸 Phase 4 derivó los optimizadores en abstracto, sobre un dict de arrays. Phase 9 los reescribe sobre Parameters — la única diferencia es que ahora el "estado" del optimizador (los momentos para Adam, las velocidades para momentum) vive en un dict indexado por id(parameter), no por nombre. La matemática es idéntica. Lo nuevo es la interfaz step() + zero_grad() que se inserta en el bucle de entrenamiento sin que tengas que pensar.


The Optimizer base class

class Optimizer:
    """Base class for all optimizers."""

    def __init__(self, params: Iterable[Parameter], lr: float) -> None:
        self.params: list[Parameter] = list(params)
        self.lr = lr
        # Per-parameter state, keyed by id(p). Subclasses populate this.
        self.state: dict[int, dict[str, np.ndarray]] = {id(p): {} for p in self.params}

    def step(self) -> None:
        raise NotImplementedError

    def zero_grad(self) -> None:
        for p in self.params:
            p.grad = None

Twenty lines. Decisions baked in:

  • params is materialized as a list once. Iterables exhaust; the list is the optimizer's record of what it owns. Document: parameters added to the model after the optimizer is created are NOT tracked.
  • State keyed by id(p). Identity-based. The Parameter object's lifecycle and the optimizer's lifecycle are tied — if a Parameter is replaced (rare, but possible), the optimizer's state for the old id becomes stale. PyTorch handles this; we accept the limitation.
  • zero_grad lives on the optimizer too. PyTorch convention. optim.zero_grad() and model.zero_grad() do the same thing (both walk params setting grad = None).
  • No state_dict() for the optimizer yet. Phase 18 (training loop) adds it for resumable training.

SGD: the simplest possible

Plain SGD:

class SGD(Optimizer):
    def __init__(self, params, lr: float, momentum: float = 0.0) -> None:
        super().__init__(params, lr)
        self.momentum = momentum
        if momentum > 0:
            for p in self.params:
                self.state[id(p)]["velocity"] = np.zeros_like(p.data)

    def step(self) -> None:
        for p in self.params:
            if p.grad is None:
                continue  # parameter wasn't touched by this backward pass
            if self.momentum > 0:
                v = self.state[id(p)]["velocity"]
                v *= self.momentum
                v += p.grad
                p.data -= self.lr * v
            else:
                p.data -= self.lr * p.grad

Twenty lines. Key points:

  • Velocity stored in self.state, not on the Parameter. Keeps Parameters clean (just data and grad).
  • In-place updates to p.data and v are fine. They happen outside the autograd graph (between forward passes). The DAG is recomputed every iteration.
  • if p.grad is None: continue. Some parameters may not appear in the loss for a particular batch (rare in small MLPs; common in transformer mixture-of-experts setups). Phase 9 skips them; Phase 18 may warn.

Why this update rule

For a loss L(θ), the gradient ∇L is the direction of steepest ascent. We want to descend, so we take θ ← θ - η · ∇L with learning rate η.

Momentum adds a smoothing of the gradient over time:

v_t = μ · v_{t-1} + ∇L
θ_t = θ_{t-1} - η · v_t

Equivalent to: instead of stepping in the direction of the current gradient, step in the direction of an exponentially-weighted moving average. Smooths over noisy minibatch gradients; accelerates convergence in low-curvature directions.

Phase 4 derived all of this. Phase 9 implements it.

Adam: adaptive moment estimation

The full formula (Phase 4 derived; here we just code it):

m_t = β₁ · m_{t-1} + (1 - β₁) · g_t           # first moment (running mean of gradients)
v_t = β₂ · v_{t-1} + (1 - β₂) · g_t²          # second moment (running variance, element-wise)
m̂_t = m_t / (1 - β₁^t)                        # bias correction
v̂_t = v_t / (1 - β₂^t)                        # bias correction
θ_t = θ_{t-1} - η · m̂_t / (√v̂_t + ε)

In code:

class Adam(Optimizer):
    def __init__(
        self,
        params,
        lr: float = 1e-3,
        betas: tuple[float, float] = (0.9, 0.999),
        eps: float = 1e-8,
    ) -> None:
        super().__init__(params, lr)
        self.beta1, self.beta2 = betas
        self.eps = eps
        self.t = 0  # global step counter
        for p in self.params:
            self.state[id(p)]["m"] = np.zeros_like(p.data)
            self.state[id(p)]["v"] = np.zeros_like(p.data)

    def step(self) -> None:
        self.t += 1
        for p in self.params:
            if p.grad is None:
                continue
            s = self.state[id(p)]
            g = p.grad
            s["m"] = self.beta1 * s["m"] + (1 - self.beta1) * g
            s["v"] = self.beta2 * s["v"] + (1 - self.beta2) * g * g
            m_hat = s["m"] / (1 - self.beta1 ** self.t)
            v_hat = s["v"] / (1 - self.beta2 ** self.t)
            p.data -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

Thirty lines. Notes:

  • betas = (0.9, 0.999) and eps = 1e-8 are PyTorch defaults. Match them so the port drill is clean.
  • Global step counter self.t. Increments once per step() call. PyTorch keeps t per parameter; identical behavior in practice since all params see the same number of steps.
  • Bias correction matters most early. At step t=1, 1 - β₁^1 = 0.1, so m̂ = m / 0.1 = 10 · m. Without correction, the update would be too small. After ~50 steps, the correction is negligible.

Why bias correction?

Initialize m_0 = 0. After one step: m_1 = (1 - β₁) · g_1. This is biased toward zero — the true running mean should be g_1, but we have 0.1 · g_1. Dividing by 1 - β₁^1 = 0.1 restores the unbiased estimate.

The geometric series argument: m_t = (1 - β₁) · Σ_{k=1..t} β₁^{t-k} · g_k. Sum of weights = (1 - β₁) · (1 - β₁^t) / (1 - β₁) = 1 - β₁^t. So m_t / (1 - β₁^t) is the weighted average with weights summing to 1 — the unbiased estimate of the running mean.

This is the derivation Borja must reproduce in /quiz 09.

Why √v̂ in the denominator?

Adam's denominator √v̂_t is an estimate of the gradient's RMS magnitude per coordinate. Dividing the update by it gives per-coordinate adaptive step sizes: coordinates with consistently large gradients get smaller effective learning rates; coordinates with small gradients get larger ones. The intuition: every coordinate sees the same effective relative step.

The + ε prevents division by zero (Phase 2's catastrophic-cancellation lesson: never divide by something you haven't bounded).

Cross-check against PyTorch

Adam's correctness is hard to spot-check by eye — small numerical differences compound. The DoD requires Adam matches torch.optim.Adam to 1e-5 over 100 steps on a quadratic.

def test_adam_matches_pytorch():
    rng = np.random.default_rng(0)
    init = rng.standard_normal(5).astype(np.float64)
    target = rng.standard_normal(5).astype(np.float64)

    # Ours
    p = Parameter(init.copy())
    optim = Adam([p], lr=1e-2)
    for _ in range(100):
        optim.zero_grad()
        loss = ((p - Tensor(target)) ** 2).sum()
        loss.backward()
        optim.step()

    # PyTorch
    pt = torch.tensor(init, dtype=torch.float64, requires_grad=True)
    opt_t = torch.optim.Adam([pt], lr=1e-2)
    target_t = torch.tensor(target, dtype=torch.float64)
    for _ in range(100):
        opt_t.zero_grad()
        loss_t = ((pt - target_t) ** 2).sum()
        loss_t.backward()
        opt_t.step()

    np.testing.assert_allclose(p.data, pt.detach().numpy(), atol=1e-5)

If this test passes, Adam is correct.

Common pitfalls

  1. Reusing a stale params list. Add a layer to the model after creating the optimizer → the new layer's parameters don't get updated. Always create the optimizer after the model is fully constructed. Document this conspicuously.
  2. Forgetting optim.zero_grad(). Gradients accumulate (Phase 8 + Phase 9 are accumulator-style by default). Without zero_grad, every step uses the sum of all past gradients — loss diverges immediately. Lab 02 reproduces this bug deliberately.
  3. Calling optim.step() before any backward. No p.grad exists yet. The if p.grad is None: continue guard prevents a crash, but it's a sign of a logic error in the training loop.
  4. Adam's t counter not incremented. If you forget self.t += 1, bias correction is always for t=0, giving 1 - β^0 = 0, division by zero. Catch with the cross-check test.
  5. Mixing FP64 parameters with FP32 gradients. Possible because Tensor doesn't enforce dtype consistency. Phase 9 is FP64 throughout for cleanliness; Phase 26 (mixed precision) revisits.

Performance notes

For a 469-parameter MLP, Adam's overhead is: - 2 extra arrays per parameter (m, v): 2 · 469 · 8 bytes = 7.5 KB. - Per step: ~5 arithmetic ops per parameter element. ~2.5K ops. Microseconds.

Negligible. For Phase 17's transformer (~10M params), the same calculation: 80 MB extra state, ~50M ops/step. Still negligible compared to the matmul. Adam is essentially free as long as the model fits in memory.

PyTorch's Adam is functionally identical to ours but written in CUDA kernels. Phase 25's Adam exercise compares them.

Pitfalls (will bite in lab)

  • Off-by-one in t. Increment before computing 1 - β^t, not after.
  • Beta defaults. (0.9, 0.999) not (0.999, 0.9). Easy to swap.
  • Eps placement. (np.sqrt(v_hat) + eps), not np.sqrt(v_hat + eps). The former is more stable (avoids sqrt(0) exactly); both are used in literature; PyTorch uses the former; match it.
  • Reusing the same Adam instance after replacing parameters. The state dict keys (ids) become stale. Don't do this; create a new optimizer.
  • Adam with very small lr (1e-6) + bias correction: the effective first step is still large due to the 1 / (1 - β₁) ≈ 10 correction. Lab will exhibit this.

Topic anchor (§A13)

Phase 9's training of TenseMLP uses Adam(lr=1e-2) (default betas=(0.9, 0.999), eps=1e-8). Convergence: <30 epochs to >90% train accuracy on the 250-triple training set; >85% on the 50-triple held-out validation set. Lab 03 reports the loss curve and confusion matrix.

The MLP's parameter count (469) is small enough that you can also try SGD(lr=0.5) (high lr because few params, well-conditioned loss) and see comparable convergence — useful for the "SGD vs Adam" diagnostic plot.

What this page does NOT cover

  • Weight decay / L2 regularization. Phase 10 (AdamW vs Adam-with-L2 — they differ in implementation).
  • Learning rate scheduling. Phase 18 (warmup, cosine decay).
  • Gradient clipping. Phase 18.
  • Optimizer state checkpointing. Phase 18.
  • Sharded optimizers (ZeRO). Phase 35 (distributed training).

One-paragraph recap

SGD is a 5-line update: p.data -= lr * p.grad, optionally with momentum (v ← μv + g; p ← p - lr·v). Adam is the same loop with per-parameter running estimates of the first moment (m) and second moment (v), bias-corrected by 1 - β^t, and the update normalized by √v̂. Both fit in a step() method following the Optimizer(params, lr) interface. State is stored in optim.state[id(p)]. The DoD requires Adam matches PyTorch to 1e-5 over 100 steps — this catches every common bug (wrong beta order, missing bias correction, wrong eps placement) at once.


Next: lab/00-parameter-and-module-skeleton.md.