Lab 02 — SGD (with momentum) and Adam

Goal: implement two optimizers from scratch — SGD with optional momentum, and Adam with bias correction — and cross-check Adam's trajectory against torch.optim.Adam on a toy quadratic. ~70 LOC by Borja.

Estimated time: 120–150 minutes.

Prereqs: Lab 00 + Lab 01 closed. Theory 03 read.


The math was derived in Phase 4; here we only embody it in step() and zero_grad(). The only technical part is Adam's bias correction, which comes straight from the finite geometric series 1 + β + β² + ... + β^(t-1). The cross-check against torch.optim.Adam tells you whether your formula is off by one, a very easy mistake to make with the counter t.

What you produce

  • src/minimodel/optim.py: Optimizer base class, SGD, Adam.
  • tests/test_optimizers.py: convergence tests and the PyTorch cross-check.

Math reference (paste into your journal first)

SGD (no momentum): θ ← θ - η · g.

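SGD with momentum (the dampened variant of PyTorch's update, which is what we implement):

v ← β · v + (1 - β) · g            # exponential moving average of g
θ ← θ - η · v
PyTorch's general update is v ← β · v + (1 - dampening) · g, so the "dampened" form above corresponds to dampening = β. With dampening=0 (PyTorch's default), the update is v ← β · v + g; θ ← θ - η · v. PyTorch also seeds v with the first gradient instead of zeros, so do not expect step-for-step agreement with torch.optim.SGD. Use the dampened form ((1-β)·g) so the moving-average interpretation is clean, and document the divergence from PyTorch's default.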
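
To see what the (1 - β) factor changes in practice, the sketch below feeds a constant gradient into both velocity recurrences. The helper name momentum_velocity is illustrative and not part of the lab's API. Under a constant gradient g, the dampened velocity settles near g, while the undampened one settles near g / (1 - β).

```python
def momentum_velocity(beta: float, grad: float, steps: int, dampened: bool) -> float:
    """Run `steps` momentum updates on a constant gradient; return the final velocity."""
    v = 0.0
    # Dampened form scales the incoming gradient by (1 - beta); undampened uses it as-is.
    scale = (1.0 - beta) if dampened else 1.0
    for _ in range(steps):
        v = beta * v + scale * grad
    return v


dampened_v = momentum_velocity(beta=0.9, grad=2.0, steps=200, dampened=True)
undampened_v = momentum_velocity(beta=0.9, grad=2.0, steps=200, dampened=False)
print(dampened_v, undampened_v)
```

With β = 0.9 the undampened form ends up taking steps roughly ten times larger for the same lr, which is why a learning rate tuned for one convention does not transfer to the other.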

Adam:

t ← t + 1                                       # 1-indexed; the FIRST update sets t=1
m ← β₁ · m + (1 - β₁) · g
v ← β₂ · v + (1 - β₂) · g²
m̂ ← m / (1 - β₁^t)                              # bias correction
v̂ ← v / (1 - β₂^t)
θ ← θ - η · m̂ / (√v̂ + ε)

Defaults: β₁ = 0.9, β₂ = 0.999, ε = 1e-8, lr = 1e-3.
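
The bias-correction denominators can be checked numerically. With m initialized to zero, after t updates the moving average has put weight (1 - β)·β^i on the gradient seen i steps ago, and those weights sum to 1 - β^t rather than to 1. A minimal sketch of that check (ema_weight_sum is a hypothetical helper, not lab code):

```python
def ema_weight_sum(beta: float, t: int) -> float:
    """Total weight a zero-initialized EMA assigns to the t gradients seen so far."""
    return sum((1.0 - beta) * beta**i for i in range(t))


for t in (1, 2, 10, 100):
    # The two printed columns should agree up to floating-point rounding.
    print(t, ema_weight_sum(0.9, t), 1.0 - 0.9**t)
```

Dividing m by 1 - β₁^t therefore rescales the weights so they sum to 1, which is what makes m̂ an average of the observed gradients even in the first few steps.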

AdamW is NOT this lab. The weight-decay-vs-L2 distinction is mentioned here so you know about it; Phase 10 implements it after deriving why L2 inside the gradient interacts badly with the Adam preconditioning.

TODOs

Block A — Optimizer base class

In src/minimodel/optim.py:

from typing import Any, Iterable
import numpy as np
from minimodel.nn.module import Parameter


class Optimizer:
    """Base class. Materializes params as a list; per-parameter state keyed by id(p)."""

    def __init__(self, params: Iterable[Parameter], lr: float) -> None:
        # TODO: self.params = list(params). NOTE: iterables exhaust, so materialize once.
        # TODO: self.lr = lr.
        # TODO: self.state: dict[int, dict[str, Any]] = {id(p): {} for p in self.params}
        #       (values are Any because Adam stores arrays "m", "v" and an int counter "t").
        raise NotImplementedError

    def step(self) -> None:
        raise NotImplementedError

    def zero_grad(self) -> None:
        # TODO: for p in self.params: p.grad = None.
        raise NotImplementedError
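
The "iterables exhaust" note matters when params arrives as a generator. Whether minimodel's parameter accessor is a generator is an assumption here; make_params below is a hypothetical stand-in that yields each item once. The sketch shows why the constructor should call list(params) a single time and keep the result.

```python
from typing import Iterator


def make_params() -> Iterator[str]:
    """Hypothetical stand-in for a parameters() generator."""
    yield from ["w", "b"]


gen = make_params()
first_pass = [p for p in gen]    # consumes the generator
second_pass = [p for p in gen]   # nothing left to iterate over

materialized = list(make_params())  # safe to loop over as often as needed
print(first_pass, second_pass, materialized)
```

If the optimizer kept the raw generator, the state dictionary might be built correctly while every later step() silently looped over nothing.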

Block B — SGD

class SGD(Optimizer):
    def __init__(
        self,
        params: Iterable[Parameter],
        lr: float,
        momentum: float = 0.0,
    ) -> None:
        super().__init__(params, lr)
        self.momentum = momentum
        if momentum > 0:
            # TODO: for each param, initialize state["velocity"] = np.zeros_like(p.data).
            raise NotImplementedError

    def step(self) -> None:
        # TODO: for each param p:
        #   if p.grad is None: continue
        #   g = p.grad
        #   if self.momentum > 0:
        #       v = self.state[id(p)]["velocity"]
        #       v = self.momentum * v + (1 - self.momentum) * g     # dampened form
        #       self.state[id(p)]["velocity"] = v
        #       update = v
        #   else:
        #       update = g
        #   p.data = p.data - self.lr * update
        raise NotImplementedError
  • p.data = p.data - self.lr * update is an out-of-place op. Do NOT use p.data -= ...: in-place mutation can confuse autograd if any tensor view still references the old buffer.
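
The difference between the two spellings can be seen with plain NumPy arrays, independent of any autograd machinery. This is only a sketch of the aliasing behavior; the variable names are illustrative.

```python
import numpy as np

step = 0.1 * np.ones(2)

# Out-of-place: builds a new array and rebinds the name; the old buffer is untouched.
data = np.array([1.0, 2.0])
old_view = data
data = data - step

# In-place: mutates the shared buffer, so every alias observes the change.
data_inplace = np.array([1.0, 2.0])
alias = data_inplace
data_inplace -= step

print(old_view, data)        # old values survive next to the updated ones
print(alias, data_inplace)   # both names show the updated values
```

Anything that cached the old buffer during the forward pass keeps consistent values in the out-of-place case, which is the property the rule above protects.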

Block C — Adam

class Adam(Optimizer):
    def __init__(
        self,
        params: Iterable[Parameter],
        lr: float = 1e-3,
        betas: tuple[float, float] = (0.9, 0.999),
        eps: float = 1e-8,
    ) -> None:
        super().__init__(params, lr)
        self.beta1, self.beta2 = betas
        self.eps = eps
        # TODO: for each param, initialize state["m"] = zeros_like, state["v"] = zeros_like,
        #       state["t"] = 0 (the per-parameter step counter).
        raise NotImplementedError

    def step(self) -> None:
        # TODO: for each param p:
        #   if p.grad is None: continue
        #   st = self.state[id(p)]
        #   st["t"] += 1                                   # 1-indexed; FIRST step is t=1
        #   t = st["t"]
        #   g = p.grad
        #   st["m"] = self.beta1 * st["m"] + (1 - self.beta1) * g
        #   st["v"] = self.beta2 * st["v"] + (1 - self.beta2) * (g * g)
        #   m_hat = st["m"] / (1 - self.beta1 ** t)
        #   v_hat = st["v"] / (1 - self.beta2 ** t)
        #   p.data = p.data - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
        raise NotImplementedError
  • Off-by-one trap. The very first call to step() must use t=1, not t=0. With t=0, 1 - β^0 = 0 and you divide by zero.
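
A small NumPy sketch of the trap, using the moments after one update from zero-initialized state. The function adam_direction is a hypothetical helper that only computes the bias-corrected direction; it is not the lab's Adam class. Warnings are suppressed so the division by zero is visible as a value instead of a log line.

```python
import numpy as np

beta1, beta2, eps = 0.9, 0.999, 1e-8
g = np.array([2.0])
m = (1 - beta1) * g        # first moment after one update from m = 0
v = (1 - beta2) * g * g    # second moment after one update from v = 0


def adam_direction(m: np.ndarray, v: np.ndarray, t: int) -> np.ndarray:
    """Return m_hat / (sqrt(v_hat) + eps) for step counter t."""
    with np.errstate(divide="ignore", invalid="ignore"):
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        return m_hat / (np.sqrt(v_hat) + eps)


correct = adam_direction(m, v, t=1)  # close to sign(g)
buggy = adam_direction(m, v, t=0)    # both corrections divide by zero
print(correct, buggy)
```

With t=1 the direction is close to sign(g), so the first parameter change has magnitude close to lr. With t=0 both corrected moments become infinite and their ratio is nan.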

Tests

In tests/test_optimizers.py:

Block D — SGD convergence

  • test_sgd_reduces_quadratic: Optimize f(θ) = (θ - 3)² from θ₀ = 0 with lr=0.1 for 100 steps. Assert |θ - 3| < 1e-3 at the end and that the loss is monotonically non-increasing.

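  • test_sgd_momentum_faster_than_plain: Same quadratic. Compare plain SGD (momentum=0) to SGD with momentum=0.9. After 30 steps, momentum-SGD must be closer to the optimum (strict inequality). Choose lr with care: at lr=0.1 plain SGD shrinks the error by a factor of 0.8 per step, while a momentum of 0.9 cannot shrink it faster than about √0.9 ≈ 0.95 per step in the long run, so the inequality does not hold at that setting. This is a regression test on the implementation, not a deep theorem.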

  • test_sgd_zero_grad_resets: Manually set a parameter's grad, call zero_grad, assert grad is None.
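
Before fixing the hyperparameters for these tests, it can help to simulate the scalar problem directly. The sketch below is standalone Python rather than the lab's SGD class; run_sgd is a hypothetical helper that applies the dampened-momentum update from the math reference to f(θ) = (θ - 3)² and returns the final distance to the optimum.

```python
def run_sgd(lr: float, momentum: float, steps: int, theta0: float = 0.0) -> float:
    """Minimize (theta - 3)^2 with plain or dampened-momentum SGD; return |theta - 3|."""
    theta, v = theta0, 0.0
    for _ in range(steps):
        g = 2.0 * (theta - 3.0)          # gradient of the quadratic
        if momentum > 0:
            v = momentum * v + (1.0 - momentum) * g
            update = v
        else:
            update = g
        theta = theta - lr * update
    return abs(theta - 3.0)


for lr in (0.1, 0.01):
    print(lr, run_sgd(lr, 0.0, 30), run_sgd(lr, 0.9, 30))
```

On this problem the simulation suggests that at lr=0.1 plain SGD ends closer to 3 than the momentum run after 30 steps, so the learning rate and step count for the momentum comparison should be picked from a simulation like this one and not assumed.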

Block E — Adam convergence

  • test_adam_reduces_quadratic: Same f(θ) = (θ - 3)² from θ₀ = 0. With lr=0.1 and default betas, Adam reaches |θ - 3| < 1e-2 within 200 steps.

  • test_adam_first_step_uses_t_eq_1: Sanity-check the bias correction: with lr=0.1, after ONE step on f(θ) = θ² with θ₀=1, the update magnitude is ≈ lr (because |m̂| and √v̂ are both ≈ |g| after bias correction). Assert |θ - (1 - 0.1)| < 1e-6. With t=0 (the off-by-one bug), this test produces nan or wildly wrong values.

Block F — PyTorch cross-check

Constraint: PyTorch enters this test file only. It does NOT enter src/minimodel/. The cross-check exists to verify our formulas, not to import the framework into the library.

  • test_adam_matches_torch_on_quadratic:
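    import numpy as np
    import torch  # test fixture only
    from minimodel.nn.module import Parameter
    from minimodel.optim import Adam  # module path follows src/minimodel/optim.py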
    
    np.random.seed(0)
    torch.manual_seed(0)
    
    # Toy quadratic: minimize ||θ - target||² over a 5-dim θ.
    target_np = np.random.randn(5)
    
    # OUR Adam.
    theta = Parameter(np.zeros(5))
    our_opt = Adam([theta], lr=1e-2)
    for _ in range(100):
        our_opt.zero_grad()
        # loss = sum((theta - target)²). Use minitorch ops so .backward() populates theta.grad.
        # TODO: build the loss tensor, call loss.backward(), then our_opt.step().
        ...
    
    # PyTorch Adam.
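    # float64 so the reference runs at the same precision as the NumPy implementation.
    theta_t = torch.zeros(5, dtype=torch.float64, requires_grad=True)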
    target_t = torch.tensor(target_np)
    torch_opt = torch.optim.Adam([theta_t], lr=1e-2)
    for _ in range(100):
        torch_opt.zero_grad()
        loss_t = ((theta_t - target_t) ** 2).sum()
        loss_t.backward()
        torch_opt.step()
    
    assert np.allclose(theta.data, theta_t.detach().numpy(), atol=1e-5)
    
    Tolerance 1e-5 over 100 steps. If the mismatch is around 1e-2, suspect the off-by-one in t. If it is small and only shows up after many steps, suspect a missing eps or a wrong default β.

Block G — Edge cases

  • test_optimizer_with_none_grad_skips: A parameter with grad = None must not raise; step() simply skips it.

  • test_optimizer_state_isolated_per_parameter: Two parameters of identical shape must NOT share state["m"] or state["velocity"]. Verify by id(...) and by mutating one and checking the other.

  • test_optimizer_does_not_track_late_added_params: Construct an Adam([p1]), then create p2 and ensure opt.state has no key for id(p2). Document: parameters added after the optimizer is built are not tracked. PyTorch has the same behavior.
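
The isolation and late-parameter checks can be prototyped on the state dictionary alone. FakeParam below is a hypothetical stand-in for the lab's Parameter (the assumption is only that it carries .data and .grad); the real tests would go through the optimizer classes instead.

```python
import numpy as np


class FakeParam:
    """Hypothetical stand-in for Parameter: just holds data and grad."""

    def __init__(self, data: np.ndarray) -> None:
        self.data = data
        self.grad = None


p1, p2 = FakeParam(np.zeros(3)), FakeParam(np.zeros(3))

# One fresh buffer per parameter, keyed by object identity.
state = {id(p): {"m": np.zeros_like(p.data)} for p in (p1, p2)}

state[id(p1)]["m"] += 1.0        # mutate p1's buffer only
p3 = FakeParam(np.zeros(3))      # created after the state was built

print(state[id(p2)]["m"], id(p3) in state)
```

A shared buffer bug usually comes from creating a single zeros array and assigning it to every entry; building the array inside the comprehension, as above, avoids it.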

Constraints

  • No PyTorch in src/minimodel/. PyTorch is allowed in tests/test_optimizers.py only, as a reference oracle.
  • No AdamW, no learning-rate schedulers, no gradient clipping. Phase 10 + Phase 18.
  • SGD does not implement Nesterov momentum. Mention it in a comment; don't implement.
  • A13 scope: optimizers are domain-agnostic; no verb-specific code in this lab.

Pitfalls

  • Off-by-one in t. The first step() must use t=1. Test it (Block E).
  • AdamW vs Adam + L2 weight decay. Adam with L2 decay folds λθ into the gradient before the preconditioning, which means the effective decay is not constant. AdamW applies θ ← θ - lr·λ·θ outside the preconditioning. We do not implement either weight-decay form in this lab — just know they're different. Phase 10 closes the gap.
  • Shared Parameter aliasing. Two model attributes pointing at the same Parameter (Lab 00, Block F) end up in self.params twice. The optimizer updates the parameter twice per step. This is wrong for tied embeddings (Phase 17 will deduplicate). For now: document, do not fix.
  • state keyed by id(p) vs by index. id keys survive when self.params is reordered (it isn't, but the invariant matters); index keys break. Stick with id.
  • np.sqrt of a zero or near-zero v̂. The + eps is added after the square root, in the denominator: lr · m̂ / (√v̂ + ε), not lr · m̂ / √(v̂ + ε). The PyTorch reference computes it this way; matching it to 1e-5 requires matching the placement of eps.
  • p.grad is set to None by zero_grad, not zeroed. PyTorch convention. Test it.
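
To get a feel for how much the eps placement matters, the sketch below compares the two denominators for a zero and a unit second moment. Both function names are illustrative; only the first form matches the update rule in the math reference.

```python
import numpy as np

EPS = 1e-8


def denom_eps_outside(v_hat: np.ndarray) -> np.ndarray:
    """sqrt(v_hat) + eps, the placement used in this lab's Adam formula."""
    return np.sqrt(v_hat) + EPS


def denom_eps_inside(v_hat: np.ndarray) -> np.ndarray:
    """sqrt(v_hat + eps), a common misplacement."""
    return np.sqrt(v_hat + EPS)


v_hat = np.array([0.0, 1.0])
outside = denom_eps_outside(v_hat)
inside = denom_eps_inside(v_hat)
print(outside, inside)
```

For v̂ near 1 the two forms are nearly identical, but for v̂ near 0 they differ by about four orders of magnitude, which is enough to break a 1e-5 cross-check on parameters with small gradients.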

Stop conditions

Done when:

  1. SGD is ≤ 30 lines, Adam is ≤ 40 lines.
  2. All tests in Blocks D–G green, including the PyTorch cross-check at 1e-5.
  3. mypy --strict src/minimodel/optim.py clean.
  4. ruff check src/minimodel/optim.py clean.
  5. You can derive the bias-correction factor 1 - β^t from the geometric series (one line in your journal).
  6. You can explain in one sentence why AdamW is not the same as Adam + L2.

When to consult solutions/

After all tests pass. solutions/02-optimizers-ref.md (at phase open) walks through the alternative formulations (PyTorch's non-dampened SGD, Nesterov momentum, AdamW) and where they would slot in.


Next lab: lab/03-train-tense-mlp.md.