Break — Remove the momentum term from SGD

Remove the momentum (β = 0) from an SGD optimizer that was converging in a narrow valley. Watch the loss zig-zag and converge more slowly. The loss plot is the best teacher.

Target: any SGD-with-momentum implementation. We use the Rosenbrock optimizer lab (lab/02-optimizers-on-rosenbrock.md) as the laboratory because its anisotropic loss makes the effect dramatic.

Hypothesis

The learner predicts: "Setting β = 0 (no momentum) in SGD on the Rosenbrock function will (a) require many more steps to reach the same loss, and (b) trace a visibly zig-zagging trajectory across the curved valley." The loss curve will still decrease — it's not catastrophic — but convergence will be 5–20× slower.

The break

In your SGD-with-momentum optimizer:

 class SGDMomentum:
-    def __init__(self, params, lr=0.001, beta=0.9):
+    def __init__(self, params, lr=0.001, beta=0.0):    # /break: momentum off
         self.params = params
         self.lr = lr
         self.beta = beta
         self.v = [np.zeros_like(p) for p in params]

     def step(self, grads):
         for i, (p, g) in enumerate(zip(self.params, grads)):
             self.v[i] = self.beta * self.v[i] + g
             p -= self.lr * self.v[i]

(With beta = 0, the velocity collapses to v = g — i.e. plain SGD.)
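A quick way to check the note above is to run the broken optimizer next to a hand-written plain-SGD update and compare the results. This is a minimal sketch: it restates the class from the diff so the block runs on its own, and the parameter values, gradients and `lr=0.1` are made up purely for the check.

```python
import numpy as np

class SGDMomentum:
    """Same update rule as in the diff, restated so this block is self-contained."""
    def __init__(self, params, lr=0.001, beta=0.0):
        self.params = params
        self.lr = lr
        self.beta = beta
        self.v = [np.zeros_like(p) for p in params]

    def step(self, grads):
        for i, (p, g) in enumerate(zip(self.params, grads)):
            self.v[i] = self.beta * self.v[i] + g
            p -= self.lr * self.v[i]  # in-place update of the caller's array

# Hypothetical parameters and gradients, chosen only for illustration.
p_broken = np.array([1.0, -2.0])
p_plain = p_broken.copy()
grads = [np.array([0.5, 0.25]), np.array([-0.1, 0.3])]

opt = SGDMomentum([p_broken], lr=0.1, beta=0.0)
for g in grads:
    opt.step([g])
    p_plain -= 0.1 * g  # plain SGD: p <- p - lr * g

same = np.allclose(p_broken, p_plain)
print(same)
```

If the two arrays ever differ with `beta=0.0`, something other than the momentum term has changed in the optimizer.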

Run procedure

uv run python -c "
import numpy as np

def rosenbrock(x):
    return (1 - x[0])**2 + 100*(x[1] - x[0]**2)**2

def grad_rosenbrock(x):
    dx = -2*(1 - x[0]) - 400*x[0]*(x[1] - x[0]**2)
    dy = 200*(x[1] - x[0]**2)
    return np.array([dx, dy])

for beta in (0.0, 0.9):
    x = np.array([-1.5, 1.5])
    lr = 1e-3
    v = np.zeros_like(x)
    for step in range(5000):
        g = grad_rosenbrock(x)
        v = beta * v + g
        x = x - lr * v
        if step in (0, 100, 1000, 4999):
            print(f'beta={beta:.1f}  step={step:5d}  x={x}  loss={rosenbrock(x):.4f}')
    print()
"

Expected failure mode

With β = 0.9 (correct):

beta=0.9  step=    0  x=[-1.045   1.65  ]  loss= 35.3
beta=0.9  step=  100  x=[ 0.34    0.10  ]  loss= 0.44
beta=0.9  step= 1000  x=[ 0.91    0.83  ]  loss= 0.008
beta=0.9  step= 4999  x=[ 0.999   0.999 ]  loss= 1e-6

With β = 0 (broken):

beta=0.0  step=    0  x=[-1.045   1.65  ]  loss= 35.3    <-- first step is plain SGD for any β (v starts at zero)
beta=0.0  step=  100  x=[-0.78    0.69  ]  loss= 3.2     <-- still at x < 0
beta=0.0  step= 1000  x=[ 0.13    0.02  ]  loss= 0.76    <-- roughly 100× higher loss than β=0.9
beta=0.0  step= 4999  x=[ 0.65    0.42  ]  loss= 0.12    <-- still not converged

Quantitative signature: at step 1000, β=0.9 reaches loss < 0.01; β=0 is still above 0.5. That is a gap of more than 50× in loss at the same step count.
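The signature above compares losses at a fixed step. To express the gap as a speed ratio instead, one option is to count the updates each run needs to get below a chosen loss. The sketch below reuses the functions and starting point from the run procedure; `steps_to_loss`, its `threshold` argument and the `max_steps=20000` cap are hypothetical names and values introduced here, not part of the lab.

```python
import numpy as np

def rosenbrock(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def grad_rosenbrock(x):
    dx = -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2)
    dy = 200 * (x[1] - x[0]**2)
    return np.array([dx, dy])

def steps_to_loss(beta, threshold, lr=1e-3, max_steps=20000):
    """Return how many updates it takes to get below `threshold`, or None."""
    x = np.array([-1.5, 1.5])
    v = np.zeros_like(x)
    for step in range(1, max_steps + 1):
        v = beta * v + grad_rosenbrock(x)
        x = x - lr * v
        if rosenbrock(x) < threshold:
            return step
    return None  # threshold not reached within max_steps

if __name__ == "__main__":
    for beta in (0.0, 0.9):
        print(f"beta={beta:.1f}  steps to loss < 0.01: {steps_to_loss(beta, 0.01)}")
```

Dividing the two step counts gives a slowdown factor that can be compared directly with the range predicted in the hypothesis.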

Diagnostic

From logs alone:

  1. Plot x[0] over time for both runs. β=0 traces a sawtooth pattern (zig-zag across the parabolic valley); β=0.9 traces a smooth curve. If you see oscillation in successive steps' coordinates, you have no (or too little) momentum.
  2. Plot the gradient norm. With β=0 the per-step gradient is large, but its cross-valley component flips sign from one step to the next, so successive updates largely cancel. With momentum, the accumulated velocity keeps pointing in the same direction along the valley.
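If plotting is inconvenient, the sawtooth check from step 1 can be turned into a number by counting how often a coordinate reverses direction. This is a minimal sketch; `count_sign_flips` is a hypothetical helper, and the two input series are synthetic stand-ins for a logged `x[0]` trace.

```python
def count_sign_flips(series):
    """Count how often consecutive differences of `series` change sign."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    flips = 0
    for d_prev, d_next in zip(diffs, diffs[1:]):
        if d_prev * d_next < 0:  # direction reversed between two steps
            flips += 1
    return flips

# Synthetic traces, for illustration only.
sawtooth = [0.0, 1.0, 0.0, 1.0, 0.0]
smooth = [0.0, 1.0, 2.0, 3.0, 4.0]

print(count_sign_flips(sawtooth))  # 3
print(count_sign_flips(smooth))    # 0
```

Applied to the `x[0]` values logged from each run, a flip count close to the number of steps points to a zig-zagging trajectory, while a count near zero points to a smooth one.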

Lesson

In an anisotropic loss surface, the gradient has two components: one across the valley (loud, alternating in sign from step to step) and one along the valley (quiet, consistent in sign). Plain SGD applies the same learning rate to both, so it zig-zags. Momentum largely cancels the alternating component and reinforces the consistent one.
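The filtering effect can be seen in isolation by feeding the velocity update two synthetic gradient streams, one that flips sign every step and one that stays constant. This sketch does not use Rosenbrock; the gradient magnitudes (1.0 and 0.1) and the 500-step horizon are assumptions chosen so that the velocities settle.

```python
beta = 0.9
v_across = 0.0  # velocity for the component that flips sign every step
v_along = 0.0   # velocity for the component that keeps its sign

for k in range(500):
    g_across = 1.0 if k % 2 == 0 else -1.0  # synthetic, alternating gradient
    g_along = 0.1                            # synthetic, constant gradient
    v_across = beta * v_across + g_across
    v_along = beta * v_along + g_along

# Steady-state magnitudes of the geometric sums behind v = beta * v + g.
print(abs(v_across))  # approaches 1.0 / (1 + beta), about 0.526
print(v_along)        # approaches 0.1 / (1 - beta), about 1.0
```

Under these assumptions the alternating gradient is ten times larger than the constant one, yet it ends up with roughly half the velocity: the update scales an alternating input by 1/(1 + β) and a constant input by 1/(1 − β).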

The fix is one line: restore β = 0.9, the canonical default, so the update is v = β v + g again. This is the cheapest single improvement in the optimizer hierarchy and it helps explain why many modern optimizers (Adam, AdamW, Lion) keep a velocity term in some form.

References

  • Polyak, Some methods of speeding up the convergence of iteration methods, USSR Computational Mathematics, 1964 (the original heavy-ball method).
  • Sutskever et al., On the importance of initialization and momentum in deep learning, ICML 2013 — empirical confirmation on deep nets.