00 — Why a calculus phase before any autograd

The core intuition: backprop is not magic; it is the chain rule applied to a graph of operations. Optimizers are gradient descent with memory. Phase 4 derives both by hand so that Phase 7 (autograd) is an obvious implementation, not a black box.


The lie textbooks tell

"Neural networks are trained with backpropagation, an efficient algorithm for computing gradients."

That's not wrong. It's also not what you need to know. The actual statements are:

  1. Backpropagation is just the chain rule applied to a graph of operations. That's a single sentence and it's complete. The "algorithm" is the bookkeeping: store intermediate values during the forward pass, then walk the graph in reverse to compose derivatives.
  2. Every gradient update in deep learning is weights ← weights - η ∇loss + (memory of previous updates). The variants — momentum, Adam, AdamW — differ only in what the "memory" looks like.

Until you can derive ∂ CE/∂x (the gradient of cross-entropy with respect to softmax logits) on a blank page, "training" remains mysterious. Until you can derive Adam from "first moment + second moment + bias correction," its hyperparameters are arbitrary numbers you copy from a paper. Phase 4 fixes both.
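
As a concrete target for that blank-page derivation: with a one-hot target y, the result is ∂CE/∂x = softmax(x) - y. The sketch below checks that identity against central finite differences in plain Python. The function names (`softmax`, `ce_grad`, `numeric_grad`) and the example logits are illustrative assumptions, not code from the curriculum.

```python
import math

def softmax(x):
    """Numerically stable softmax of a list of logits."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(x, target):
    """Cross-entropy loss for logits x and an integer class index."""
    return -math.log(softmax(x)[target])

def ce_grad(x, target):
    """Hand-derived gradient: softmax(x) minus the one-hot target."""
    p = softmax(x)
    return [p_i - (1.0 if i == target else 0.0) for i, p_i in enumerate(p)]

def numeric_grad(f, x, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + h] + x[i + 1:]
        down = x[:i] + [x[i] - h] + x[i + 1:]
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

# Example values are arbitrary; any logits and valid class index should agree.
logits = [2.0, -1.0, 0.5]
target = 0
analytic = ce_grad(logits, target)
numeric = numeric_grad(lambda z: cross_entropy(z, target), logits)
max_err = max(abs(a - n) for a, n in zip(analytic, numeric))
```

If the derivation is right, `max_err` is tiny and the gradient entries sum to zero, because the softmax probabilities sum to one.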

The thesis of Phase 4

Phase 4 trains two habits:

(1) Whenever you see a new loss function or new layer, write down its gradient by hand before reading anyone's autograd code.

(2) Whenever you see a new optimizer, recognize which "moments + corrections" combination it is.

The first habit guards against "I copied the loss from a tutorial; it must be right." The second guards against "Adam doesn't work; let me try the next paper's optimizer." Both habits compound across the curriculum.

What "the chain rule" really says (preview)

Single-variable case (everyone remembers this):

\[ (f \circ g)'(x) = f'(g(x)) \cdot g'(x) \]

Multivariate case (the version backprop uses):

\[ J_{f \circ g}(x) = J_f(g(x)) \cdot J_g(x) \]

where J_h denotes the Jacobian matrix of h. The product is matrix multiplication.

For a five-layer network y = L_5(L_4(L_3(L_2(L_1(x))))), the gradient of the loss with respect to the input is:

\[ \nabla_x \ell = J_{L_1}^T \cdot J_{L_2}^T \cdot J_{L_3}^T \cdot J_{L_4}^T \cdot J_{L_5}^T \cdot \nabla_y \ell \]

A product of five Jacobians. That's it. The "magic" of backprop is doing this product right-to-left (so you start with a vector and keep matrix-vector products instead of full matrix-matrix products), and storing intermediate activations from the forward pass so the Jacobians can be evaluated at the right points.
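
The right-to-left product can be sketched in a few lines. The example below assumes a toy three-layer chain where each layer is x ↦ tanh(Wx) and the loss is ½‖y‖², so ∇_y ℓ = y. The weights, the input, and the helper names are made up for illustration. Each backward step is one transposed-Jacobian-times-vector product, and the result is compared with finite differences.

```python
import math

def matvec(W, x):
    """W @ x for a list-of-rows matrix."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def mat_t_vec(W, g):
    """W^T @ g without materialising the transpose."""
    return [sum(W[i][j] * g[i] for i in range(len(W))) for j in range(len(W[0]))]

def forward(weights, x):
    """Run the chain, storing every layer output for the backward pass."""
    activations = [x]
    for W in weights:
        x = [math.tanh(z) for z in matvec(W, x)]
        activations.append(x)
    return activations

def loss(y):
    return 0.5 * sum(v * v for v in y)

def backward(weights, activations):
    """Right-to-left products J^T g, starting from grad_y loss = y."""
    g = list(activations[-1])
    for W, a in zip(reversed(weights), reversed(activations[1:])):
        # Layer Jacobian is diag(1 - a^2) @ W, so J^T g = W^T ((1 - a^2) * g).
        g = mat_t_vec(W, [(1.0 - ai * ai) * gi for ai, gi in zip(a, g)])
    return g

# Hypothetical weights with shapes 2x3, 3x2, 2x3 (input dim 3, output dim 2).
weights = [
    [[0.5, -0.3, 0.8], [0.1, 0.7, -0.4]],
    [[0.2, -0.6], [0.9, 0.3], [-0.5, 0.4]],
    [[0.3, 0.1, -0.7], [-0.2, 0.6, 0.5]],
]
x0 = [0.4, -1.2, 0.7]

grad_x = backward(weights, forward(weights, x0))

# Finite-difference check of the same gradient.
h = 1e-6
numeric = []
for i in range(len(x0)):
    up = x0[:i] + [x0[i] + h] + x0[i + 1:]
    down = x0[:i] + [x0[i] - h] + x0[i + 1:]
    numeric.append((loss(forward(weights, up)[-1]) - loss(forward(weights, down)[-1])) / (2 * h))
max_err = max(abs(a - n) for a, n in zip(grad_x, numeric))
```

Note that `backward` never builds a full Jacobian matrix: it only carries a vector `g` from the output back to the input, reusing the stored activations.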

When you implement autograd in Phase 7, this is literally what you'll code. No surprises.

What "the optimizer hierarchy" really is (preview)

| Optimizer | Update (per parameter) |
|---|---|
| GD | θ ← θ - η g |
| SGD | same, with g = stochastic gradient |
| Momentum | v ← β v + g; θ ← θ - η v |
| Nesterov | momentum with a look-ahead step |
| AdaGrad | s ← s + g²; θ ← θ - η g / sqrt(s + ε) |
| RMSProp | s ← β s + (1-β) g²; θ ← θ - η g / sqrt(s + ε) |
| Adam | momentum + RMSProp + bias correction |
| AdamW | Adam + decoupled weight decay |

Read top-to-bottom: each row adds one component to an earlier row. Adam isn't a complicated optimizer; it's "Momentum on first moment + RMSProp on second moment + bias correction at small step counts."

You shouldn't memorize the table. You should derive it: "I want to track a running average of the gradient → Momentum. I also want to scale by recent magnitude → divide by sqrt of running average of squared gradient → RMSProp. I want both → Adam. At step 1, the moving averages are biased toward zero → divide by 1 - β^t → bias correction. I want weight decay → either add λθ to gradient (Adam-with-L2, broken) or subtract ηλθ from θ directly (AdamW, correct)."
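
The derivation above maps almost line for line onto code. The following is a minimal sketch of a single AdamW step on plain Python lists, not a reference implementation. The function name `adamw_step` and its defaults are assumptions. It uses the exponential-moving-average form of the first moment and places ε outside the square root, both of which are conventions that vary between write-ups.

```python
import math

def adamw_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0):
    """One AdamW update; t is the 1-based step count. Returns (theta, m, v)."""
    new_theta, new_m, new_v = [], [], []
    for th, gi, mi, vi in zip(theta, g, m, v):
        mi = beta1 * mi + (1 - beta1) * gi        # running average of g (Momentum idea)
        vi = beta2 * vi + (1 - beta2) * gi * gi   # running average of g^2 (RMSProp idea)
        m_hat = mi / (1 - beta1 ** t)             # bias correction: averages start at 0
        v_hat = vi / (1 - beta2 ** t)
        adam_dir = m_hat / (math.sqrt(v_hat) + eps)
        # Decoupled decay: lambda * theta is added after the adaptive scaling,
        # so it is never divided by sqrt(v_hat).
        th = th - lr * (adam_dir + weight_decay * th)
        new_theta.append(th)
        new_m.append(mi)
        new_v.append(vi)
    return new_theta, new_m, new_v

# At t = 1 the corrected moments are g and g^2, so the step is about lr * sign(g).
theta1, m1, v1 = adamw_step([1.0, -2.0], [0.5, -3.0], [0.0, 0.0], [0.0, 0.0],
                            t=1, lr=0.1)

# With a zero gradient, only the decoupled decay moves the parameter.
decayed, _, _ = adamw_step([1.0], [0.0], [0.0], [0.0],
                           t=1, lr=0.1, weight_decay=0.1)
```

The first call shows why bias correction matters: without it, the step at t = 1 would be scaled by (1 - β₁) / sqrt(1 - β₂) instead of being roughly η in magnitude.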

When you can reproduce the table by deriving it top-to-bottom, Phase 4 is done.

Why this matters for AI specifically

Five claims that should make sense after Phase 4 but probably look like jargon now:

  1. "Adam is the default" because it adapts the per-parameter learning rate. The Hessian of a neural-network loss is anisotropic (Phase 3's condition number argument). Adam's 1/sqrt(v̂) acts as a cheap diagonal preconditioner: it divides each parameter's step by the root-mean-square of that parameter's recent gradients, a rough proxy for curvature along that coordinate rather than an actual Hessian entry.
  2. AdamW outperforms Adam in practice because decoupled weight decay doesn't interact with the second-moment estimate. With L2 folded into the gradient, the λθ term is divided by sqrt(v̂) like everything else, so parameters with large gradients are regularized less. (Phase 28, on fine-tuning, will use AdamW by default.)
  3. Cosine learning-rate schedules are standard because they outperform constant LR in practice. The "why" — slower decay at the end matches the noise structure of late training — is folklore-level intuition, but the schedule itself is mechanical.
  4. Warmup helps because early gradients are dominated by initialization noise. A large LR amplifies this noise; warmup lets the network's loss surface "settle" before taking large steps. Phase 11 (initialization) will revisit.
  5. Gradient clipping is a hack that works. Clipping ||g|| to a max norm before the optimizer step prevents exploding-gradient catastrophes (mostly from RNNs/LSTMs; less critical for transformers). Phase 16 (training loop) will include it.

Every one of those statements is a calculus + optimization argument. Phase 4 makes them all derivable.
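
Claims 3 to 5 are mechanical enough to sketch directly. The snippet below assumes one common formulation: linear warmup to a peak learning rate followed by cosine decay, plus clipping by the L2 norm of the gradient. The function names, the `(step + 1) / warmup_steps` warmup rule, and the example numbers are illustrative assumptions, not the curriculum's training loop.

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr (0-based step)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

def clip_by_norm(g, max_norm):
    """Rescale g so its L2 norm is at most max_norm; the direction is unchanged."""
    norm = math.sqrt(sum(v * v for v in g))
    if norm <= max_norm:
        return list(g)
    scale = max_norm / norm
    return [v * scale for v in g]

# Example: 100 steps, 10 of them warmup, peak learning rate 1.0.
schedule = [lr_at(s, 100, 1.0, 10) for s in range(101)]

# A gradient of norm 5 clipped to norm 1 keeps its direction.
clipped = clip_by_norm([3.0, 4.0], 1.0)
```

In a training loop, the clipping would sit between computing the gradient and calling the optimizer step, and `lr_at` would supply η for that step.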

The path through Phase 4

  • Theory 01 lays the vocabulary — derivative, partial, gradient, Jacobian, Hessian.
  • Theory 02 is the engine — chain rule single-and-multi, derivation of the softmax + cross-entropy gradient. Read this twice.
  • Theory 03 derives the optimizer hierarchy from "what memory does each step have access to?"
  • Theory 04 covers LR schedules and the warmup argument.
  • Lab 00 has you derive ∂ softmax/∂x and ∂ CE/∂x on paper, then verify numerically.
  • Lab 01 has you compute a Jacobian analytically and via finite differences (preview of Phase 7's testing strategy).
  • Lab 02 is the animation lab — race optimizers, watch them succeed and fail.
  • Lab 03 is the LR schedule visualisation.
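
Labs 00 and 01 both come down to comparing an analytic derivative with finite differences. A hedged sketch of that pattern for the softmax Jacobian, J[i][j] = p_i (δ_ij - p_j), is shown below. The helper names and the test point are assumptions, and the labs themselves may structure the check differently.

```python
import math

def softmax(x):
    """Numerically stable softmax of a list of logits."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(x):
    """Analytic Jacobian: J[i][j] = p_i * (delta_ij - p_j)."""
    p = softmax(x)
    n = len(p)
    return [[p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(n)]
            for i in range(n)]

def numeric_jacobian(f, x, h=1e-6):
    """Central differences: column j is (f(x + h e_j) - f(x - h e_j)) / (2h)."""
    cols = []
    for j in range(len(x)):
        up = x[:j] + [x[j] + h] + x[j + 1:]
        down = x[:j] + [x[j] - h] + x[j + 1:]
        cols.append([(a - b) / (2 * h) for a, b in zip(f(up), f(down))])
    # cols[j][i] holds d f_i / d x_j; transpose so the result is J[i][j].
    return [[cols[j][i] for j in range(len(x))] for i in range(len(cols[0]))]

# Arbitrary test point; any logits should give matching Jacobians.
x = [1.0, -0.5, 2.0, 0.0]
J_analytic = softmax_jacobian(x)
J_numeric = numeric_jacobian(softmax, x)
max_err = max(abs(a - b)
              for row_a, row_b in zip(J_analytic, J_numeric)
              for a, b in zip(row_a, row_b))
```

A useful sanity check on the analytic form: every column of the Jacobian sums to zero, because the softmax outputs always sum to one.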

Stop here if

You are tempted to skip Phase 4 because "I've used Adam." Don't. The test is not "can you call torch.optim.Adam." The test is: can you derive ∂ CE/∂x and Adam's update on a blank page, in five minutes, no notes? If you can't yet, Phase 4 is for you.


Next: theory/01-derivatives-gradients-jacobians.md.