Phase 04 — Calculus & Optimization for AI

Requires: 03 — Linear Algebra from First Principles

Teaches: gradients · chain-rule · backprop · sgd · momentum · adam

Jump to any chapter from the phase reference index.

Chapter map

Pre-written per A12. Theory + lab statements are stable drafts; solutions land just-in-time at phase open.

Backprop is the chain rule, applied mechanically. Adam is SGD with two moving averages and a bias correction. This phase derives both by hand, implements them, and visualizes them running on the Rosenbrock function.


Goal

Internalize the calculus that drives backpropagation and the optimization that drives training:

  • Derive ∂/∂x softmax(x)_i and ∂/∂x CE(softmax(x), y) on a blank page.
  • Derive Adam's update rule from "running mean of gradient, running mean of squared gradient, bias-corrected."
  • Implement four optimizers in src/minigrad/optim.py (SGD, momentum, Adam, AdamW) as pure functions operating on state dicts.
  • Animate their trajectories on the Rosenbrock function and articulate why each behaves the way it does.
  • Explain why learning-rate warmup helps, using a Hessian-conditioning argument rather than treating it as a heuristic.
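To make "running mean of gradient, running mean of squared gradient, bias-corrected" concrete, here is a minimal sketch of one Adam step written as a pure function over a state dict. The function name `adam_step`, its signature, the list-of-floats parameter layout, and the state keys `t`, `m`, `v` are all assumptions for illustration; they are not the actual API of src/minigrad/optim.py. The hyperparameter defaults are the commonly used ones.

```python
import math


def adam_step(params, grads, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. Returns (new_params, new_state); inputs are not mutated.

    Hypothetical signature: params and grads are flat lists of floats,
    state is a dict that starts out empty.
    """
    t = state.get("t", 0) + 1
    m_prev = state.get("m", [0.0] * len(params))
    v_prev = state.get("v", [0.0] * len(params))

    # Running mean of the gradient (first moment).
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m_prev, grads)]
    # Running mean of the squared gradient (second moment).
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v_prev, grads)]

    # Bias correction: both means start at zero, so early estimates are
    # too small by a factor of (1 - beta^t).
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]

    new_params = [
        p - lr * mh / (math.sqrt(vh) + eps)
        for p, mh, vh in zip(params, m_hat, v_hat)
    ]
    return new_params, {"t": t, "m": m, "v": v}


def rosenbrock(x, y, a=1.0, b=100.0):
    """Rosenbrock function; global minimum at (a, a**2)."""
    return (a - x) ** 2 + b * (y - x * x) ** 2


def rosenbrock_grad(x, y, a=1.0, b=100.0):
    """Analytic gradient [df/dx, df/dy] of the Rosenbrock function."""
    dx = -2 * (a - x) - 4 * b * x * (y - x * x)
    dy = 2 * b * (y - x * x)
    return [dx, dy]


# Usage: run Adam on Rosenbrock from an arbitrary starting point.
params, state = [-1.5, 2.0], {}
start_value = rosenbrock(*params)
for _ in range(2000):
    params, state = adam_step(params, rosenbrock_grad(*params), state, lr=0.01)
end_value = rosenbrock(*params)
print(f"f(start) = {start_value:.4f}, f(end) = {end_value:.6f}")
```

Because the state is returned rather than mutated, the same pattern extends to SGD and momentum by dropping the moments they do not use. Note that on the first step the bias-corrected update reduces to roughly `lr * sign(grad)`, which is a useful sanity check when testing.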

This is the calculus prerequisite for Phase 7 (scalar autograd from scratch) and Phase 16 (the first real training loop).

Read order

  1. theory/00-motivation.md — why a calculus phase before any autograd.
  2. theory/01-derivatives-gradients-jacobians.md — the vocabulary of multivariate calculus, with the explicit picture of Jacobian-as-matrix.
  3. theory/02-chain-rule-and-backprop.md — the single-variable and multivariate chain rule, plus the explicit derivation of ∂ CE/∂x for softmax + cross-entropy. The most important page in Phase 4.
  4. theory/03-optimizers-derived.md — gradient descent → SGD → momentum → Adam → AdamW. Each derived, not memorized.
  5. theory/04-lr-schedules-and-warmup.md — constant / step / cosine / warmup; the Hessian-conditioning argument for why warmup helps.
  6. lab/00-derive-softmax-gradient.md — derive ∂ softmax/∂x and ∂ CE/∂x on paper; verify numerically.
  7. lab/01-jacobian-by-hand.md — compute a Jacobian analytically and via finite differences; verify match.
  8. lab/02-optimizers-on-rosenbrock.md — race four optimizers; animate.
  9. lab/03-lr-schedules.md — implement and plot five LR schedules.
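As a preview of the "derive on paper; verify numerically" pattern used in lab/00, the sketch below checks the standard result ∂ CE/∂x = softmax(x) − y against central finite differences. It is stdlib-only Python, and the helper names (`ce_grad_analytic`, `grad_numeric`) and the test point are hypothetical, not taken from the lab statement. The identity assumes the target y sums to 1 (one-hot here).

```python
import math


def softmax(x):
    """Numerically stable softmax over a list of floats."""
    m = max(x)  # subtracting the max does not change the result
    exps = [math.exp(xi - m) for xi in x]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(x, y):
    """CE(softmax(x), y) for logits x and a target distribution y."""
    p = softmax(x)
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p))


def ce_grad_analytic(x, y):
    """Result of the paper derivation: dCE/dx = softmax(x) - y."""
    return [pi - yi for pi, yi in zip(softmax(x), y)]


def grad_numeric(f, x, h=1e-5):
    """Central-difference gradient of a scalar function f at x."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad


x = [2.0, -1.0, 0.5]   # arbitrary logits
y = [0.0, 1.0, 0.0]    # one-hot target
analytic = ce_grad_analytic(x, y)
numeric = grad_numeric(lambda z: cross_entropy(z, y), x)
max_err = max(abs(a - n) for a, n in zip(analytic, numeric))
print("max abs error:", max_err)
```

Central differences have O(h²) truncation error, so they agree with the analytic gradient far more closely than a one-sided difference with the same step size.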

solutions/ is empty at pre-write — populated at phase open.

Definition of Done

See PHASE_04_PLAN.md §6. Briefly:

  • src/minigrad/optim.py exists with four step functions; tests pass.
  • Rosenbrock animation committed.
  • LR schedule chart committed.
  • Jacobian analytical-vs-numerical verification within 1e-4.
  • Borja can derive Adam, derive ∂ softmax/∂x, and explain warmup.
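The Jacobian check in the list above can be sketched as follows, using softmax as the example function: the analytic Jacobian J[i][j] = p_i (δ_ij − p_j) is compared entry by entry with a central-difference estimate, using the 1e-4 tolerance stated above. The choice of softmax, the helper names, and the test point are assumptions for illustration; the lab may use a different function.

```python
import math


def softmax(x):
    """Numerically stable softmax over a list of floats."""
    m = max(x)
    exps = [math.exp(xi - m) for xi in x]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_jacobian(x):
    """Analytic Jacobian: J[i][j] = d softmax_i / d x_j = p_i * (delta_ij - p_j)."""
    p = softmax(x)
    n = len(p)
    return [
        [p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(n)]
        for i in range(n)
    ]


def numeric_jacobian(f, x, h=1e-5):
    """Central-difference Jacobian of a vector function f at x, as J[i][j]."""
    n = len(x)
    columns = []
    for j in range(n):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += h
        x_minus[j] -= h
        f_plus, f_minus = f(x_plus), f(x_minus)
        # Column j holds d f_i / d x_j for every output index i.
        columns.append([(a - b) / (2 * h) for a, b in zip(f_plus, f_minus)])
    m = len(columns[0])
    # Transpose the list of columns into row-major J[i][j].
    return [[columns[j][i] for j in range(n)] for i in range(m)]


TOL = 1e-4  # tolerance from the Definition of Done
x = [0.3, -1.2, 2.0, 0.0]  # arbitrary test point
J_analytic = softmax_jacobian(x)
J_numeric = numeric_jacobian(softmax, x)
max_err = max(
    abs(J_analytic[i][j] - J_numeric[i][j])
    for i in range(len(x))
    for j in range(len(x))
)
assert max_err < TOL, f"Jacobian mismatch: {max_err}"
print("max abs error:", max_err)
```

Two structural checks come for free: the softmax Jacobian is symmetric, and each of its columns sums to zero because the outputs always sum to 1.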

What this phase intentionally does NOT cover

  • Autograd implementation. Phase 7 (scalar) and Phase 8 (tensor).
  • Hessian computation in code. Theory only; the work is in Phase 7+ when Value / Tensor enable second derivatives.
  • Convex optimization beyond gradient descent variants (interior point, simplex, conjugate gradient). Out of curriculum scope.
  • Stochastic vs deterministic gradient descent convergence proofs. Mentioned for vocabulary; not derived.
  • Adaptive optimizers beyond Adam/AdamW (Shampoo, Lion, Sophia). Phase 24+ if frontier-relevant.
  • Newton's method and quasi-Newton (BFGS, L-BFGS). Out of scope.

Phase 4's scope is the calculus and optimization that immediately drive neural-network training. Nothing more.

Further reading

Optional — enrichment, not required to pass the phase.