Phase 04: Calculus & Optimization for AI
Requires: Phase 03, Linear Algebra from First Principles
Teaches: gradients · chain-rule · backprop · sgd · momentum · adam
Jump to any chapter from the phase reference index.
Chapter map
Pre-written per A12. Theory + lab statements are stable drafts; solutions land just-in-time at phase open.
Backprop = the chain rule, applied mechanically. Adam = SGD with two moving averages and a bias correction. This phase derives both by hand, implements them, and visualizes them running on Rosenbrock.
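The "two moving averages and a bias correction" description can be written out as follows. This is the standard statement from Kingma & Ba, not a reproduction of this phase's theory pages. The symbols are assumptions introduced here: $g_t$ is the gradient at step $t$, $\eta$ the learning rate, $\beta_1, \beta_2$ the decay rates, $\epsilon$ a small constant, and $m_0 = v_0 = 0$.

```latex
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t
  && \text{(running mean of the gradient)} \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2
  && \text{(running mean of the squared gradient)} \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \qquad
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}
  && \text{(bias correction)} \\
\theta_t &= \theta_{t-1} - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
```

The correction is needed because both averages start at zero: if the gradient were constant at $g$, then $m_t = (1 - \beta_1^t)\, g$, so dividing by $1 - \beta_1^t$ recovers $g$ exactly.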
Goal
Internalize the calculus that drives backpropagation and the optimization that drives training:
- Derive `∂/∂x softmax(x)_i` and `∂/∂x CE(softmax(x), y)` on a blank page.
- Derive Adam's update rule from "running mean of gradient, running mean of squared gradient, bias-corrected."
- Implement four optimizers in `src/minigrad/optim.py` (SGD, momentum, Adam, AdamW) as pure functions operating on state dicts.
- Animate their trajectories on the Rosenbrock function and articulate why each behaves the way it does.
- Explain why learning-rate warmup helps as a Hessian-conditioning argument, not as a heuristic.
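For orientation, the result the first derivation should arrive at is sketched below. It assumes $p = \mathrm{softmax}(x)$ and a target vector $y$ whose entries sum to 1 (one-hot is the usual case). The symbols $p$, $\delta_{ij}$ and the index names are notation chosen here, not taken from the theory pages.

```latex
\begin{aligned}
p_i &= \frac{e^{x_i}}{\sum_k e^{x_k}},
&\qquad
\frac{\partial p_i}{\partial x_j} &= p_i\,(\delta_{ij} - p_j), \\[4pt]
\mathrm{CE}(p, y) &= -\sum_i y_i \log p_i,
&\qquad
\frac{\partial\, \mathrm{CE}}{\partial x_j}
  &= -\sum_i \frac{y_i}{p_i}\, p_i\,(\delta_{ij} - p_j)
   = -y_j + p_j \sum_i y_i
   = p_j - y_j .
\end{aligned}
```

The last step uses $\sum_i y_i = 1$. The whole gradient collapses to "prediction minus target", which is why softmax and cross-entropy are usually differentiated together.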
This is the calculus prerequisite for Phase 7 (scalar autograd from scratch) and Phase 16 (the first real training loop).
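A minimal sketch of what "pure functions operating on state dicts" could look like is given below. The function names, signatures, state keys (`"t"`, `"m"`, `"v"`), default hyperparameters and the `run` helper are all assumptions made for illustration. They are not the contents of `src/minigrad/optim.py`. Plain Python lists stand in for parameters so the block runs without dependencies.

```python
import math

def rosenbrock(p):
    """f(x, y) = (1 - x)^2 + 100 (y - x^2)^2, with its minimum at (1, 1)."""
    x, y = p
    a, b = 1.0 - x, y - x * x
    return a * a + 100.0 * b * b

def rosenbrock_grad(p):
    """Analytical gradient of the Rosenbrock function."""
    x, y = p
    b = y - x * x
    return [-2.0 * (1.0 - x) - 400.0 * x * b, 200.0 * b]

# Each step function takes (params, grads, state) and returns
# (new_params, new_state) without mutating its inputs.

def sgd_step(params, grads, state, lr=1e-3):
    return [p - lr * g for p, g in zip(params, grads)], state

def momentum_step(params, grads, state, lr=1e-3, beta=0.9):
    v = state.get("v", [0.0] * len(params))
    v = [beta * vi + g for vi, g in zip(v, grads)]  # accumulated velocity
    return [p - lr * vi for p, vi in zip(params, v)], {"v": v}

def adam_step(params, grads, state, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0):
    t = state.get("t", 0) + 1
    m = state.get("m", [0.0] * len(params))
    v = state.get("v", [0.0] * len(params))
    m = [beta1 * mi + (1.0 - beta1) * g for mi, g in zip(m, grads)]
    v = [beta2 * vi + (1.0 - beta2) * g * g for vi, g in zip(v, grads)]
    m_hat = [mi / (1.0 - beta1 ** t) for mi in m]  # bias correction
    v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
    # weight_decay is decoupled: it acts on the parameter, not on the gradient.
    new_params = [p - lr * (mh / (math.sqrt(vh) + eps) + weight_decay * p)
                  for p, mh, vh in zip(params, m_hat, v_hat)]
    return new_params, {"t": t, "m": m, "v": v}

def adamw_step(params, grads, state, lr=1e-3, weight_decay=1e-2, **kwargs):
    """AdamW here is Adam with a non-zero decoupled weight decay."""
    return adam_step(params, grads, state, lr=lr,
                     weight_decay=weight_decay, **kwargs)

def run(step_fn, start, steps, **hyper):
    """Hypothetical helper: collect the trajectory of one optimizer."""
    params, state = list(start), {}
    trajectory = [tuple(params)]
    for _ in range(steps):
        params, state = step_fn(params, rosenbrock_grad(params), state, **hyper)
        trajectory.append(tuple(params))
    return trajectory
```

Calling `run(adam_step, (-1.5, 1.5), 500, lr=0.01)` returns the list of visited points, which is the data an animation would plot. The starting point and learning rate are arbitrary choices. Keeping the state in a returned dict rather than on an object makes each step easy to test in isolation.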
Read order
1. `theory/00-motivation.md`: why a calculus phase before any autograd.
2. `theory/01-derivatives-gradients-jacobians.md`: the vocabulary of multivariate calculus, with the explicit picture of Jacobian-as-matrix.
3. `theory/02-chain-rule-and-backprop.md`: the chain rule, single- and multivariate, and the explicit derivation of `∂ CE/∂x` for softmax + cross-entropy. The most important page in Phase 4.
4. `theory/03-optimizers-derived.md`: gradient descent → SGD → momentum → Adam → AdamW. Each derived, not memorized.
5. `theory/04-lr-schedules-and-warmup.md`: constant / step / cosine / warmup; the Hessian-conditioning argument for why warmup helps.
6. `lab/00-derive-softmax-gradient.md`: derive `∂ softmax/∂x` and `∂ CE/∂x` on paper; verify numerically.
7. `lab/01-jacobian-by-hand.md`: compute a Jacobian analytically and via finite differences; verify that they match.
8. `lab/02-optimizers-on-rosenbrock.md`: race four optimizers; animate.
9. `lab/03-lr-schedules.md`: implement and plot five LR schedules.
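As a rough illustration of the schedules named in `theory/04-lr-schedules-and-warmup.md`, the sketch below implements constant, step and cosine schedules plus a linear-warmup wrapper. The function names, parameters and the linear shape of the warmup are assumptions, not the lab's required interface. The lab asks for five schedules and the fifth is not named here, so it is left out.

```python
import math

def constant(step, base_lr):
    """Same learning rate at every step."""
    return base_lr

def step_decay(step, base_lr, drop_every=30, factor=0.1):
    """Multiply the rate by `factor` once every `drop_every` steps."""
    return base_lr * factor ** (step // drop_every)

def cosine(step, base_lr, total_steps, min_lr=0.0):
    """Half-cosine decay from base_lr (step 0) to min_lr (step total_steps)."""
    progress = min(1.0, step / max(1, total_steps))
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def with_warmup(schedule, warmup_steps):
    """Wrap a schedule with a linear ramp over the first `warmup_steps` steps."""
    def wrapped(step, base_lr, **kwargs):
        if step < warmup_steps:
            return base_lr * (step + 1) / warmup_steps
        # After warmup, the inner schedule restarts its own step count at 0.
        return schedule(step - warmup_steps, base_lr, **kwargs)
    return wrapped

# Example: 10 warmup steps followed by 100 steps of cosine decay.
warm_cosine = with_warmup(cosine, warmup_steps=10)
rates = [warm_cosine(s, 1e-3, total_steps=100) for s in range(110)]
```

Writing warmup as a wrapper keeps it independent of the decay shape, so the same ramp can be put in front of any of the other schedules when plotting them side by side.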
`solutions/` is empty at pre-write; it is populated at phase open.
Definition of Done
See PHASE_04_PLAN.md §6. Briefly:
- `src/minigrad/optim.py` exists with four step functions; tests pass.
- Rosenbrock animation committed.
- LR schedule chart committed.
- Jacobian analytical-vs-numerical verification within `1e-4`.
- Borja can derive Adam, derive `∂ softmax/∂x`, and explain warmup.
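The analytical-vs-numerical check could look roughly like the sketch below, which uses softmax as the function under test. The helper names, the central-difference step `h=1e-5` and the test point are assumptions chosen for illustration. Only the `1e-4` tolerance comes from the list above.

```python
import math

def softmax(x):
    """Numerically stable softmax over a list of floats."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(x):
    """Analytical Jacobian: J[i][j] = p_i * (delta_ij - p_j)."""
    p = softmax(x)
    n = len(x)
    return [[p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(n)]
            for i in range(n)]

def numerical_jacobian(f, x, h=1e-5):
    """Central differences: J[i][j] ~ (f_i(x + h e_j) - f_i(x - h e_j)) / 2h."""
    columns = []
    for j in range(len(x)):
        plus, minus = list(x), list(x)
        plus[j] += h
        minus[j] -= h
        f_plus, f_minus = f(plus), f(minus)
        columns.append([(a - b) / (2.0 * h) for a, b in zip(f_plus, f_minus)])
    # columns[j][i] holds d f_i / d x_j, so transpose into row-major J[i][j].
    return [[columns[j][i] for j in range(len(x))]
            for i in range(len(columns[0]))]

def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(u - v) for ra, rb in zip(a, b) for u, v in zip(ra, rb))

def cross_entropy(x, y):
    """CE(softmax(x), y) for a target vector y that sums to 1."""
    return -sum(yi * math.log(pi) for yi, pi in zip(y, softmax(x)))

x = [1.0, 2.0, 0.5]   # arbitrary test point
y = [0.0, 1.0, 0.0]   # one-hot target

jacobian_error = max_abs_diff(softmax_jacobian(x), numerical_jacobian(softmax, x))

# The same machinery checks the CE gradient, which should equal p - y.
analytic_grad = [[p - t for p, t in zip(softmax(x), y)]]
numeric_grad = numerical_jacobian(lambda z: [cross_entropy(z, y)], x)
gradient_error = max_abs_diff(analytic_grad, numeric_grad)

assert jacobian_error < 1e-4 and gradient_error < 1e-4
```

Central differences are used instead of forward differences because their truncation error shrinks with the square of `h`, which leaves a wide margin under the `1e-4` tolerance.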
What this phase intentionally does NOT cover
- Autograd implementation. Phase 7 (scalar) and Phase 8 (tensor).
- Hessian computation in code. Theory only; the work is in Phase 7+ when `Value`/`Tensor` enable second derivatives.
- Convex optimization beyond gradient descent variants (interior point, simplex, conjugate gradient). Out of curriculum scope.
- Stochastic vs deterministic gradient descent convergence proofs. Mentioned for vocabulary; not derived.
- Adaptive optimizers beyond Adam/AdamW (Shampoo, Lion, Sophia). Phase 24+ if frontier-relevant.
- Newton's method and quasi-Newton (BFGS, L-BFGS). Out of scope.
Phase 4's scope is the calculus and optimization that immediately drive neural-network training. Nothing more.
Further reading
Optional — enrichment, not required to pass the phase.
- 📄 Adam: A Method for Stochastic Optimization, Kingma & Ba · 2014. The optimizer you derive by hand this phase.
- ✍️ An Overview of Gradient Descent Optimization Algorithms, Ruder · 2016. The whole optimizer family in one map.