Phase 04 — Calculus & Optimization for AI

Requires: 03 — Linear Algebra from First Principles
Teaches: gradients · chain-rule · backprop · sgd · momentum · adam
Jump to any chapter from the phase reference index.

Chapter map

Pre-written per A12. Theory + lab statements are stable drafts; solutions land just-in-time at phase open.

Backprop = the chain rule, applied mechanically. Adam = SGD with two moving averages and a bias correction. This phase derives both by hand, implements them, and visualizes them running on the Rosenbrock function.
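As a compact illustration of the chain-rule claim, here is the standard softmax + cross-entropy gradient written as a chain of two local derivatives. The notation is an assumption of this sketch, not taken from the theory pages: s = softmax(x), y is a target distribution with entries summing to 1, and δ_ij is the Kronecker delta.

```latex
s_i = \frac{e^{x_i}}{\sum_k e^{x_k}}, \qquad
\mathrm{CE}(s, y) = -\sum_i y_i \log s_i

\frac{\partial s_i}{\partial x_j} = s_i\,(\delta_{ij} - s_j), \qquad
\frac{\partial\,\mathrm{CE}}{\partial s_i} = -\frac{y_i}{s_i}

\frac{\partial\,\mathrm{CE}}{\partial x_j}
  = \sum_i \frac{\partial\,\mathrm{CE}}{\partial s_i}\,\frac{\partial s_i}{\partial x_j}
  = \sum_i \left(-\frac{y_i}{s_i}\right) s_i\,(\delta_{ij} - s_j)
  = -y_j + s_j \sum_i y_i
  = s_j - y_j
```

The last step uses the assumption that the entries of y sum to 1.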

Goal

Internalize the calculus that drives backpropagation and the optimization that drives training:

Derive ∂/∂x softmax(x)_i and ∂/∂x CE(softmax(x), y) on a blank page.
Derive Adam's update rule from "running mean of gradient, running mean of squared gradient, bias-corrected."
Implement four optimizers in src/minigrad/optim.py (SGD, momentum, Adam, AdamW) as pure functions operating on state dicts.
Animate their trajectories on the Rosenbrock function and articulate why each behaves the way it does.
Explain why learning-rate warmup helps as a Hessian-conditioning argument, not as a heuristic.
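The optimizer goals above can be sketched as follows. This is a minimal illustration of "a pure function operating on a state dict", not the actual contents of src/minigrad/optim.py: the names `adam_step` and `init_adam_state`, the state keys `t`, `m`, `v`, the starting point, and the learning rate are all hypothetical choices made for this example.

```python
import math


def rosenbrock(p):
    """Rosenbrock function with a = 1, b = 100; global minimum at (1, 1)."""
    x, y = p
    return (1 - x) ** 2 + 100 * (y - x * x) ** 2


def rosenbrock_grad(p):
    """Analytical gradient of the Rosenbrock function."""
    x, y = p
    return [-2 * (1 - x) - 400 * x * (y - x * x), 200 * (y - x * x)]


def init_adam_state(n):
    """Hypothetical state layout: step count plus two running means."""
    return {"t": 0, "m": [0.0] * n, "v": [0.0] * n}


def adam_step(params, grads, state, lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. Returns (new_params, new_state); inputs are not mutated."""
    t = state["t"] + 1
    # Running mean of the gradient and of the squared gradient.
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(state["m"], grads)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(state["v"], grads)]
    # Bias correction: both means start at zero, so early values are too small.
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    new_params = [
        p - lr * mh / (math.sqrt(vh) + eps)
        for p, mh, vh in zip(params, m_hat, v_hat)
    ]
    return new_params, {"t": t, "m": m, "v": v}


# Record a trajectory that could later be fed to an animation.
params = [-1.5, 2.0]  # illustrative starting point
state = init_adam_state(len(params))
trajectory = [params]
for _ in range(2000):
    params, state = adam_step(params, rosenbrock_grad(params), state)
    trajectory.append(params)
```

Because the step function returns new values instead of mutating its inputs, every point in `trajectory` is preserved. One thing to look for when articulating behavior: on the first step the bias-corrected ratio is close to ±1 per coordinate, so each parameter moves by roughly `lr` regardless of gradient magnitude.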

This is the calculus prerequisite for Phase 7 (scalar autograd from scratch) and Phase 16 (the first real training loop).

Read order

theory/00-motivation.md — why a calculus phase before any autograd.
theory/01-derivatives-gradients-jacobians.md — the vocabulary of multivariate calculus, with the explicit picture of Jacobian-as-matrix.
theory/02-chain-rule-and-backprop.md — the chain rule in single-variable and multivariate form, plus the explicit derivation of ∂CE/∂x for softmax + cross-entropy. The most important page in Phase 4.
theory/03-optimizers-derived.md — gradient descent → SGD → momentum → Adam → AdamW. Each derived, not memorized.
theory/04-lr-schedules-and-warmup.md — constant / step / cosine / warmup; the Hessian-conditioning argument for why warmup helps.
lab/00-derive-softmax-gradient.md — derive ∂ softmax/∂x and ∂ CE/∂x on paper; verify numerically.
lab/01-jacobian-by-hand.md — compute a Jacobian analytically and via finite differences; verify match.
lab/02-optimizers-on-rosenbrock.md — race four optimizers; animate.
lab/03-lr-schedules.md — implement and plot five LR schedules.
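For the schedule material, a minimal sketch of the four schedules named in the theory entry (constant, step, cosine, warmup) might look like the code below. The function names, signatures, and default values are assumptions made for illustration; the lab statement may define them differently, and its fifth schedule is not named on this page, so it is not guessed at here.

```python
import math


def constant_lr(step, base_lr):
    """Same learning rate at every step."""
    return base_lr


def step_lr(step, base_lr, drop_every=100, gamma=0.5):
    """Multiply the rate by gamma once every drop_every steps."""
    return base_lr * gamma ** (step // drop_every)


def cosine_lr(step, base_lr, total_steps, min_lr=0.0):
    """Half-cosine decay from base_lr at step 0 to min_lr at total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


def warmup_cosine_lr(step, base_lr, total_steps, warmup_steps, min_lr=0.0):
    """Linear ramp up to base_lr over warmup_steps, then cosine decay."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return cosine_lr(step - warmup_steps, base_lr, total_steps - warmup_steps, min_lr)


# Values that could be handed to any plotting library.
schedule = [warmup_cosine_lr(s, 1e-3, total_steps=1000, warmup_steps=50) for s in range(1000)]
```

Each schedule is a pure function of the step index, so it can be plotted or unit-tested without running an optimizer.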

solutions/ is empty at pre-write — populated at phase open.

Definition of Done

See PHASE_04_PLAN.md §6. Briefly:

src/minigrad/optim.py exists with four step functions; tests pass.
Rosenbrock animation committed.
LR schedule chart committed.
Jacobian analytical-vs-numerical verification within 1e-4.
Borja can derive Adam, derive ∂ softmax/∂x, and explain warmup.
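The analytical-vs-numerical Jacobian check can be sketched as below, using softmax as the test function. This is one possible shape for the verification, not the lab's reference solution: the helper names, the central-difference step `h`, and the test input are assumptions. The analytical form used is the standard softmax Jacobian, J[i][j] = s_i (δ_ij − s_j).

```python
import math


def softmax(x):
    """Numerically stable softmax on a list of floats."""
    m = max(x)
    exps = [math.exp(xi - m) for xi in x]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_jacobian(x):
    """Analytical Jacobian: J[i][j] = s_i * (delta_ij - s_j)."""
    s = softmax(x)
    n = len(s)
    return [[s[i] * ((1.0 if i == j else 0.0) - s[j]) for j in range(n)] for i in range(n)]


def numerical_jacobian(f, x, h=1e-5):
    """Central finite differences: J[i][j] ~ (f_i(x + h e_j) - f_i(x - h e_j)) / (2h)."""
    n_in = len(x)
    n_out = len(f(x))
    J = [[0.0] * n_in for _ in range(n_out)]
    for j in range(n_in):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[j] += h
        x_minus[j] -= h
        f_plus, f_minus = f(x_plus), f(x_minus)
        for i in range(n_out):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * h)
    return J


def max_abs_diff(A, B):
    """Largest elementwise absolute difference between two matrices."""
    return max(abs(a - b) for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))


x = [1.0, -0.5, 2.0]  # illustrative input
J_analytical = softmax_jacobian(x)
J_numerical = numerical_jacobian(softmax, x)
err = max_abs_diff(J_analytical, J_numerical)
assert err < 1e-4, f"Jacobians disagree: max abs diff = {err}"
```

Central differences have error on the order of h², so a 1e-4 tolerance leaves generous headroom here; a forward-difference check would sit much closer to the threshold.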

What this phase intentionally does NOT cover

Autograd implementation. Phase 7 (scalar) and Phase 8 (tensor).
Hessian computation in code. Theory only; the implementation waits for Phase 7+, when Value / Tensor enable second derivatives.
Convex optimization beyond gradient descent variants (interior point, simplex, conjugate gradient). Out of curriculum scope.
Stochastic vs deterministic gradient descent convergence proofs. Mentioned for vocabulary; not derived.
Adaptive optimizers beyond Adam/AdamW (Shampoo, Lion, Sophia). Phase 24+ if frontier-relevant.
Newton's method and quasi-Newton (BFGS, L-BFGS). Out of scope.

Phase 4's scope is the calculus and optimization that immediately drive neural-network training. Nothing more.