English · Español
04 — Reverse-mode vs forward-mode¶
🇪🇸 Hay dos formas de hacer autodiferenciación: forward-mode (propagar derivadas adelante junto con los valores) y reverse-mode (recorrer hacia atrás desde la pérdida). ML usa reverse-mode porque tiene una salida (la pérdida) y millones de entradas (los pesos). Esta página explica el porqué con una sola tabla.
The setup¶
You have a function f: Rⁿ → Rᵐ. You want the Jacobian — the matrix of all partial derivatives ∂fᵢ/∂xⱼ, shape (m, n).
Automatic differentiation gives you Jacobian-vector products (JVPs) or vector-Jacobian products (VJPs) one at a time. The question is: which one matches your problem shape?
Forward-mode: JVP¶
Computes: (∂f/∂x) · v for a chosen direction vector v ∈ Rⁿ. Cost: ~1 forward pass.
To get the full Jacobian, you'd run forward-mode n times, with v set to each standard basis vector e₁, e₂, ..., eₙ. Each run gives one column of the Jacobian. Total cost: n forward passes.
Mechanics: alongside the value of each intermediate variable, also propagate a "tangent" — the derivative of that intermediate with respect to the chosen input direction. The tangent is updated by forward-mode chain rule: if c = a · b and we know tangents ȧ = ∂a/∂x · v_input and ḃ = ∂b/∂x · v_input, then ċ = b · ȧ + a · ḃ.
Easy to implement. No graph needed. But: one input direction at a time.
Reverse-mode: VJP¶
Computes: w · (∂f/∂x) for a chosen direction vector w ∈ Rᵐ. Cost: ~1 forward + ~1 backward pass.
To get the full Jacobian, you'd run reverse-mode m times, with w set to each eᵢ. Each run gives one row of the Jacobian. Total cost: m forward+backward passes.
Mechanics: build a DAG during forward, then traverse it in reverse applying local derivatives — exactly what we built in §03.
Harder to implement (graph + topo sort + closures). But: one output direction at a time.
The decision¶
For Jacobian shape (m, n):
- Forward-mode is cheap when n is small — few inputs to vary.
- Reverse-mode is cheap when m is small — few outputs to back-propagate from.
The ML case¶
In ML, the function is:
m = 1(a scalar loss).n = number of parameters— can be in the millions or billions.
Reverse-mode cost: ~1 forward + ~1 backward. Total: ~2× the forward cost, regardless of n.
Forward-mode cost: n forward passes. For a billion-parameter model: a billion forward passes per gradient. Hopeless.
Reverse-mode wins by a factor of n/2 for ML workloads. That's why every framework — PyTorch, JAX, TensorFlow, our minigrad — defaults to reverse-mode.
When forward-mode wins¶
- Few-input, many-output functions. E.g., a 3-D parametric curve with thousands of computed properties.
n=3, m=10000. Forward-mode: 3 forward passes. Reverse-mode: 10000 backward passes. - Higher-order derivatives. Computing a Hessian-vector product (HVP) uses both — typically reverse-over-forward. JAX makes this easy; PyTorch supports it but the ergonomics are clunkier.
- In-place updates / no graph available. Forward-mode doesn't need a DAG. Useful in physics simulators where building a graph of every intermediate is prohibitive.
What "reverse-mode" pays for¶
The DAG. Every intermediate Value must be kept alive until backward runs, because backward reads parents' .data. For a deep neural network with millions of activations, that's a lot of RAM.
In Phase 18 we'll introduce activation checkpointing: drop some intermediate activations during forward, recompute them during backward. Costs ~30% more compute, saves 50%+ memory. The technique only makes sense for reverse-mode autodiff (forward-mode has no "memory of intermediates" because tangents flow forward in lockstep with values).
For Phase 7, our graphs are tiny (XOR-MLP has maybe 100 nodes total). No memory concerns. The point is to feel the algorithm.
The single picture that explains everything¶
Cost to compute the Jacobian:
forward-mode reverse-mode
n inputs, m outputs: O(n · cost_fwd) O(m · cost_fwd)
ML: n = many params, m = 1: O(n) O(1) ← reverse-mode wins
parametric curve: n=3, m=many: O(3) O(m) ← forward-mode wins
That's the whole choice. Reverse-mode for ML, forward-mode for the unusual shape.
How does Phase 7 manifest this?¶
minigrad.scalar.Value.backward() is reverse-mode. There is no forward() method that propagates tangents. There's no jvp(). We have one tool, the right one for ML.
If Borja wants to play with forward-mode later (say, computing an HVP for a curvature analysis), the technique to add is "dual numbers" — a class that wraps a (value, tangent) pair and overrides arithmetic to propagate both. It's a fun exercise; not in scope for the curriculum.
Higher-order derivatives (preview, not implementation)¶
To compute ∂²L/∂a² (a second derivative), you can:
- Reverse-over-reverse: call
backward(), then build a second graph over the gradients themselves, then callbackward()again. Works in theory; requires the_backwardclosures to themselves produceValueobjects (currently they update.gradas raw floats). Refactor needed. - Reverse-over-forward: push tangents forward, then backward over the result. Cheaper for HVPs.
- Finite differences: ε-perturb the parameter and re-do backward.
PyTorch supports (1) with the create_graph=True flag on backward. JAX supports (2) very cleanly. minigrad.scalar supports none of them — we keep grad as a float. Document this in the BLUEPRINT.
Pitfalls in mode-mixing thinking¶
- Don't try to do "forward-mode" by setting all-but-one parameter's grad to zero and seeing the answer. That's reverse-mode with sparse seeds; it doesn't get you forward-mode efficiency.
backward()always produces gradients with respect to leaves. It doesn't natively give you Jacobians of intermediate-to-intermediate. To get∂(intermediate)/∂(leaf), you callintermediate.backward()(treating the intermediate as if it were the loss).- Detaching a node breaks gradient flow. PyTorch's
.detach()makes aTensorthat doesn't contribute to backward.minigrad.scalardoesn't shipdetach()(no use case at scalar grain). Phase 8 considers it.
One-paragraph recap¶
Forward-mode and reverse-mode autodiff are dual algorithms for computing Jacobians; forward-mode costs scale with the number of inputs, reverse-mode with the number of outputs. ML has one scalar loss and many parameters, so reverse-mode is the universal choice — cost is ~2× forward, regardless of parameter count. Phase 7 (and every later phase) only implements reverse-mode. Forward-mode and higher-order derivatives are interesting but explicitly out of scope. The single picture: O(min(n, m)) Jacobian cost — pick the mode that aligns with whichever dimension is small.
End of Phase 7 theory. Move to lab/00-value-skeleton.md.