Skip to content

English · Español

04 — Reverse-mode vs forward-mode

🇪🇸 Hay dos formas de hacer autodiferenciación: forward-mode (propagar derivadas adelante junto con los valores) y reverse-mode (recorrer hacia atrás desde la pérdida). ML usa reverse-mode porque tiene una salida (la pérdida) y millones de entradas (los pesos). Esta página explica el porqué con una sola tabla.


The setup

You have a function f: Rⁿ → Rᵐ. You want the Jacobian — the matrix of all partial derivatives ∂fᵢ/∂xⱼ, shape (m, n).

Automatic differentiation gives you Jacobian-vector products (JVPs) or vector-Jacobian products (VJPs) one at a time. The question is: which one matches your problem shape?

Forward-mode: JVP

Computes: (∂f/∂x) · v for a chosen direction vector v ∈ Rⁿ. Cost: ~1 forward pass.

To get the full Jacobian, you'd run forward-mode n times, with v set to each standard basis vector e₁, e₂, ..., eₙ. Each run gives one column of the Jacobian. Total cost: n forward passes.

Mechanics: alongside the value of each intermediate variable, also propagate a "tangent" — the derivative of that intermediate with respect to the chosen input direction. The tangent is updated by forward-mode chain rule: if c = a · b and we know tangents ȧ = ∂a/∂x · v_input and ḃ = ∂b/∂x · v_input, then ċ = b · ȧ + a · ḃ.

Easy to implement. No graph needed. But: one input direction at a time.

Reverse-mode: VJP

Computes: w · (∂f/∂x) for a chosen direction vector w ∈ Rᵐ. Cost: ~1 forward + ~1 backward pass.

To get the full Jacobian, you'd run reverse-mode m times, with w set to each eᵢ. Each run gives one row of the Jacobian. Total cost: m forward+backward passes.

Mechanics: build a DAG during forward, then traverse it in reverse applying local derivatives — exactly what we built in §03.

Harder to implement (graph + topo sort + closures). But: one output direction at a time.

The decision

For Jacobian shape (m, n):

  • Forward-mode is cheap when n is small — few inputs to vary.
  • Reverse-mode is cheap when m is small — few outputs to back-propagate from.

The ML case

In ML, the function is:

f: (parameters, data) → scalar loss
  • m = 1 (a scalar loss).
  • n = number of parameters — can be in the millions or billions.

Reverse-mode cost: ~1 forward + ~1 backward. Total: ~2× the forward cost, regardless of n.

Forward-mode cost: n forward passes. For a billion-parameter model: a billion forward passes per gradient. Hopeless.

Reverse-mode wins by a factor of n/2 for ML workloads. That's why every framework — PyTorch, JAX, TensorFlow, our minigrad — defaults to reverse-mode.

When forward-mode wins

  • Few-input, many-output functions. E.g., a 3-D parametric curve with thousands of computed properties. n=3, m=10000. Forward-mode: 3 forward passes. Reverse-mode: 10000 backward passes.
  • Higher-order derivatives. Computing a Hessian-vector product (HVP) uses both — typically reverse-over-forward. JAX makes this easy; PyTorch supports it but the ergonomics are clunkier.
  • In-place updates / no graph available. Forward-mode doesn't need a DAG. Useful in physics simulators where building a graph of every intermediate is prohibitive.

What "reverse-mode" pays for

The DAG. Every intermediate Value must be kept alive until backward runs, because backward reads parents' .data. For a deep neural network with millions of activations, that's a lot of RAM.

In Phase 18 we'll introduce activation checkpointing: drop some intermediate activations during forward, recompute them during backward. Costs ~30% more compute, saves 50%+ memory. The technique only makes sense for reverse-mode autodiff (forward-mode has no "memory of intermediates" because tangents flow forward in lockstep with values).

For Phase 7, our graphs are tiny (XOR-MLP has maybe 100 nodes total). No memory concerns. The point is to feel the algorithm.

The single picture that explains everything

            Cost to compute the Jacobian:

                                forward-mode      reverse-mode
n inputs, m outputs:            O(n · cost_fwd)   O(m · cost_fwd)
ML: n = many params, m = 1:     O(n)              O(1)        ← reverse-mode wins
parametric curve: n=3, m=many:  O(3)              O(m)        ← forward-mode wins

That's the whole choice. Reverse-mode for ML, forward-mode for the unusual shape.

How does Phase 7 manifest this?

minigrad.scalar.Value.backward() is reverse-mode. There is no forward() method that propagates tangents. There's no jvp(). We have one tool, the right one for ML.

If Borja wants to play with forward-mode later (say, computing an HVP for a curvature analysis), the technique to add is "dual numbers" — a class that wraps a (value, tangent) pair and overrides arithmetic to propagate both. It's a fun exercise; not in scope for the curriculum.

Higher-order derivatives (preview, not implementation)

To compute ∂²L/∂a² (a second derivative), you can:

  1. Reverse-over-reverse: call backward(), then build a second graph over the gradients themselves, then call backward() again. Works in theory; requires the _backward closures to themselves produce Value objects (currently they update .grad as raw floats). Refactor needed.
  2. Reverse-over-forward: push tangents forward, then backward over the result. Cheaper for HVPs.
  3. Finite differences: ε-perturb the parameter and re-do backward.

PyTorch supports (1) with the create_graph=True flag on backward. JAX supports (2) very cleanly. minigrad.scalar supports none of them — we keep grad as a float. Document this in the BLUEPRINT.

Pitfalls in mode-mixing thinking

  1. Don't try to do "forward-mode" by setting all-but-one parameter's grad to zero and seeing the answer. That's reverse-mode with sparse seeds; it doesn't get you forward-mode efficiency.
  2. backward() always produces gradients with respect to leaves. It doesn't natively give you Jacobians of intermediate-to-intermediate. To get ∂(intermediate)/∂(leaf), you call intermediate.backward() (treating the intermediate as if it were the loss).
  3. Detaching a node breaks gradient flow. PyTorch's .detach() makes a Tensor that doesn't contribute to backward. minigrad.scalar doesn't ship detach() (no use case at scalar grain). Phase 8 considers it.

One-paragraph recap

Forward-mode and reverse-mode autodiff are dual algorithms for computing Jacobians; forward-mode costs scale with the number of inputs, reverse-mode with the number of outputs. ML has one scalar loss and many parameters, so reverse-mode is the universal choice — cost is ~2× forward, regardless of parameter count. Phase 7 (and every later phase) only implements reverse-mode. Forward-mode and higher-order derivatives are interesting but explicitly out of scope. The single picture: O(min(n, m)) Jacobian cost — pick the mode that aligns with whichever dimension is small.


End of Phase 7 theory. Move to lab/00-value-skeleton.md.