00 — Motivation: why scalar autograd is the right entry point

Backprop has a reputation for being obscure. It is, if you learn it from giant transformers. If you learn it from a single number (a Value that wraps a float and remembers who its parents are), the chain rule stops being magic and becomes a reverse walk over a graph. This phase drives that intuition into your bones.


What you'll have built when this phase is done

A Python class Value that wraps a single Python float. You can do:

a = Value(2.0)
b = Value(-3.0)
c = Value(10.0)
d = a * b + c.tanh()
d.backward()

print(a.grad)   # ∂d/∂a, computed automatically
print(b.grad)   # ∂d/∂b
print(c.grad)   # ∂d/∂c

…and the three gradients are correct, computed by a topological reverse traversal of a DAG that Value built during the forward pass. No NumPy. No PyTorch. ~150 lines of Borja's own code.
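As a hand check on what "correct" means here, the three partials can be written down directly from calculus and evaluated with nothing but the standard library. This is only a sanity-check sketch; the names grad_a, grad_b and grad_c are illustrative and not part of the Value API.

```python
import math

# Inputs from the example above.
a, b, c = 2.0, -3.0, 10.0

# Forward value: d = a*b + tanh(c).
d = a * b + math.tanh(c)

# Partial derivatives derived by hand:
#   dd/da = b,  dd/db = a,  dd/dc = 1 - tanh(c)^2
grad_a = b
grad_b = a
grad_c = 1.0 - math.tanh(c) ** 2

print(d)       # very close to -5.0, because tanh(10) is almost exactly 1
print(grad_a)  # -3.0
print(grad_b)  # 2.0
print(grad_c)  # tiny positive number: tanh is saturated at c = 10
```

Note that c = 10.0 puts tanh deep in its saturated region, so c.grad should come out nearly zero. A value noticeably different from zero there would point to a bug in the tanh rule.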

This is, in miniature, exactly what PyTorch does. The differences are scale (tensors instead of floats), performance (C/CUDA kernels), and ergonomics (modules, optimizers). The idea is identical.

Why "build it once at scalar grain"

There's a common pedagogical path that goes "use PyTorch first, understand it later". It works for users who never need to understand backprop. For Borja (and for this curriculum's pedagogical contract — CLAUDE.md §0.4), it's the wrong path. The reason:

Tensor autograd has three sources of complexity stacked on top of each other.

  1. The autograd algorithm itself (DAG, reverse traversal, chain rule).
  2. NumPy mechanics (shapes, broadcasting, strides — Phase 6 covered these).
  3. Per-op gradient derivations (matmul backward, softmax backward — non-trivial linear algebra).

If you debug a wrong gradient in PyTorch on day one, you don't know which of the three is biting you. Was the topo sort wrong? Was the broadcast reverse wrong? Did you derive the matmul backward incorrectly?

Scalar autograd has only the first source of complexity. No shapes (everything is a float). No broadcasting. No per-op derivations beyond what fits on a single sheet of paper (the rules for +, -, *, / and the like are trivial). Everything that goes wrong in this phase is the autograd algorithm. You'll learn exactly what that looks like: what an algorithm-level bug feels like, and what kind of test catches it.

When Phase 8 adds NumPy and tensor-shaped derivatives, those will be the new sources of bugs. You'll be able to isolate them because you already trust the algorithm itself.

What backprop is, in three paragraphs

Forward pass: you write d = a * b + c.tanh(). Python evaluates this left-to-right (operator precedence aside) and produces a number. Along the way, each operator creates a Value node that remembers its parents and an operation tag. By the time d is assigned, an in-memory DAG exists: d knows it was made from "a*b" and "c.tanh()" via addition; the * node knows it was made from a and b; the tanh node knows it was made from c. Six Value objects (a, b, c, a*b, c.tanh(), and d), five edges, one graph.
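To make the recording step concrete, here is a minimal sketch of a node that only records its parents and op tag, with no gradient logic at all. Node, parents, op and collect are hypothetical names chosen for illustration; they are not the curriculum's API.

```python
import math


class Node:
    """Hypothetical stand-in for Value: records the graph, computes no gradients."""

    def __init__(self, data, parents=(), op=""):
        self.data = data
        self.parents = parents  # the nodes this one was computed from
        self.op = op            # tag naming the operation that produced it

    def __add__(self, other):
        return Node(self.data + other.data, (self, other), "+")

    def __mul__(self, other):
        return Node(self.data * other.data, (self, other), "*")

    def tanh(self):
        return Node(math.tanh(self.data), (self,), "tanh")


def collect(root):
    """Return every node reachable from root and every (parent, child) edge."""
    nodes, edges, seen = [], [], set()
    stack = [root]
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        nodes.append(node)
        for parent in node.parents:
            edges.append((parent, node))
            stack.append(parent)
    return nodes, edges


a, b, c = Node(2.0), Node(-3.0), Node(10.0)
d = a * b + c.tanh()

nodes, edges = collect(d)
print(len(nodes), len(edges))           # 6 5
print(d.op, [p.op for p in d.parents])  # + ['*', 'tanh']
```

Nothing here calls a graph-building function explicitly. The DAG is a side effect of ordinary arithmetic, which is why the forward pass "builds" it for free.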

Backward pass: d.backward() does two things. First, it walks the DAG from d outward (parents-of-parents-of-…) to produce a topological order. Then it walks that order in reverse, starting by setting d.grad = 1.0 (the seed: ∂d/∂d = 1), and at each node, applies a small local rule that adds this node's contribution to its parents' .grad attributes. By the end of the reverse walk, every node — including a, b, c — has its .grad set to the partial derivative of d with respect to it.

The "small local rule" at each node is the local derivative of that op. For a multiplication m = a*b: m's gradient is multiplied by b to contribute to a's gradient, and by a to contribute to b's gradient, because ∂m/∂a = b and ∂m/∂b = a. Each op has one such rule. There are about ten of them and they all fit on one page.
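The forward and backward passes described above can be sketched in a few dozen lines. What follows is a micrograd-style illustration covering only +, * and tanh. It is not this curriculum's reference solution, and the attribute name _parents and the helper visit are assumptions made for the sketch.

```python
import math


class Value:
    """Micrograd-style sketch; only +, * and tanh are covered."""

    def __init__(self, data, _parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = _parents
        self._backward = lambda: None  # leaves have nothing to propagate

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            # Local derivatives of addition are both 1.
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            # d(out)/d(self) = other.data and d(out)/d(other) = self.data.
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))

        def _backward():
            # d tanh(x)/dx = 1 - tanh(x)^2.
            self.grad += (1.0 - t * t) * out.grad

        out._backward = _backward
        return out

    def backward(self):
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)  # appended only after all of its parents

        visit(self)
        self.grad = 1.0  # seed: the derivative of the output w.r.t. itself
        for node in reversed(order):
            node._backward()


a, b, c = Value(2.0), Value(-3.0), Value(10.0)
d = a * b + c.tanh()
d.backward()
print(a.grad, b.grad)  # -3.0 2.0
```

The `+=` in each closure matters: a node used more than once (for example x * x) receives one contribution per use, and those contributions must accumulate rather than overwrite each other.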

That's it. That's all of backprop, ever, in any framework, at any scale.

Why this scales (and why Phase 8 is "just more of the same")

The scalar Value you build in Phase 7 will be replaced in Phase 8 by a Tensor wrapping a NumPy array. The five things that are the same:

  • DAG structure (nodes + parents + op tag).
  • Forward builds the DAG.
  • Backward = topo sort + reverse traversal.
  • Each op contributes a local rule.
  • _backward is a closure that captures the parents and applies the rule.

The five things that change:

  • data is an ndarray, not a float.
  • grad is an ndarray of the same shape as data, not a float.
  • The local rules become tensor-shaped (e.g., matmul backward).
  • Broadcasting needs to be reversed in the backward (sum-along-broadcast-axes).
  • Per-op tests need gradcheck (finite differences) in addition to PyTorch cross-checks.

Phase 7 nails the first list. Phase 8 adds the second. The phase split is deliberate: one source of complexity at a time.
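The gradcheck idea from the second list already works at scalar grain, which makes it a cheap oracle before tensors arrive. Below is a minimal central-difference sketch in plain Python. The helper name numeric_grad, the step size eps and the tolerance are assumptions for illustration, not values prescribed by the curriculum.

```python
import math


def numeric_grad(f, args, i, eps=1e-6):
    """Central-difference estimate of the partial derivative of f w.r.t. args[i]."""
    hi = list(args)
    lo = list(args)
    hi[i] += eps
    lo[i] -= eps
    return (f(*hi) - f(*lo)) / (2.0 * eps)


def f(a, b, c):
    # Same expression as the running example, on plain floats.
    return a * b + math.tanh(c)


point = (2.0, -3.0, 0.5)  # c = 0.5 keeps tanh away from saturation

# Hand-derived partials at that point: (b, a, 1 - tanh(c)^2).
analytic = (point[1], point[0], 1.0 - math.tanh(point[2]) ** 2)

for i, expected in enumerate(analytic):
    estimate = numeric_grad(f, point, i)
    assert abs(estimate - expected) < 1e-6, (i, estimate, expected)
```

In a real test the analytic tuple would be replaced by the .grad values that backward() produced, so that the finite-difference estimate acts as an independent check on the autograd result.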

Why a small project, even at scalar grain

Value is ~150 lines. The XOR-MLP training in experiments/07-train-xor/ is ~50 lines. The graph visualizer is ~30. The tests are ~200. Total: ~430 lines.

Is that "small"? Compared to PyTorch (~3M lines), yes. Compared to micrograd (Karpathy's reference, ~100 lines of Value), it's the same order of magnitude. The point of building it small is: at no point do you have a black box. Every line you write is yours; every line you read is short. When backprop produces a number you don't trust, you can step through it in pdb and see, one op at a time, exactly what happened.

The XOR example is also deliberately small. XOR is not a serious learning task — it's a 4-point dataset that a tiny MLP can memorize in seconds. The point is to see that backprop, end-to-end, works: forward, loss, backward, parameter update, repeat. When loss goes from 0.7 to 0.001 over 500 steps using only your Value class, the project is unblocked. Phase 8 can begin.

What pedagogy this phase honors

CLAUDE.md §0.2: Borja writes the implementation; Claude scaffolds.

That means:

  • Claude writes src/minigrad/scalar/BLUEPRINT.md — purpose, API, alternatives, anti-goals.
  • Claude writes theory/ — derivations, worked examples.
  • Claude writes lab/ — problem statements with TODOs and constraints, no answers.
  • Claude writes failing test stubs in tests/test_scalar_autograd.py — list of ops to cover, expected tolerance, comparison oracle.
  • Borja writes src/minigrad/scalar.py. The actual class body.
  • Borja fills the test bodies (Claude provided the comments).
  • Borja decides design questions like "is tanh native or via exp" (defaults are documented, but the call is his).

If Borja is stuck, the path is: re-read the theory, write a smaller test, look at the worked example in theory/03, ask the math-reviewer subagent. Not "look at the solution".

solutions/ will be written at phase open, after Borja's prior-phase API choices are visible (per addendum §A12). Until then, solutions/ is empty.

One-paragraph recap

Scalar autograd is the smallest possible context in which backprop is fully alive. By wrapping a single float in a Value class that records its parents and a local-derivative closure, the forward pass builds a DAG and the backward pass — a topological reverse traversal applying the chain rule — populates .grad on every node. ~150 lines, no NumPy, no PyTorch. When Phase 8 lifts to tensors, the algorithm is unchanged; only the data type, the per-op derivatives, and the broadcasting reverse are new. Phase 7 nails the algorithm. Build it once at scalar grain, and you'll never again wonder what .backward() is doing under the hood.


Next: 01-computation-graphs.md