English · Español
01 — Computation graphs¶
🇪🇸 Un grafo de computación es un DAG: nodos = valores, aristas = operaciones que produjeron cada valor a partir de sus padres. Se construye implícitamente durante el forward pass cuando Python evalúa expresiones. Es el esqueleto sobre el que después corre el backward.
What a computation graph is¶
A computation graph is a directed acyclic graph (DAG) in which:
- Nodes are intermediate values.
- Edges point from parents to children: if
c = a * b, then there are edgesa → candb → c. - Each non-leaf node is associated with an operation (the
_optag in ourValueclass).
By convention in autograd, we sometimes flip the edge direction for visualization, drawing arrows from children to parents — because that's the direction backward traversal will take. We'll use both depending on context; the data structure is the same either way.
A small example¶
Consider d = a * b + c.
Five nodes (a, b, c, ab, d), four edges, two ops (*, +). The DAG is acyclic (no cycles), directed (edges have a direction), and rooted at the output (d).
How it gets built¶
Python evaluates d = a * b + c left-to-right with normal precedence:
a * binvokesValue.__mul__(a, b). This method:- Computes the data:
a.data * b.data = 6.0(saya=2, b=3). - Creates a new
Valuewith that data. - Sets
_prev = (a, b)and_op = '*'on the new node. - Attaches a
_backwardclosure that, when called, will addb.data * out.gradtoa.gradanda.data * out.gradtob.grad. - Returns the new
Value. Let's call itab. ab + cinvokesValue.__add__(ab, c). Similar story: newValuewith dataab.data + c.data = 6 + 10 = 16,_prev = (ab, c),_op = '+',_backwardthat addsout.gradto both parents' grads. Returnsd.
By the time the assignment d = ... completes, the entire DAG exists, with parent pointers in place and _backward closures armed. The forward pass is the graph-building pass. There's no separate "compile" or "trace" step — that's eager mode, and that's how PyTorch (and micrograd, and our minigrad) work.
Why a DAG (not a tree, not a general graph)¶
Why acyclic: Values are created strictly later than their parents. A new Value can only point at already-existing parents. There is no way to create a cycle (you'd need a node to be its own ancestor, but at creation time none of its ancestors exist yet).
Why directed: the parent-child relationship is intrinsically directional. a * b makes a and b parents of the product; there's no symmetry.
Why not a tree: a single Value can be a parent to multiple children. Example: d = a * b + a has a appearing twice — once as a multiplicand, once as an addend. a has two outgoing edges (one to ab, one to d). This is the diamond pattern, and it's where naive backprop implementations break (we'll see why in 03-worked-backprop.md).
(I drew a twice in ASCII but it's literally the same object in memory.)
Leaves vs internal nodes¶
A leaf is a Value whose _prev is empty — i.e., it wasn't produced by any operation, it was constructed directly with Value(some_float). Leaves are typically:
- Parameters — learnable weights and biases.
- Inputs — features fed to the model.
- Constants — hand-set values that don't change.
An internal node has _prev containing at least one parent. It was produced by an op.
This matters because:
- During backward, leaves'
_backwardis a no-op (nothing to propagate further — they have no parents). - During training, only parameter leaves get updated by the optimizer; input/constant leaves keep their values. (In Phase 9,
Parameterwill be the explicit marker.)
In Phase 7 we don't distinguish — every Value(...) is a leaf, and the XOR training loop manually applies updates to the weights. Phase 9 introduces Parameter as a Tensor subclass that the optimizer knows about.
What _backward is, exactly¶
_backward is a function with no arguments that, when called, contributes this node's gradient back to its parents' gradients. It's a closure — it captures the parents by reference at op-creation time.
Concretely, for c = a * b:
def _backward():
a.grad += b.data * c.grad # ∂c/∂a · ∂L/∂c = b · ∂L/∂c
b.grad += a.data * c.grad # ∂c/∂b · ∂L/∂c = a · ∂L/∂c
c._backward = _backward
Three things to notice:
+=, not=. Because a parent might have multiple children, and each child's contribution must be summed. (Diamond case.)- Reads
c.grad. This is why backward must be called in topological order —c.gradmust be fully accumulated fromc's downstream children beforec._backwardruns, so the gradient it propagates is correct. - Closure captures
aandbby reference. When_backwardis called (potentially much later, after many more ops have been added to the graph), it still refers to the sameaandbobjects. Python's closure semantics make this work; we just need to be careful not to use a loop variable carelessly.
The leaf's _backward is the default no-op: lambda: None. Internal nodes get a meaningful closure when their op constructor runs.
How backward traversal works (topological sort + reverse walk)¶
The backward algorithm:
def backward(self):
# 1. Build topological order: parents before children.
topo = []
visited = set()
def build(v):
if v in visited: return
visited.add(v)
for p in v._prev:
build(p)
topo.append(v)
build(self)
# 2. Seed: ∂L/∂L = 1.
self.grad = 1.0
# 3. Walk in reverse: children before parents.
for v in reversed(topo):
v._backward()
Let's trace it on d = a*b + c:
build(d): visitsd's parents(ab, c)first.build(ab): visitsab's parents(a, b)first.build(a): no parents, appendsa.topo = [a].build(b): no parents, appendsb.topo = [a, b].- After
ab's parents are done, appendab.topo = [a, b, ab]. build(c): no parents, appendsc.topo = [a, b, ab, c].- Append
d.topo = [a, b, ab, c, d].
Reverse: [d, c, ab, b, a]. Seed d.grad = 1. Walk:
d._backward():ab.grad += 1 * d.grad = 1;c.grad += 1 * d.grad = 1. (Add op's local derivative is 1 for both parents.)c._backward(): no-op (leaf).ab._backward():a.grad += b.data * ab.grad = 3 * 1 = 3;b.grad += a.data * ab.grad = 2 * 1 = 2.b._backward(): no-op.a._backward(): no-op.
Final gradients: a.grad = 3, b.grad = 2, c.grad = 1. Verify by hand: d = ab + c = a*b + c. ∂d/∂a = b = 3 ✓. ∂d/∂b = a = 2 ✓. ∂d/∂c = 1 ✓.
The diamond pattern, in slow motion¶
d = a*b + a (a used twice):
ab = a * b.ab._prev = (a, b).d = ab + a.d._prev = (ab, a).aappears in both(a, b)and(ab, a)— same object both times.
Topo sort (one valid order): [b, a, ab, d]. Reverse: [d, ab, a, b]. Seed d.grad = 1.
d._backward():ab.grad += 1;a.grad += 1. Nowa.grad = 1.ab._backward():a.grad += b.data * ab.grad = b·1 = b;b.grad += a.data * ab.grad = a·1 = a.
After: a.grad = 1 + b. Verify: d = a·b + a. ∂d/∂a = b + 1 ✓.
The two contributions to a.grad arrived in two separate _backward calls. They summed because we used +=. If we'd used =, the second call would have overwritten the first and we'd have a.grad = b, which is wrong.
This is the single most important reason _backward uses +=. Test it explicitly.
Why this is "reverse-mode autodiff"¶
Two flavors of automatic differentiation:
- Forward-mode: propagate derivative information forward alongside data. To compute
∂L/∂afor some parametera, you seedawith derivative 1, set every other parameter's derivative to 0, and propagate forward. Cost: one forward pass per parameter. Bad when you have millions of parameters and one loss. - Reverse-mode (this phase): propagate derivative information backward from the loss. One backward pass gives you ∂L/∂every-parameter. Cost: one forward + one backward, regardless of parameter count.
For LLMs: 1 loss, billions of parameters → reverse-mode wins. For physics simulations with thousands of inputs and a single output → reverse-mode wins. For finance with a few inputs and many outputs → forward-mode might win.
The Karpathy rule of thumb: reverse-mode when n_outputs << n_inputs. Forward-mode when n_outputs >> n_inputs. Phase 7 only builds reverse-mode; that's all ML uses.
Pitfalls to lock in (preview of common bugs)¶
These will bite in lab. Heads-up:
- Forgetting
+=— using=in_backwardsilently drops contributions for nodes used multiple times. The diamond test catches it. - Re-running
backward()without zeroing — gradients accumulate; second call doubles them. Phase 9's optimizer will introducezero_grad(). - Mutating
databetween forward and backward — the backward closure captureda.data(as a value, since float is immutable) at op-creation time… wait, actually, the closure capturesaas an object, and readsa.dataat call time. If you mutatea.databetween forward and backward, the closure will read the new value. Subtle. Don't mutate in flight. - Topo sort with cycles — can't happen by construction in our framework, but a defensive
assertis cheap. set()containingValueinstances —Valueneeds__hash__forsetmembership. Defaultobjecthash (by identity) works fine. Don't override__eq__unless you also override__hash__.
One-paragraph recap¶
A computation graph is a DAG where each non-leaf node is associated with an operation and a closure that contributes the node's gradient back to its parents. The graph is built implicitly during the forward pass by overriding Python's arithmetic operators; the backward pass is a topological sort followed by a reverse walk, calling each node's _backward closure to accumulate gradients via the chain rule. The += accumulation handles the diamond case where one node has multiple children. This is reverse-mode automatic differentiation — the right choice for ML because we have many parameters and one scalar loss.
Next: 02-op-derivatives.md