English · Español

01 — Computation graphs¶

🇪🇸 Un grafo de computación es un DAG: nodos = valores, aristas = operaciones que produjeron cada valor a partir de sus padres. Se construye implícitamente durante el forward pass cuando Python evalúa expresiones. Es el esqueleto sobre el que después corre el backward.

What a computation graph is¶

A computation graph is a directed acyclic graph (DAG) in which:

Nodes are intermediate values.
Edges point from parents to children: if c = a * b, then there are edges a → c and b → c.
Each non-leaf node is associated with an operation (the _op tag in our Value class).

By convention in autograd, we sometimes flip the edge direction for visualization, drawing arrows from children to parents — because that's the direction backward traversal will take. We'll use both depending on context; the data structure is the same either way.

A small example¶

Consider d = a * b + c.

a ──┐
    ╳ (*)──> ab ──┐
b ──┘             ╳ (+)──> d
                  │
c ────────────────┘

Five nodes (a, b, c, ab, d), four edges, two ops (*, +). The DAG is acyclic (no cycles), directed (edges have a direction), and rooted at the output (d).

How it gets built¶

Python evaluates d = a * b + c left-to-right with normal precedence:

a * b invokes Value.__mul__(a, b). This method:
Computes the data: a.data * b.data = 6.0 (say a=2, b=3).
Creates a new Value with that data.
Sets _prev = (a, b) and _op = '*' on the new node.
Attaches a _backward closure that, when called, will add b.data * out.grad to a.grad and a.data * out.grad to b.grad.
Returns the new Value. Let's call it ab.
ab + c invokes Value.__add__(ab, c). Similar story: new Value with data ab.data + c.data = 6 + 10 = 16, _prev = (ab, c), _op = '+', _backward that adds out.grad to both parents' grads. Returns d.

By the time the assignment d = ... completes, the entire DAG exists, with parent pointers in place and _backward closures armed. The forward pass is the graph-building pass. There's no separate "compile" or "trace" step — that's eager mode, and that's how PyTorch (and micrograd, and our minigrad) work.

Why a DAG (not a tree, not a general graph)¶

Why acyclic: Values are created strictly later than their parents. A new Value can only point at already-existing parents. There is no way to create a cycle (you'd need a node to be its own ancestor, but at creation time none of its ancestors exist yet).

Why directed: the parent-child relationship is intrinsically directional. a * b makes a and b parents of the product; there's no symmetry.

Why not a tree: a single Value can be a parent to multiple children. Example: d = a * b + a has a appearing twice — once as a multiplicand, once as an addend. a has two outgoing edges (one to ab, one to d). This is the diamond pattern, and it's where naive backprop implementations break (we'll see why in 03-worked-backprop.md).

a ──┬──> ab ──┐
    │         ╳ (+)──> d
b ──┘         │
    ┌─────────┘
a (same node)

(I drew a twice in ASCII but it's literally the same object in memory.)

Leaves vs internal nodes¶

A leaf is a Value whose _prev is empty — i.e., it wasn't produced by any operation, it was constructed directly with Value(some_float). Leaves are typically:

Parameters — learnable weights and biases.
Inputs — features fed to the model.
Constants — hand-set values that don't change.

An internal node has _prev containing at least one parent. It was produced by an op.

This matters because:

During backward, leaves' _backward is a no-op (nothing to propagate further — they have no parents).
During training, only parameter leaves get updated by the optimizer; input/constant leaves keep their values. (In Phase 9, Parameter will be the explicit marker.)

In Phase 7 we don't distinguish — every Value(...) is a leaf, and the XOR training loop manually applies updates to the weights. Phase 9 introduces Parameter as a Tensor subclass that the optimizer knows about.

What `_backward` is, exactly¶

_backward is a function with no arguments that, when called, contributes this node's gradient back to its parents' gradients. It's a closure — it captures the parents by reference at op-creation time.

Concretely, for c = a * b:

def _backward():
    a.grad += b.data * c.grad   # ∂c/∂a · ∂L/∂c = b · ∂L/∂c
    b.grad += a.data * c.grad   # ∂c/∂b · ∂L/∂c = a · ∂L/∂c
c._backward = _backward

Three things to notice:

+=, not =. Because a parent might have multiple children, and each child's contribution must be summed. (Diamond case.)
Reads c.grad. This is why backward must be called in topological order — c.grad must be fully accumulated from c's downstream children before c._backward runs, so the gradient it propagates is correct.
Closure captures a and b by reference. When _backward is called (potentially much later, after many more ops have been added to the graph), it still refers to the same a and b objects. Python's closure semantics make this work; we just need to be careful not to use a loop variable carelessly.

The leaf's _backward is the default no-op: lambda: None. Internal nodes get a meaningful closure when their op constructor runs.

How backward traversal works (topological sort + reverse walk)¶

The backward algorithm:

def backward(self):
    # 1. Build topological order: parents before children.
    topo = []
    visited = set()
    def build(v):
        if v in visited: return
        visited.add(v)
        for p in v._prev:
            build(p)
        topo.append(v)
    build(self)

    # 2. Seed: ∂L/∂L = 1.
    self.grad = 1.0

    # 3. Walk in reverse: children before parents.
    for v in reversed(topo):
        v._backward()

Let's trace it on d = a*b + c:

build(d): visits d's parents (ab, c) first.
build(ab): visits ab's parents (a, b) first.
build(a): no parents, appends a. topo = [a].
build(b): no parents, appends b. topo = [a, b].
After ab's parents are done, append ab. topo = [a, b, ab].
build(c): no parents, appends c. topo = [a, b, ab, c].
Append d. topo = [a, b, ab, c, d].

Reverse: [d, c, ab, b, a]. Seed d.grad = 1. Walk:

d._backward(): ab.grad += 1 * d.grad = 1; c.grad += 1 * d.grad = 1. (Add op's local derivative is 1 for both parents.)
c._backward(): no-op (leaf).
ab._backward(): a.grad += b.data * ab.grad = 3 * 1 = 3; b.grad += a.data * ab.grad = 2 * 1 = 2.
b._backward(): no-op.
a._backward(): no-op.

Final gradients: a.grad = 3, b.grad = 2, c.grad = 1. Verify by hand: d = ab + c = a*b + c. ∂d/∂a = b = 3 ✓. ∂d/∂b = a = 2 ✓. ∂d/∂c = 1 ✓.

The diamond pattern, in slow motion¶

d = a*b + a (a used twice):

ab = a * b. ab._prev = (a, b).
d = ab + a. d._prev = (ab, a).
a appears in both (a, b) and (ab, a) — same object both times.

Topo sort (one valid order): [b, a, ab, d]. Reverse: [d, ab, a, b]. Seed d.grad = 1.

d._backward(): ab.grad += 1; a.grad += 1. Now a.grad = 1.
ab._backward(): a.grad += b.data * ab.grad = b·1 = b; b.grad += a.data * ab.grad = a·1 = a.

After: a.grad = 1 + b. Verify: d = a·b + a. ∂d/∂a = b + 1 ✓.

The two contributions to a.grad arrived in two separate _backward calls. They summed because we used +=. If we'd used =, the second call would have overwritten the first and we'd have a.grad = b, which is wrong.

This is the single most important reason _backward uses +=. Test it explicitly.

Why this is "reverse-mode autodiff"¶

Two flavors of automatic differentiation:

Forward-mode: propagate derivative information forward alongside data. To compute ∂L/∂a for some parameter a, you seed a with derivative 1, set every other parameter's derivative to 0, and propagate forward. Cost: one forward pass per parameter. Bad when you have millions of parameters and one loss.
Reverse-mode (this phase): propagate derivative information backward from the loss. One backward pass gives you ∂L/∂every-parameter. Cost: one forward + one backward, regardless of parameter count.

For LLMs: 1 loss, billions of parameters → reverse-mode wins. For physics simulations with thousands of inputs and a single output → reverse-mode wins. For finance with a few inputs and many outputs → forward-mode might win.

The Karpathy rule of thumb: reverse-mode when n_outputs << n_inputs. Forward-mode when n_outputs >> n_inputs. Phase 7 only builds reverse-mode; that's all ML uses.

Pitfalls to lock in (preview of common bugs)¶

These will bite in lab. Heads-up:

Forgetting += — using = in _backward silently drops contributions for nodes used multiple times. The diamond test catches it.
Re-running backward() without zeroing — gradients accumulate; second call doubles them. Phase 9's optimizer will introduce zero_grad().
Mutating data between forward and backward — the backward closure captured a.data (as a value, since float is immutable) at op-creation time… wait, actually, the closure captures a as an object, and reads a.data at call time. If you mutate a.data between forward and backward, the closure will read the new value. Subtle. Don't mutate in flight.
Topo sort with cycles — can't happen by construction in our framework, but a defensive assert is cheap.
set() containing Value instances — Value needs __hash__ for set membership. Default object hash (by identity) works fine. Don't override __eq__ unless you also override __hash__.

One-paragraph recap¶

A computation graph is a DAG where each non-leaf node is associated with an operation and a closure that contributes the node's gradient back to its parents. The graph is built implicitly during the forward pass by overriding Python's arithmetic operators; the backward pass is a topological sort followed by a reverse walk, calling each node's _backward closure to accumulate gradients via the chain rule. The += accumulation handles the diamond case where one node has multiple children. This is reverse-mode automatic differentiation — the right choice for ML because we have many parameters and one scalar loss.

Next: 02-op-derivatives.md