Skip to content

English · Español

01 — Computation graphs

🇪🇸 Un grafo de computación es un DAG: nodos = valores, aristas = operaciones que produjeron cada valor a partir de sus padres. Se construye implícitamente durante el forward pass cuando Python evalúa expresiones. Es el esqueleto sobre el que después corre el backward.


What a computation graph is

A computation graph is a directed acyclic graph (DAG) in which:

  • Nodes are intermediate values.
  • Edges point from parents to children: if c = a * b, then there are edges a → c and b → c.
  • Each non-leaf node is associated with an operation (the _op tag in our Value class).

By convention in autograd, we sometimes flip the edge direction for visualization, drawing arrows from children to parents — because that's the direction backward traversal will take. We'll use both depending on context; the data structure is the same either way.

A small example

Consider d = a * b + c.

a ──┐
    ╳ (*)──> ab ──┐
b ──┘             ╳ (+)──> d
c ────────────────┘

Five nodes (a, b, c, ab, d), four edges, two ops (*, +). The DAG is acyclic (no cycles), directed (edges have a direction), and rooted at the output (d).

How it gets built

Python evaluates d = a * b + c left-to-right with normal precedence:

  1. a * b invokes Value.__mul__(a, b). This method:
  2. Computes the data: a.data * b.data = 6.0 (say a=2, b=3).
  3. Creates a new Value with that data.
  4. Sets _prev = (a, b) and _op = '*' on the new node.
  5. Attaches a _backward closure that, when called, will add b.data * out.grad to a.grad and a.data * out.grad to b.grad.
  6. Returns the new Value. Let's call it ab.
  7. ab + c invokes Value.__add__(ab, c). Similar story: new Value with data ab.data + c.data = 6 + 10 = 16, _prev = (ab, c), _op = '+', _backward that adds out.grad to both parents' grads. Returns d.

By the time the assignment d = ... completes, the entire DAG exists, with parent pointers in place and _backward closures armed. The forward pass is the graph-building pass. There's no separate "compile" or "trace" step — that's eager mode, and that's how PyTorch (and micrograd, and our minigrad) work.

Why a DAG (not a tree, not a general graph)

Why acyclic: Values are created strictly later than their parents. A new Value can only point at already-existing parents. There is no way to create a cycle (you'd need a node to be its own ancestor, but at creation time none of its ancestors exist yet).

Why directed: the parent-child relationship is intrinsically directional. a * b makes a and b parents of the product; there's no symmetry.

Why not a tree: a single Value can be a parent to multiple children. Example: d = a * b + a has a appearing twice — once as a multiplicand, once as an addend. a has two outgoing edges (one to ab, one to d). This is the diamond pattern, and it's where naive backprop implementations break (we'll see why in 03-worked-backprop.md).

a ──┬──> ab ──┐
    │         ╳ (+)──> d
b ──┘         │
    ┌─────────┘
a (same node)

(I drew a twice in ASCII but it's literally the same object in memory.)

Leaves vs internal nodes

A leaf is a Value whose _prev is empty — i.e., it wasn't produced by any operation, it was constructed directly with Value(some_float). Leaves are typically:

  • Parameters — learnable weights and biases.
  • Inputs — features fed to the model.
  • Constants — hand-set values that don't change.

An internal node has _prev containing at least one parent. It was produced by an op.

This matters because:

  • During backward, leaves' _backward is a no-op (nothing to propagate further — they have no parents).
  • During training, only parameter leaves get updated by the optimizer; input/constant leaves keep their values. (In Phase 9, Parameter will be the explicit marker.)

In Phase 7 we don't distinguish — every Value(...) is a leaf, and the XOR training loop manually applies updates to the weights. Phase 9 introduces Parameter as a Tensor subclass that the optimizer knows about.

What _backward is, exactly

_backward is a function with no arguments that, when called, contributes this node's gradient back to its parents' gradients. It's a closure — it captures the parents by reference at op-creation time.

Concretely, for c = a * b:

def _backward():
    a.grad += b.data * c.grad   # ∂c/∂a · ∂L/∂c = b · ∂L/∂c
    b.grad += a.data * c.grad   # ∂c/∂b · ∂L/∂c = a · ∂L/∂c
c._backward = _backward

Three things to notice:

  1. +=, not =. Because a parent might have multiple children, and each child's contribution must be summed. (Diamond case.)
  2. Reads c.grad. This is why backward must be called in topological order — c.grad must be fully accumulated from c's downstream children before c._backward runs, so the gradient it propagates is correct.
  3. Closure captures a and b by reference. When _backward is called (potentially much later, after many more ops have been added to the graph), it still refers to the same a and b objects. Python's closure semantics make this work; we just need to be careful not to use a loop variable carelessly.

The leaf's _backward is the default no-op: lambda: None. Internal nodes get a meaningful closure when their op constructor runs.

How backward traversal works (topological sort + reverse walk)

The backward algorithm:

def backward(self):
    # 1. Build topological order: parents before children.
    topo = []
    visited = set()
    def build(v):
        if v in visited: return
        visited.add(v)
        for p in v._prev:
            build(p)
        topo.append(v)
    build(self)

    # 2. Seed: ∂L/∂L = 1.
    self.grad = 1.0

    # 3. Walk in reverse: children before parents.
    for v in reversed(topo):
        v._backward()

Let's trace it on d = a*b + c:

  • build(d): visits d's parents (ab, c) first.
  • build(ab): visits ab's parents (a, b) first.
  • build(a): no parents, appends a. topo = [a].
  • build(b): no parents, appends b. topo = [a, b].
  • After ab's parents are done, append ab. topo = [a, b, ab].
  • build(c): no parents, appends c. topo = [a, b, ab, c].
  • Append d. topo = [a, b, ab, c, d].

Reverse: [d, c, ab, b, a]. Seed d.grad = 1. Walk:

  • d._backward(): ab.grad += 1 * d.grad = 1; c.grad += 1 * d.grad = 1. (Add op's local derivative is 1 for both parents.)
  • c._backward(): no-op (leaf).
  • ab._backward(): a.grad += b.data * ab.grad = 3 * 1 = 3; b.grad += a.data * ab.grad = 2 * 1 = 2.
  • b._backward(): no-op.
  • a._backward(): no-op.

Final gradients: a.grad = 3, b.grad = 2, c.grad = 1. Verify by hand: d = ab + c = a*b + c. ∂d/∂a = b = 3 ✓. ∂d/∂b = a = 2 ✓. ∂d/∂c = 1 ✓.

The diamond pattern, in slow motion

d = a*b + a (a used twice):

  • ab = a * b. ab._prev = (a, b).
  • d = ab + a. d._prev = (ab, a).
  • a appears in both (a, b) and (ab, a) — same object both times.

Topo sort (one valid order): [b, a, ab, d]. Reverse: [d, ab, a, b]. Seed d.grad = 1.

  • d._backward(): ab.grad += 1; a.grad += 1. Now a.grad = 1.
  • ab._backward(): a.grad += b.data * ab.grad = b·1 = b; b.grad += a.data * ab.grad = a·1 = a.

After: a.grad = 1 + b. Verify: d = a·b + a. ∂d/∂a = b + 1 ✓.

The two contributions to a.grad arrived in two separate _backward calls. They summed because we used +=. If we'd used =, the second call would have overwritten the first and we'd have a.grad = b, which is wrong.

This is the single most important reason _backward uses +=. Test it explicitly.

Why this is "reverse-mode autodiff"

Two flavors of automatic differentiation:

  • Forward-mode: propagate derivative information forward alongside data. To compute ∂L/∂a for some parameter a, you seed a with derivative 1, set every other parameter's derivative to 0, and propagate forward. Cost: one forward pass per parameter. Bad when you have millions of parameters and one loss.
  • Reverse-mode (this phase): propagate derivative information backward from the loss. One backward pass gives you ∂L/∂every-parameter. Cost: one forward + one backward, regardless of parameter count.

For LLMs: 1 loss, billions of parameters → reverse-mode wins. For physics simulations with thousands of inputs and a single output → reverse-mode wins. For finance with a few inputs and many outputs → forward-mode might win.

The Karpathy rule of thumb: reverse-mode when n_outputs << n_inputs. Forward-mode when n_outputs >> n_inputs. Phase 7 only builds reverse-mode; that's all ML uses.

Pitfalls to lock in (preview of common bugs)

These will bite in lab. Heads-up:

  1. Forgetting += — using = in _backward silently drops contributions for nodes used multiple times. The diamond test catches it.
  2. Re-running backward() without zeroing — gradients accumulate; second call doubles them. Phase 9's optimizer will introduce zero_grad().
  3. Mutating data between forward and backward — the backward closure captured a.data (as a value, since float is immutable) at op-creation time… wait, actually, the closure captures a as an object, and reads a.data at call time. If you mutate a.data between forward and backward, the closure will read the new value. Subtle. Don't mutate in flight.
  4. Topo sort with cycles — can't happen by construction in our framework, but a defensive assert is cheap.
  5. set() containing Value instancesValue needs __hash__ for set membership. Default object hash (by identity) works fine. Don't override __eq__ unless you also override __hash__.

One-paragraph recap

A computation graph is a DAG where each non-leaf node is associated with an operation and a closure that contributes the node's gradient back to its parents. The graph is built implicitly during the forward pass by overriding Python's arithmetic operators; the backward pass is a topological sort followed by a reverse walk, calling each node's _backward closure to accumulate gradients via the chain rule. The += accumulation handles the diamond case where one node has multiple children. This is reverse-mode automatic differentiation — the right choice for ML because we have many parameters and one scalar loss.


Next: 02-op-derivatives.md