Skip to content

English · Español

04 — Autograd graph walk: one backward through the mini-GPT, function tape exposed

🇪🇸 Hacemos un paso forward y backward del mini-GPT de la Fase 17 en PyTorch, y enumeramos la cinta (tape) que autograd construye. Cada operación deja un Function nodo en el grafo; al hacer .backward(), autograd recorre la cinta en orden inverso. Aquí lo vemos en código real y en lista de pasos.

This file extends theory/02-autograd-engine.md with the complete function tape for one backward pass through the Phase-17 mini-GPT. It is a depth-pass exercise in reading PyTorch's autograd graph at a scale small enough to enumerate.


Setup

Take the Phase-17 mini-GPT as a torch.nn.Module:

  • \(d_\text{model} = 64\), \(n_\text{heads} = 4\), \(d_h = 16\), \(n_\text{layers} = 2\), \(d_\text{ff} = 256\).
  • Single sequence: \(T = 8\) tokens.
  • Batch: \(B = 1\).

Forward one batch:

logits = model(input_ids)        # (1, 8, V)
loss = F.cross_entropy(
    logits.view(-1, V),          # (8, V)
    target_ids.view(-1),         # (8,)
)
loss.backward()

The autograd graph for loss.backward() is what we'll walk.

The forward tape — operation by operation

Each torch.Tensor.grad_fn records the operation that produced it. Walking the graph in forward order (read top-down):

  1. input_idsEmbedding.forwardEmbeddingBackward. Output shape: (1, 8, 64).
  2. Position-embed broadcast addAddBackward. Output shape: (1, 8, 64).

For each of \(L = 2\) blocks:

  1. LayerNorm.forward (pre-attn) → NativeLayerNormBackward. Output (1, 8, 64).
  2. Q projection (Linear(64, 64)) → AddmmBackward. Output (1, 8, 64).
  3. K projectionAddmmBackward. Output (1, 8, 64).
  4. V projectionAddmmBackward. Output (1, 8, 64).
  5. Reshape Q to (1, 4, 8, 16)ViewBackward. (No actual data movement; metadata only.)
  6. Reshape K, V similarly → 2× ViewBackward.
  7. Q @ K^T (torch.matmul) → BmmBackward. Output (1, 4, 8, 8).
  8. Scale by 1/sqrt(d_h) = 1/4MulBackward. Output (1, 4, 8, 8).
  9. Causal mask add (broadcasting -inf upper-triangular) → AddBackward.
  10. softmaxSoftmaxBackward. Output (1, 4, 8, 8).
  11. attn @ VBmmBackward. Output (1, 4, 8, 16).
  12. Reshape back to (1, 8, 64)ViewBackward.
  13. Output projection (Linear(64, 64)) → AddmmBackward. Output (1, 8, 64).
  14. Residual add (input from step 2 or previous block) → AddBackward.
  15. LayerNorm.forward (pre-FFN) → NativeLayerNormBackward.
  16. FFN's first Linear (Linear(64, 256)) → AddmmBackward. Output (1, 8, 256).
  17. GELUGeluBackward. Output (1, 8, 256).
  18. FFN's second Linear (Linear(256, 64)) → AddmmBackward. Output (1, 8, 64).
  19. Residual addAddBackward.

For block 2, repeat steps 3-21.

After both blocks:

  1. Final LayerNormNativeLayerNormBackward. Output (1, 8, 64).
  2. LM head (Linear(64, V=512)) → AddmmBackward. Output (1, 8, 512).

For the loss:

  1. view(-1, V)ViewBackward. Output (8, 512).
  2. cross_entropy(logits, targets) decomposes into:
    • log_softmax: → LogSoftmaxBackward. Output (8, 512).
    • NLLLoss: → NllLossBackward. Output () (scalar).

So the forward tape for a 2-layer mini-GPT on \(T = 8, V = 512\) has roughly 41 backward functions (~20 per block + ~5 outside). Each is a node in the autograd graph.

What loss.backward() does — node by node

loss.backward() calls engine.execute(...), which:

  1. Starts with grad_output = torch.ones_like(loss) (i.e., 1.0 for the scalar loss).
  2. Walks the graph in reverse topological order, starting from loss.grad_fn = NllLossBackward.
  3. For each node, calls its backward method with the current grad_output, gets grad_input(s) back, accumulates them into the .grad of the corresponding leaf tensors (parameters / inputs marked requires_grad=True).

Reverse-order walk (read bottom-up):

NllLossBackward(grad=1.0) → ∂loss/∂logits = (softmax - one_hot) / N, shape (8, 512)
LogSoftmaxBackward → (the cross-entropy gradient is folded in)
ViewBackward → (8, 512) reshaped back to (1, 8, 512)
AddmmBackward (LM head):
    ∂loss/∂W_lm_head = h_final^T @ (softmax - one_hot)/N, shape (V, 64)
    ∂loss/∂b_lm_head = sum_T (softmax - one_hot)/N, shape (V,)
    ∂loss/∂h_final = (softmax - one_hot)/N @ W_lm_head, shape (1, 8, 64)
NativeLayerNormBackward (final LN):
    gradient w.r.t. input + gradient w.r.t. scale + gradient w.r.t. bias
... (block 2's full chain in reverse) ...
... (block 1's full chain in reverse) ...
AddBackward (position-embed) → grad gets routed to both inputs
EmbeddingBackward → updates embedding rows for the input tokens

Each *Backward node knows exactly what gradient transformation to apply because it captured the relevant forward inputs (saved tensors) during the forward pass.

Saved tensors — what each node holds

A subtlety: autograd nodes save just enough forward state to compute the backward. For example:

  • MulBackward(a, b) saves both a and b (because \(\partial(ab)/\partial a = b\) needs both).
  • AddBackward saves nothing (because \(\partial(a+b)/\partial a = 1\) needs no inputs).
  • SoftmaxBackward saves the output (because the softmax Jacobian is expressed in terms of softmax(x), not x).
  • GeluBackward saves the input (because the GELU derivative is a function of x).
  • AddmmBackward saves the inputs (input, weight, bias) — all three.

For the mini-GPT, the total saved-tensor memory is approximately:

\[\text{saved} \approx \sum_\text{forward nodes} \text{(input + maybe output size)} \times s\]

For \(T = 8, d_\text{model} = 64, V = 512, L = 2\):

  • Embedding output saved at step 1: \(8 \cdot 64 \cdot 4 = 2\) KiB.
  • LayerNorm outputs (×6 across blocks): \(8 \cdot 64 \cdot 4 \cdot 6 = 12\) KiB.
  • Attention intermediates (QKV outputs, scores, attn weights, attn output): \(\approx 50\) KiB.
  • FFN intermediates: \(\approx 30\) KiB.
  • LM head output: \(8 \cdot 512 \cdot 4 = 16\) KiB.

Total: \(\approx 130\) KiB. The activation memory for one forward pass at \(T = 8\). This matches roughly the KV-cache size from Phase 22 (theory/05-mini-gpt-memory-worked-example.md) — both are \(O(T \cdot d_\text{model} \cdot L)\), just with different constants.

For training, activations dominate memory (not weights, not KV cache). At GPT-3 scale, this is the reason gradient checkpointing exists. At §A13 scale, activations are 130 KiB and we don't need checkpointing — but the methodology is identical.

Visualizing the graph

PyTorch has torchviz:

from torchviz import make_dot
graph = make_dot(loss, params=dict(model.named_parameters()))
graph.render("mini_gpt_backward")  # produces a .pdf

The output is a DAG with ~50 nodes (40 backward + parameters + input tensors). At §A13 scale, the graph is readable on one page. At GPT-2 scale (12 layers), it's a wall.

Common gotchas the tape exposes

  1. In-place ops break the tape. If you do h.add_(residual) instead of h = h + residual, the autograd node for the original h is overwritten. If anyone downstream of h needed the original value, the backward fails with RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation. Phase 25 break 00-... causes this on purpose.
  2. Detach truncates the tape. h.detach() returns a tensor with no grad_fn. Useful for "I want this value but don't propagate gradients through it" (e.g., target Q-network in RL). Used incorrectly, gradients silently vanish.
  3. with torch.no_grad(): doesn't record any grad_fn. Used for inference. If wrapped around a training step, no gradients are computed and .backward() errors.
  4. Retain graph. By default, .backward() frees the saved tensors after backward. If you call .backward() twice on the same loss, the second call fails. Use loss.backward(retain_graph=True) to keep them — but only if you actually need to.

Why this matters

By Phase 25, you have already implemented autograd by hand twice: scalar version (Phase 7), tensor version (Phase 8). PyTorch's autograd is just the production-quality version of what you already wrote. Walking the tape for the mini-GPT shows:

  • The graph is finite, finite-sized, and inspectable.
  • Every concept maps to something you've written.
  • "Autograd magic" was always math; the magic is the bookkeeping.

For the Phase-32 grammar-tutor agent (in Phase 32 itself, we don't write a new autograd; we use PyTorch's). Knowing the tape lets you debug "why is my custom loss not training" by reading the graph rather than guessing.

Citation

Paszke, A. et al. (2017). Automatic differentiation in PyTorch. NeurIPS 2017 Autodiff Workshop. https://openreview.net/forum?id=BJJsrmfCZ — the original PyTorch autograd paper. Section 3 describes the tape; section 4 discusses the dynamic graph. The mini-GPT walk above is one instance of the design they describe.

One-paragraph recap

A forward pass through the 2-layer mini-GPT at \(T = 8\) produces a ~40-node autograd graph; .backward() walks the graph in reverse, propagating \(\partial L / \partial t\) from each node to its inputs using the local Jacobian. Each node saves just enough forward state for its backward; at §A13 scale, total saved-tensor memory is ~130 KiB. In-place ops, .detach(), torch.no_grad(), and the saved-tensor lifecycle are the four mechanisms that change graph behavior. Phase 25's break exercise demonstrates the in-place hazard on this exact graph.


Cross-refs: theory/02-autograd-engine.md (the autograd machinery in general), Phase 7-8 (the hand-built versions), Phase 17 README.md (the mini-GPT spec), Phase 22 theory/05-... (activation vs KV cache memory).