00 — Why glue now, and what the residual stream is

After four phases of building pieces, we now glue them together. What is new is not any single operation; it is the arrangement: the residual stream as the central highway, the sublayers as side-trips that read, transform, and write back. That mental image is the most valuable piece of Phase 17.

The setup

You've already built every component you need:

| Component | Phase | What it does |
|---|---|---|
| Token embedding \(E\) | 13 | \(\text{token\_id} \mapsto \mathbb{R}^{d_\text{model}}\) |
| Positional encoding (RoPE) | 16 | Injects position info inside attention's Q/K |
| Multi-head attention | 15 | Mixes information across tokens |
| Linear / Tensor autograd | 7–8 | The substrate for FFN and projections |
| LayerNorm intuition | 10 | The normalization-then-affine pattern |

What Phase 17 adds is arrangement: how these parts compose into a transformer block, how blocks compose into a stack, how the stack connects to a vocabulary head, and — most importantly — what the invariant is across all this composition: the residual stream.

The residual stream

The single most useful mental model for transformers is the residual stream. Picture a stream of vectors \(h \in \mathbb{R}^{d_\text{model}}\), one per token position, flowing from the bottom of the network to the top. The stream starts at the token embedding (\(h^{(0)} = E[\text{token}]\)) and ends at the LM head (\(\text{logits} = h^{(\text{final})} \cdot E^\top\)).
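A minimal sketch of the two ends of the stream, assuming the weight-tied LM head that the formula above implies. The table `E`, the helper `dot`, and the random values are illustrative, not the course's actual code. A real model would run the blocks (and typically a final LayerNorm) between the two steps.

```python
import random

random.seed(0)
vocab_size, d_model = 64, 64  # sizes used by Mini-GPT

# Hypothetical embedding table E: one d_model-dimensional row per token id.
E = [[random.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(vocab_size)]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

token_id = 7
h0 = E[token_id]        # stream start: h^(0) = E[token]
h_final = h0            # stand-in: no blocks are applied in this sketch
logits = [dot(h_final, row) for row in E]  # stream end: logits = h · E^T

print(len(h0), len(logits))
```

Each logit is the dot product of the final stream vector with one row of `E`, so the same matrix is used to enter the stream and to leave it.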

Every sublayer — attention, FFN — is a side-trip: it reads from the stream, does some computation, and adds its output back to the stream. The Pre-LN structure makes this explicit:

\[h_{\text{new}} = h_{\text{old}} + \text{sublayer}(\text{LN}(h_{\text{old}}))\]

Read: "the new state is the old state plus a normalized-and-transformed update." Nothing replaces; everything accumulates. This has profound implications:

  1. Skip connections everywhere. The identity term in the update gives gradients a direct path down the stream that no sublayer can attenuate, which is what made training deep networks tractable (He et al. 2015 for the original ResNet insight; Vaswani et al. 2017 for transformers).
  2. Sublayers are additive perturbations. Each block makes a small additive contribution. The stream is the "shared workspace"; sublayers write into it.
  3. Information persists. A fact written at layer 3 is still there at layer 11 unless a later sublayer subtracts it.
  4. Width is fixed, depth varies. The stream has the same width \(d_\text{model}\) from bottom to top. Adding layers makes the network deeper, not wider.
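The points above can be checked on a toy stream. The sketch below assumes a LayerNorm without learned affine parameters and uses a made-up `toy_sublayer` in place of attention or FFN; only the update rule \(h_{\text{new}} = h_{\text{old}} + \text{sublayer}(\text{LN}(h_{\text{old}}))\) is taken from the text.

```python
import math
import random

def layer_norm(h, eps=1e-5):
    """Normalize one position's vector to zero mean and unit variance (no affine)."""
    mean = sum(h) / len(h)
    var = sum((x - mean) ** 2 for x in h) / len(h)
    return [(x - mean) / math.sqrt(var + eps) for x in h]

def toy_sublayer(x, scale):
    """Hypothetical stand-in for attention or FFN: any map R^d -> R^d works here."""
    return [scale * math.tanh(v) for v in x]

def pre_ln_step(h, scale):
    """Apply h_new = h_old + sublayer(LN(h_old)); return the new state and the update."""
    update = toy_sublayer(layer_norm(h), scale)
    return [a + b for a, b in zip(h, update)], update

random.seed(0)
d_model = 64
h0 = [random.gauss(0.0, 1.0) for _ in range(d_model)]

h, updates = h0, []
for scale in (0.1, 0.2, 0.3, 0.4):  # four sublayers, i.e. two blocks
    h, u = pre_ln_step(h, scale)
    updates.append(u)

# The final state is the embedding plus the sum of every sublayer's write.
reconstructed = [h0[i] + sum(u[i] for u in updates) for i in range(d_model)]
max_err = max(abs(a - b) for a, b in zip(h, reconstructed))
print(len(h), max_err < 1e-9)
```

The width stays at `d_model` after every step, and the final state decomposes exactly into the starting vector plus one additive term per sublayer.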

This perspective comes from Anthropic's mechanistic interpretability work (Elhage et al. 2021, "A Mathematical Framework for Transformer Circuits"). It's the mental model worth holding. Memorise it.

What attention and FFN actually do, from the residual stream view

  • Attention sublayer. Reads the stream at every token position, computes a weighted mix across positions, writes the mix back. It is the token-mixer. Without attention, the network has no way to share information across positions.
  • FFN sublayer. Reads the stream at one token position, applies the same nonlinear map independently at every position, and writes the result back. It is the feature-mixer. Without FFN, the model loses its main source of nonlinearity: with the attention weights held fixed, a stack of attention sublayers is linear in the input embeddings (softmax and LayerNorm are the only nonlinearities left).

The order — attention first, then FFN — is canonical. They form a pair: gather information across positions (attention), then refine it pointwise (FFN). Repeat \(n_\text{layers}\) times.
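The division of labour can be made concrete with a toy experiment: perturb one position and see which outputs move. This is a sketch under loud assumptions. `causal_mean_attention` uses uniform weights instead of learned softmax attention, `toy_ffn` is an arbitrary per-position map, and LayerNorm is omitted; none of these are the course's implementations.

```python
def causal_mean_attention(stream):
    """Toy token-mixer: position t outputs the mean of positions 0..t."""
    out = []
    for t in range(len(stream)):
        window = stream[: t + 1]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out

def toy_ffn(stream):
    """Toy feature-mixer: mixes features within a position, never across positions."""
    return [
        [max(0.0, vec[i] - vec[(i + 1) % len(vec)]) for i in range(len(vec))]
        for vec in stream
    ]

def block(stream):
    """Attention first, then FFN, each added back to the stream."""
    mixed = causal_mean_attention(stream)
    stream = [[a + b for a, b in zip(h, m)] for h, m in zip(stream, mixed)]
    refined = toy_ffn(stream)
    return [[a + b for a, b in zip(h, r)] for h, r in zip(stream, refined)]

# Three positions, width 2. Only position 0 differs between the two inputs.
base = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
bumped = [[9.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

changed_by_ffn = [a != b for a, b in zip(toy_ffn(base), toy_ffn(bumped))]
changed_by_attn = [
    a != b for a, b in zip(causal_mean_attention(base), causal_mean_attention(bumped))
]
print(changed_by_ffn)   # [True, False, False]
print(changed_by_attn)  # [True, True, True]

# Stacking: repeat the block n_layers times (2 here); the shape never changes.
out = base
for _ in range(2):
    out = block(out)
```

The FFN only reacts at the position that changed, while the attention stand-in propagates the change to every later position, which is the sense in which one is a feature-mixer and the other a token-mixer.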

Why "Mini" — the locked config

We lock the smallest interesting configuration:

d_model     = 64       # residual stream width
n_heads     = 4        # so each head is d_model / n_heads = 16-dim
n_layers    = 2        # two stacked blocks
d_ff        = 256      # FFN inner dim, = 4 * d_model (canonical)
vocab_size  = 64       # 60-token English+Spanish verb-form vocabulary, padded
context_len = 32       # max sequence length (canonical example is 8 tokens)

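This is small enough to fit on Borja's laptop and small enough to trace by hand. Lab 02 will count every parameter; a direct count from this config gives about 103k, assuming tied embeddings and no biases (two blocks of roughly 49k each, plus 4,096 embedding weights). For comparison, GPT-2 small is 124M, roughly 1,200× larger. The point of Mini-GPT is not capability; it's complete understanding of every parameter.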
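A sketch of how a parameter count can be derived from the locked config. The class name `MiniGPTConfig` and the function `count_params` are hypothetical, and the accounting rests on assumptions the text does not confirm: four \(d_\text{model} \times d_\text{model}\) attention projections, a two-matrix FFN, two LayerNorms per block with scale and shift, one final LayerNorm, no parameters for RoPE, and an LM head tied to \(E\) by default.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MiniGPTConfig:
    """Hypothetical container for the locked config values."""
    d_model: int = 64
    n_heads: int = 4
    n_layers: int = 2
    d_ff: int = 256
    vocab_size: int = 64
    context_len: int = 32

def count_params(cfg, biases=False, tied_head=True):
    """Count learnable parameters under the assumptions stated above."""
    d, f, v = cfg.d_model, cfg.d_ff, cfg.vocab_size
    embedding = v * d
    attn = 4 * d * d + (4 * d if biases else 0)    # W_Q, W_K, W_V, W_O
    ffn = 2 * d * f + ((f + d) if biases else 0)   # up- and down-projection
    layer_norms = 2 * (2 * d)                      # two LNs per block, scale + shift
    block = attn + ffn + layer_norms
    final_ln = 2 * d
    head = 0 if tied_head else v * d               # a tied head reuses E
    return embedding + cfg.n_layers * block + final_ln + head

cfg = MiniGPTConfig()
assert cfg.d_model % cfg.n_heads == 0  # heads must split the width evenly
head_dim = cfg.d_model // cfg.n_heads
total = count_params(cfg)
print(head_dim, total)
```

With the defaults this returns 103,040; enabling biases gives 104,192 and untying the head gives 107,136. Lab 02's own inventory is the reference and may differ if it makes different choices about biases, tying, or the final LayerNorm.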

What you should be able to do after this phase

  1. Trace a forward pass on paper. Given the canonical 8-token sequence and a (mock) initialized model, you should be able to write down the shape and a rough sketch of the value at every intermediate tensor — including inside attention and FFN — without consulting code.
  2. Predict parameter counts from architecture. Given \(d_\text{model}, n_\text{heads}, n_\text{layers}, d_\text{ff}, |V|\), you can write down the parameter count to the digit, including or excluding biases.
  3. Read transformer code in the wild. GPT-2, LLaMA, Pythia — all variations on the same Pre-LN theme. You should be able to open any of them, ignore the framework noise, and recognise the residual stream + sublayer pattern.
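  4. Spot the difference between Pre-LN and Post-LN. Post-LN normalizes after the residual add, \(h_{\text{new}} = \text{LN}(h_{\text{old}} + \text{sublayer}(h_{\text{old}}))\), so the stream itself passes through a LayerNorm at every step; Pre-LN leaves the identity path untouched. And know why Pre-LN won.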
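For the first goal, the shape half of a paper trace can be sketched as a table of tensor names and shapes. The tensor names below are made up for illustration, the batch dimension is dropped, and a final LayerNorm before the logits is assumed; the sizes come from the locked config and the 8-token canonical sequence.

```python
def trace_shapes(T=8, d_model=64, n_heads=4, n_layers=2, d_ff=256, vocab_size=64):
    """Return (name, shape) pairs for one forward pass over T tokens."""
    head_dim = d_model // n_heads
    trace = [("token_ids", (T,)), ("h0 = E[token_ids]", (T, d_model))]
    for layer in range(n_layers):
        p = f"block{layer}."
        trace += [
            (p + "ln1(h)", (T, d_model)),
            (p + "q, k, v (per head)", (n_heads, T, head_dim)),
            (p + "attn scores / weights", (n_heads, T, T)),
            (p + "weights @ v", (n_heads, T, head_dim)),
            (p + "merged heads -> W_O", (T, d_model)),
            (p + "h + attn", (T, d_model)),
            (p + "ln2(h)", (T, d_model)),
            (p + "ffn hidden", (T, d_ff)),
            (p + "ffn out", (T, d_model)),
            (p + "h + ffn", (T, d_model)),
        ]
    trace += [("final ln", (T, d_model)), ("logits", (T, vocab_size))]
    return trace

trace = trace_shapes()
for name, shape in trace:
    print(f"{name:32s} {shape}")
```

Every residual write (`h + attn`, `h + ffn`) has shape `(T, d_model)`; only the tensors inside a sublayer ever take a different shape.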

The Phase 17 → Phase 18 handoff

Phase 17 produces:

  • A frozen MiniGPT class with a working forward pass.
  • A locked architecture config (config.yaml) — never changed in Phase 18.
  • A parameter inventory you trust to the digit.
  • A forward-pass numerical reference (lab 03 of Phase 17).

Phase 18 picks this up and adds: cross-entropy loss, training loop, gradient computation, optimizer, learning-rate schedule, gradient clipping, checkpointing, the loss-curve plot. The forward pass itself does not change.

What this file does NOT cover

  • The math of LayerNorm. Sketched in §0 of 01-transformer-block.md, but the full derivation is back in Phase 10.
  • The math of attention. Phase 15. Phase 17 treats attention as a black box: it has Q, K, V, output projection, RoPE-applied positions, a causal mask, and a fixed shape.
  • Initialization choices. Phase 18. Phase 17 uses simple Gaussian init from Phase 10.

Next: 01-transformer-block.md