00 — Motivation: attention as differentiable information routing¶
The attention pipeline, end to end¶
Attention is not "magic"; it is a differentiable routing mechanism. Each output token is a weighted combination of all the input tokens, with weights the model learns. That resolves the two problems Phase 14 left open: long gradient paths (every token-token pair gets a direct path) and parallelism (the computation runs over the whole sequence at once).
This file answers: what problem is attention solving?
The two problems Phase 14 left open¶
At the end of Phase 14 we had two precise failures of recurrent models:
- Vanishing/exploding gradient through time. The gradient from a loss at step \(T\) back to an input at step \(t\) flows through \((T - t)\) matrix multiplications by \(W_{hh}\). The product vanishes or explodes geometrically.
- Serial compute over the sequence axis. Computing \(h_t\) requires \(h_{t-1}\). The chain is inherently sequential; no parallelism possible across the time axis.
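The geometric behaviour in problem 1 is easy to see numerically. The sketch below is a simplification, not the Phase 14 model: it ignores the activation derivative and uses a hypothetical \(W_{hh}\) built as a scaled orthogonal matrix, so the norm of the repeated product is exactly the scale raised to the number of steps.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden size (illustrative assumption)

# Random orthogonal matrix: every singular value is exactly 1.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

def product_norm(scale, steps):
    """Spectral norm of W_hh multiplied by itself `steps` times, W_hh = scale * Q."""
    W_hh = scale * Q
    P = np.eye(d)
    for _ in range(steps):
        P = W_hh @ P
    return np.linalg.norm(P, 2)

# A scale below 1 shrinks the product, a scale above 1 blows it up.
for steps in (1, 10, 50):
    print(steps, product_norm(0.9, steps), product_norm(1.1, steps))
```

With a real recurrent cell the picture is messier, but the same geometric shrinking or growth drives it.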
GRU and LSTM cells mitigate problem 1 (through an additive path for the gradient) but do nothing for problem 2. Attention addresses both by replacing the recurrence entirely.
The reframing: routing, not recurrence¶
The recurrent paradigm says: "to produce output \(y_t\), summarize the prefix into a fixed-size state \(h_{t-1}\), then combine with \(x_t\)".
Attention says: "to produce output \(y_t\), look at every prior token \(x_1, \ldots, x_t\), and decide how much each contributes".
Concretely: \(y_t\) is a weighted sum of (transformed) prior token representations. The weights \(\alpha_{t,j}\) for \(j = 1, \ldots, t\) tell us how much position \(j\) contributes to the output at position \(t\). The weights are computed by the model — they're learned — so attention is a differentiable, content-addressed lookup.
In equation form (we'll derive this carefully in theory/02-scaled-dot-product.md):
\[ y_t = \sum_{j=1}^{t} \alpha_{t,j} \, v_j \]
where \(v_j\) is the "value" associated with input position \(j\), and \(\alpha_{t,j}\) is the attention weight from position \(t\) to position \(j\).
The weights \(\alpha_{t,j}\) sum to 1 over \(j\) (they form a probability distribution), and they're computed by a separate small network on the fly. The trick is that this small network is itself parameterized by a few learnable matrices, and its outputs are content-dependent — different prefixes produce different weights.
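A minimal NumPy sketch of this weighted sum for a single position \(t\). It assumes the weights come from scaled dot products between a query and per-position keys; the sizes and the random inputs are illustrative, not values from the labs.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
t, d_k, d_v = 5, 4, 3              # prefix length and sizes (illustrative)
q_t = rng.standard_normal(d_k)     # query emitted at position t
K = rng.standard_normal((t, d_k))  # one key per position in the prefix
V = rng.standard_normal((t, d_v))  # one value v_j per position in the prefix

alpha_t = softmax(K @ q_t / np.sqrt(d_k))  # weights alpha_{t,j}, shape (t,)
y_t = alpha_t @ V                          # weighted sum of values, shape (d_v,)

print(alpha_t, alpha_t.sum())
print(y_t)
```

Changing `q_t` or `K` changes `alpha_t`, which is the content dependence described above.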
Why this fixes problem 1 (gradient)¶
Backpropagation from \(y_t\) to input \(x_j\) flows through one softmax weight \(\alpha_{t,j}\), not through \(t - j\) matrix multiplications. One soft routing decision.
This is why transformers can handle context lengths of thousands of tokens that RNNs cannot. The gradient from token 1000 to token 1 passes through the attention weights of a single layer, so every pair has a direct path and there is no long product of recurrent matrices to vanish or explode.
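The direct path can be written as a one-line derivative. This is a sketch that differentiates with respect to the value \(v_j\) only; \(I\) denotes the identity matrix with the same dimension as \(v_j\):

\[
y_t = \sum_{j=1}^{t} \alpha_{t,j} \, v_j
\quad\Longrightarrow\quad
\frac{\partial y_t}{\partial v_j} = \alpha_{t,j} \, I
\]

The weights \(\alpha_{t,j}\) do not depend on the values, so this is the whole derivative with respect to \(v_j\): a single scalar weight, with no product over \(t - j\) steps. The path back to \(x_j\) adds the projection that produces \(v_j\) and a second term through the softmax scores, and both stay within this one layer.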
Why this fixes problem 2 (parallelism)¶
To compute \(y_1, y_2, \ldots, y_T\), you compute all \(\alpha_{t,j}\) at once: that's a \(T \times T\) matrix. Then you compute the weighted sums: that's a \(T \times T\) times \(T \times d_v\) matmul.
No step depends on the previous step's output. Modern hardware loves this — it's two big matrix multiplications, parallelizable across cores, GPUs, or TPUs trivially.
The RNN's training loop is a for t in range(T) over the sequence axis. The transformer replaces that loop with a couple of matmuls over the whole sequence axis. For \(T = 1000\) and a GPU with 10000+ cores, that's the difference between 1000 sequential micro-operations and a few fat matmuls.
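The claim that no step depends on another step's output can be checked directly. The sketch below computes causal attention twice, once position by position and once as two matmuls with a mask, and compares the results. Shapes and random inputs are illustrative assumptions. The loop is not an RNN: it is only there to show that no iteration reads another iteration's result.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
T, d_k, d_v = 6, 4, 3
Q = rng.standard_normal((T, d_k))
K = rng.standard_normal((T, d_k))
V = rng.standard_normal((T, d_v))

# Position-by-position form: each t looks at positions 0..t.
Y_loop = np.zeros((T, d_v))
for t in range(T):
    scores_t = K[: t + 1] @ Q[t] / np.sqrt(d_k)
    Y_loop[t] = softmax(scores_t) @ V[: t + 1]

# Parallel form: all T x T scores at once, future positions masked out.
scores = Q @ K.T / np.sqrt(d_k)                     # (T, T)
future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True where j > t
alpha = softmax(np.where(future, -np.inf, scores))  # rows sum to 1
Y_matmul = alpha @ V                                # (T, d_v)

print(np.allclose(Y_loop, Y_matmul))
```

Because the iterations are independent, the loop collapses into the two matmuls `Q @ K.T` and `alpha @ V`.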
What attention looks like on our worked example¶
The canonical Phase 15 example is the 8-token sequence (positions 0 through 7):
I work , you work , he ___
The task is to predict the token at position 7. The right answer is works (3rd-person singular present-simple of work, agreeing with he).
What should attention from position 7 look like? Think mechanically:
- To choose the correct verb form, the model needs to know the subject pronoun. That's he at position 6.
- The model also needs to know the verb stem. The most recent verb stem is work at position 4 (or 1; both are valid templates).
- The model needs to know the tense pattern. From the absence of will/is going to and the presence of pronouns followed by bare verbs, this is present-simple.
So an ideal attention head dedicated to "predict the next verb form" should produce, at position 7, a weight distribution that puts most mass on positions 6 (he) and 4 (work), with maybe some mass on 1 (the earlier work). The weights at positions 2, 3, 5 (separators and the redundant pronoun you) should be lower.
That's what attention does. A weight distribution over prior positions, learned to highlight the relevant ones for the prediction at hand.
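To make the target pattern tangible, here is a sketch with hand-picked scores. The numbers are hypothetical, chosen only to reproduce the shape described above; they are not learned or measured values. It also assumes each comma is its own token and that position 7 attends over positions 0 through 6.

```python
import numpy as np

# Positions 0..6 of the worked example; position 7 is the slot to predict.
tokens = ["I", "work", ",", "you", "work", ",", "he"]

# Hand-picked scores (hypothetical): high for "he" and the most recent "work".
scores = np.array([0.0, 1.5, -1.0, 0.0, 2.5, -1.0, 3.0])

# Softmax turns the scores into a weight distribution over prior positions.
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# List positions from most to least attended.
for pos in np.argsort(-weights):
    print(pos, tokens[pos], round(float(weights[pos]), 3))
```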
Critical caveat: untrained attention is random. Phase 15 only verifies the mechanism. Phase 18 trains it and checks whether this pattern actually emerges.
Multiple heads = multiple specialists¶
A single attention head produces one weighted average. For a complex prediction, the model often needs to combine multiple kinds of context — e.g.:
- A head that attends to the subject pronoun (for person agreement).
- A head that attends to the auxiliary token (for tense identification).
- A head that attends to the English↔Spanish alignment (I work / yo → predict trabajo).
Multi-head attention runs \(H\) attention computations in parallel, each with its own Q, K, V projections, then concatenates their outputs. The model can use different heads for different routing patterns.
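A minimal sketch of that description, assuming unmasked attention and a plain Python loop over heads for readability. The sizes `d_model = 16` and `d_head = d_model // H` are illustrative assumptions, and the projection matrices are random stand-ins for learned parameters.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention_head(X, W_q, W_k, W_v):
    """One head: project to Q, K, V, then softmax(QK^T / sqrt(d_k)) V."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
T, d_model, H = 8, 16, 4
d_head = d_model // H

X = rng.standard_normal((T, d_model))
heads = []
for _ in range(H):
    # Each head gets its own Q, K, V projections.
    W_q, W_k, W_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))
    heads.append(attention_head(X, W_q, W_k, W_v))

Y = np.concatenate(heads, axis=-1)  # (T, H * d_head)
print(Y.shape)
```

Each head produces its own \(T \times T\) weight matrix, which is where the different routing patterns live.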
In Phase 17, the Mini-GPT will use \(H = 4\) heads. We'll observe in Phase 18 (post-training) whether the heads naturally specialize as described, or whether the assignment is more diffuse. (Spoiler: at this scale, head specialization is partial — clean head-by-head interpretation is a research topic, not a guaranteed outcome.)
The bridge to Phase 16¶
Attention has one crucial property that will need to be patched in Phase 16:
Attention is permutation-equivariant.
That is: if you permute the input tokens, the output is permuted by the same permutation. The model doesn't know about order, because the attention computation has no notion of which position \(j\) is in relation to position \(t\) — only of what's at position \(j\) (the value \(v_j\)) and how much it matches the query at \(t\).
For our verb-grammar task, this is a disaster. "I work, you work, he ___" and a shuffled version such as ", he , you work I work" contain the same tokens, just in a different order. Attention without position information cannot distinguish them.
Phase 16 adds positional encoding to fix this. Phase 15 builds the position-blind core. Both are necessary.
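Permutation equivariance can be verified numerically. The sketch below assumes unmasked self-attention with random, hypothetical projection matrices: shuffling the input rows shuffles the output rows in the same way and changes nothing else.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Unmasked single-head self-attention with no positional information."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

rng = np.random.default_rng(0)
T, d = 7, 5
X = rng.standard_normal((T, d))
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

perm = rng.permutation(T)                        # a random reordering of positions
Y = self_attention(X, W_q, W_k, W_v)
Y_perm = self_attention(X[perm], W_q, W_k, W_v)  # same tokens, shuffled order

equivariant = np.allclose(Y_perm, Y[perm])
print(equivariant)
```

Each token's output vector is identical in both orderings, so nothing downstream can recover the order from it.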
Why "differentiable dictionary lookup" is the right intuition¶
A regular dictionary lookup: given a key, return the value at that key. The lookup is discrete — keys match exactly or they don't.
A differentiable lookup: given a query, compute a similarity score between the query and every key. Convert the scores to a probability distribution (softmax). Return a weighted average of the values.
This is what attention does:
- Each input token gets a key (what it is, for matching purposes).
- Each input token gets a value (what to return if matched).
- The current position emits a query (what it's looking for).
- The query is compared to every key (dot product). The scores go through softmax to become a distribution. The output is the weighted sum of values.
The discrete dictionary lookup is the limiting case where one \(\alpha\) is 1 and the rest are 0. The differentiable version smoothly interpolates between "attend to one position" and "attend to all positions uniformly".
The "softness" is what makes it learnable: gradients flow through the softmax to adjust both the keys (so they match the right queries) and the values (so the right information is at each key).
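The interpolation between uniform averaging and exact lookup can be sketched with a toy dictionary. The `sharpness` multiplier is a hypothetical knob added for illustration; it is not part of the attention equation used in this phase. The keys and values are made up.

```python
import numpy as np

def soft_lookup(query, keys, values, sharpness=1.0):
    """Softmax-weighted average of values, scored by query-key dot products."""
    scores = sharpness * (keys @ query)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights, weights @ values

keys = np.eye(3)                             # three orthogonal toy keys
values = np.array([[10.0], [20.0], [30.0]])  # one value per key
query = np.array([0.0, 0.0, 1.0])            # matches the last key

w_uniform, out_uniform = soft_lookup(query, keys, values, sharpness=0.0)
w_soft, out_soft = soft_lookup(query, keys, values, sharpness=1.0)
w_hard, out_hard = soft_lookup(query, keys, values, sharpness=50.0)

print(w_uniform, out_uniform)  # equal weights: plain average of the values
print(w_soft, out_soft)        # in between
print(w_hard, out_hard)        # nearly one-hot: behaves like a discrete lookup
```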
What this phase will not teach you¶
For clarity, the following are out of scope in Phase 15:
- Training attention. Phase 18. We only verify the forward pass and shape correctness here.
- Comparing to PyTorch. Anti-goal §10 (no PyTorch before Phase 24); we match against a hand-derived 2-token example instead. PyTorch cross-check moves to Phase 25.
- Flash attention, paged attention, sparse attention. Phase 27. We implement the naive \(O(T^2)\) form on purpose so Phase 27's optimizations have a target.
- Cross-attention. Mentioned in one line ("same equation, Q from one sequence, K/V from another"). We are building a decoder-only model; cross-attention won't appear elsewhere in the curriculum.
- Positional encoding. Phase 16.
What you should feel by the end of the phase¶
Three sensations:
- The equation as a routing operation. \(\text{softmax}(QK^\top / \sqrt{d_k}) V\) should read, when you look at it, as: "compute pairwise similarities, normalize them, use them as weights for the values". Each piece does one job.
- The variance argument for \(\sqrt{d_k}\). This is the single piece of math in attention that surprises everyone the first time. Once you've derived it, the "why is there a square root" question goes away forever.
- Multi-head as specialization capacity. Without multi-head, attention can only attend in one pattern at a time. With multi-head, the model has \(H\) patterns. Whether they specialize as you'd like is a training question, but the capacity to specialize is built in.
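The variance argument for \(\sqrt{d_k}\) can be previewed empirically before deriving it. This sketch assumes query and key entries are independent with zero mean and unit variance, which is the usual starting assumption for the argument; the helper name and sample count are illustrative.

```python
import numpy as np

def dot_product_variance(d_k, n_samples=20_000, seed=0):
    """Empirical variance of q.k before and after dividing by sqrt(d_k)."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    dots = np.sum(q * k, axis=1)
    return dots.var(), (dots / np.sqrt(d_k)).var()

# Raw variance grows roughly like d_k; the scaled variance stays near 1.
for d_k in (4, 16, 64):
    raw, scaled = dot_product_variance(d_k)
    print(d_k, round(float(raw), 2), round(float(scaled), 2))
```

Without the division, larger \(d_k\) produces larger scores and a sharper softmax.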
The path through Phase 15¶
- Theory 01 does Q, K, V from the dictionary-lookup analogy. Why three matrices.
- Theory 02 derives the full equation. The \(\sqrt{d_k}\) argument. Softmax stability.
- Theory 03 does multi-head. The "multiple specialists" view.
- Theory 04 does masking. Causal mask, padding mask, additive-vs-multiplicative.
- Labs 00–03 implement everything in NumPy and verify correctness against hand-derived references.
Stop here if¶
You're tempted to skim Phase 15. Don't. Every later phase imports from attention.py. Every later phase assumes you can derive the math. A skim now is two weeks of confusion in Phases 17–22. The depth here is the point.
The question you need to be able to answer when you leave this phase: "why exactly did attention replace RNNs?" If your answer does not include both points (direct gradient path + parallelism over the sequence), the phase has not landed. Read it again.
Next: theory/01-query-key-value.md.