Phase 15 — Attention from Scratch

Requires: 13 — Embeddings & Representation Spaces · 14 — Pre-Transformer Sequence Models

Teaches: attention · scaled-dot-product · multi-head · causal-mask · query-key-value

Jump to any chapter from the phase reference index.

Pre-written per A12; pivoted to English verb grammar per A13. This phase entry exists before Borja begins study. Theory and lab problem statements are stable drafts; solutions are written just-in-time at phase open.

Attention is the central trick of the transformer and, by extension, of everything we call "modern AI". This phase derives the equation \(\text{softmax}(QK^\top / \sqrt{d_k}) V\) from first principles: it is not a magic formula, but three learnable projections and a normalized dot product. If this phase lands, the rest of the curriculum holds together; if it doesn't, everything that follows is magic.


Goal

Derive scaled dot-product attention from first principles, implement multi-head causal attention in NumPy, and verify against a hand-derived reference. By the end of Phase 15:

  1. Borja can write the attention equation from memory.
  2. Borja can explain — concretely — why each piece (Q, K, V, \(\sqrt{d_k}\), softmax, multi-head, mask) exists.
  3. src/minimodel/attention/ exists, implements multi-head causal attention in ~100 LOC of NumPy, and is the foundation for Phases 17, 18, 22, 25, 27.

This is the central derivation phase of the curriculum. Take it slowly. The depth here is intentional — five theory files, four lab files. Plan for 15–25 study hours.
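As a reference point for the goal above, here is a minimal single-head sketch of \(\text{softmax}(QK^\top / \sqrt{d_k}) V\) in NumPy. The function names (`softmax`, `scaled_dot_product_attention`), the shapes, and the random projection matrices are illustrative assumptions; they are not the actual API of src/minimodel/attention/, which is decided at phase open.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtracting the row max leaves the result unchanged mathematically,
    # but keeps exp() from overflowing on large scores.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention. Q, K: (T, d_k); V: (T, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (T, T) query-key similarities
    weights = softmax(scores, axis=-1)   # each row is a distribution over positions
    return weights @ V, weights          # (T, d_v), (T, T)

# Hypothetical toy sizes and random (untrained) projections.
rng = np.random.default_rng(0)
T, d_model, d_k = 4, 8, 8
X = rng.normal(size=(T, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape, weights.sum(axis=-1))
```

Each row of `weights` sums to 1, so every output row is a weighted average of the rows of `V`.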

The worked example

Throughout Phase 15, the canonical short sequence is over English verb conjugations:

tokens:    I    work    ,    you    work    ,    he    ___

The model has to predict the masked token (___). The right answer is works (3rd-person singular agreement with he).

An ideal trained attention head responsible for person agreement should produce a row in the attention matrix where the position of ___ attends mostly to he (the immediately preceding pronoun), with smaller weight on the other tokens. This is the qualitative target, but it will not appear in Phase 15 itself (untrained model = random weights = random attention); it appears in Phase 18, after training. Phase 15 builds the mechanism; Phase 18 makes it learn this pattern.
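To make "random weights = random attention" concrete, the sketch below computes the attention row for ___ on the canonical sequence using random queries and keys. The random `Q` and `K`, the seed, and `d_k = 16` are assumptions standing in for an untrained model; nothing here reflects a trained head.

```python
import numpy as np

tokens = ["I", "work", ",", "you", "work", ",", "he", "___"]
T = len(tokens)

# Stand-ins for an untrained head: random queries and keys (assumption).
rng = np.random.default_rng(0)
d_k = 16
Q = rng.normal(size=(T, d_k))
K = rng.normal(size=(T, d_k))

scores = Q @ K.T / np.sqrt(d_k)
scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
weights = np.exp(scores)
weights = weights / weights.sum(axis=-1, keepdims=True)

# The row for the final position shows where ___ "looks".
row = weights[tokens.index("___")]
for tok, w in zip(tokens, row):
    print(f"{tok:>5s}  {w:.3f}")
```

With random weights there is no reason for the weight on he (index 6) to stand out; the row is still a valid probability distribution over the eight positions.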

Read order

  1. theory/00-motivation.md — what attention is for. Information routing. Why RNNs (Phase 14) failed. The "ideal attention" pattern on our worked example.
  2. theory/01-query-key-value.md — the three projections derived from a dictionary-lookup analogy. Grounded in: pronoun-as-key, verb-position-as-query, verb-form-as-value.
  3. theory/02-scaled-dot-product.md — the full \(\text{softmax}(QK^\top / \sqrt{d_k}) V\) derivation with variance argument and numerical-stability rewrite. Read twice.
  4. theory/03-multi-head.md — head dimension; the "multiple specialists" view; why concatenation + output projection. Example: one head for person agreement, one for tense, one for English↔Spanish alignment.
  5. theory/04-masking.md — causal mask, padding mask; additive \(-\infty\) vs multiplicative zero.
  6. lab/00-attention-by-hand.md — derive a 2-token, 1-head example on paper; reproduce in NumPy.
  7. lab/01-multi-head-attention.md — extend to multi-head; verify against single-head when \(H = 1\). Visualize each head's pattern on the canonical 8-token example.
  8. lab/02-causal-mask.md — add masking; verify by perturbation that future tokens don't influence past outputs.
  9. lab/03-attention-perf.md — profile attention; identify the memory-bound vs compute-bound regimes; flag Phase 27 (Flash Attention) as future work.
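The variance argument covered in theory/02-scaled-dot-product.md can be checked empirically before reading the derivation. The sketch below assumes query and key components are independent with zero mean and unit variance (an idealization, not a property of real embeddings); under that assumption the dot product has variance close to \(d_k\), and dividing by \(\sqrt{d_k}\) brings it back to roughly 1.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, n_samples = 64, 20_000

# Assumption: components are i.i.d. standard normal.
q = rng.normal(size=(n_samples, d_k))
k = rng.normal(size=(n_samples, d_k))

dots = np.sum(q * k, axis=-1)            # one dot product per sample
raw_var = dots.var()                     # expected to be roughly d_k
scaled_var = (dots / np.sqrt(d_k)).var() # expected to be roughly 1

print(f"d_k = {d_k}, Var(q.k) ~ {raw_var:.1f}, Var(q.k / sqrt(d_k)) ~ {scaled_var:.2f}")
```

Without the scaling, scores grow with \(d_k\) and the softmax saturates toward a one-hot row, which shrinks its gradients; the \(\sqrt{d_k}\) divisor keeps the scores in a range where softmax stays responsive.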

solutions/ is empty during pre-write — populated at phase open after Borja's prior-phase API decisions are visible.

Definition of Done

See PHASE_15_PLAN.md §6. Briefly:

  • src/minimodel/attention/attention.py exists with single-head function and MultiHeadAttention class.
  • Hand-derived 2-token example matches the NumPy implementation to within 1e-5.
  • Causal-mask correctness test passes (perturbing future tokens doesn't change past outputs).
  • Attention heatmap on the canonical 8-token sequence committed (untrained — shape, not semantics).
  • /quiz 15 passed at ≥ 80%.
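The causal-mask check in the list above could look roughly like the following sketch. The `MultiHeadAttention` class here is a hypothetical minimal version written for illustration (random weights, no batch dimension, additive \(-\infty\) mask); the real class in src/minimodel/attention/attention.py may differ in signature and layout.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

class MultiHeadAttention:
    """Hypothetical minimal layer; not the curriculum's actual API."""

    def __init__(self, d_model, n_heads, rng):
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        scale = 1.0 / np.sqrt(d_model)
        self.W_q = rng.normal(size=(d_model, d_model)) * scale
        self.W_k = rng.normal(size=(d_model, d_model)) * scale
        self.W_v = rng.normal(size=(d_model, d_model)) * scale
        self.W_o = rng.normal(size=(d_model, d_model)) * scale

    def _split(self, x):
        # (T, d_model) -> (H, T, d_head)
        T = x.shape[0]
        return x.reshape(T, self.n_heads, self.d_head).transpose(1, 0, 2)

    def __call__(self, X, causal=True):
        T = X.shape[0]
        Q, K, V = (self._split(X @ W) for W in (self.W_q, self.W_k, self.W_v))
        scores = Q @ K.transpose(0, 2, 1) / np.sqrt(self.d_head)  # (H, T, T)
        if causal:
            # True above the diagonal marks "future" positions to hide.
            future = np.triu(np.ones((T, T), dtype=bool), k=1)
            scores = np.where(future, -np.inf, scores)
        weights = softmax(scores, axis=-1)
        heads = weights @ V                               # (H, T, d_head)
        concat = heads.transpose(1, 0, 2).reshape(T, -1)  # (T, d_model)
        return concat @ self.W_o

rng = np.random.default_rng(0)
T, d_model = 8, 16
mha = MultiHeadAttention(d_model, n_heads=4, rng=rng)
X = rng.normal(size=(T, d_model))
out = mha(X)

# Perturb positions 5..7 only; outputs at positions 0..4 must not move.
X_perturbed = X.copy()
X_perturbed[5:] += rng.normal(size=(T - 5, d_model))
out_perturbed = mha(X_perturbed)

past_unchanged = np.allclose(out[:5], out_perturbed[:5])
future_changed = not np.allclose(out[5:], out_perturbed[5:])
print(past_unchanged, future_changed)
```

The additive mask works because \(\exp(-\infty) = 0\): masked positions get exactly zero weight, and every row still has at least its diagonal entry unmasked, so the softmax denominator is never zero.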

What this phase intentionally does NOT cover

  • PyTorch comparison. Deferred to Phase 25 per anti-goal §10. Spec's "match PyTorch to 1e-5" is satisfied by matching a hand-derived reference instead.
  • Flash Attention. Phase 27. Phase 15 implements the naive \(O(T^2)\) form on purpose so Phase 27's memory-traffic argument has a target to compare against.
  • Linear attention / Performer / Linformer. Phase 36 (frontier architectures) or never.
  • Training the attention layer. Phase 18. Phase 15 is forward-pass + correctness only.
  • Cross-attention beyond a one-line note. No encoder-decoder in this curriculum.
  • Quantized attention. Phase 26.
  • Sliding-window / local attention. Out of scope; mentioned in passing.
  • Position information. Phase 16. Phase 15 explicitly assumes the input embedding already encodes position (or that position invariance is fine for the toy examples).
  • Trained-attention pattern visualization on the canonical example. Phase 18 — once we have a trained model, we re-visualize and confirm the "head attends to he when predicting works" pattern emerges. Phase 15 only commits random-attention heatmaps.

Phase 15's scope is scaled dot-product attention as the canonical mechanism — its derivation, its NumPy implementation, its correctness. Nothing more. Everything else is downstream.

Further reading

Optional — enrichment, not required to pass the phase.