Phase 16: Positional Encodings
Requires: Phase 15, Attention from Scratch
Teaches: positional-encoding · rope · sinusoidal · extrapolation
Jump to any chapter from the phase reference index.
Chapter map
Pre-written per A12; English verb grammar scope per A13. This phase entry exists before Borja begins study. Theory and lab problem statements are stable drafts; solutions are written just-in-time at phase open.
Attention without positions is permutation-equivariant: to the model, "I work he" and "he work I" are the same set of tokens in a different order, yet the correct answer to "... he ___" (works) depends crucially on what comes before the slot. This phase adds positional information as a separate piece added to the embedding (or, in RoPE, as a rotation of Q and K). We implement three variants and compare how they extrapolate beyond the training length.
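The permutation-equivariance claim can be checked in a few lines. The sketch below is illustrative only and is not the lab solution: it assumes a single attention head with no causal mask, random untrained weights, a toy width of 4, and pure-Python lists instead of a tensor library. The helper names (`self_attention`, `outputs_by_token`) are hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """(n x k) @ (k x m) for lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention: no mask, no positions."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    return matmul([softmax(row) for row in scores], v)


random.seed(0)
D = 4  # toy embedding width (assumption)


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Hypothetical random embeddings and projection weights; nothing is trained.
emb = {tok: rand_matrix(1, D)[0] for tok in ("I", "work", "he")}
wq, wk, wv = rand_matrix(D, D), rand_matrix(D, D), rand_matrix(D, D)


def outputs_by_token(sentence):
    """Run attention and return {token: output row} for distinct tokens."""
    out = self_attention([emb[t] for t in sentence], wq, wk, wv)
    return dict(zip(sentence, out))


out_a = outputs_by_token(["I", "work", "he"])
out_b = outputs_by_token(["he", "work", "I"])

# Equivariance: each token gets the same output vector in both orderings;
# only the row it sits in moves. The representation at "he" therefore cannot
# encode whether "I work" came before or after it.
max_gap = max(abs(a - b) for tok in out_a for a, b in zip(out_a[tok], out_b[tok]))
print(f"largest per-token difference: {max_gap:.2e}")
```

The difference should be at floating-point noise level. Adding a causal mask would break this symmetry on its own, which is why the sketch leaves the mask out.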
Goal
Implement three positional encoding schemes — sinusoidal, learned, RoPE — and compare their behavior. By the end of Phase 16:
- Borja can demonstrate permutation-equivariance of attention with a 3-token counter-example (on a verb-grammar fragment like "I work he").
- All three PE schemes exist in src/minimodel/positional/.
- The RoPE relative-position property is verified numerically.
- Borja has a written argument for why Phase 17's Mini-GPT should use RoPE (preferred default) or sinusoidal (fallback).
Phase 16 is a tool-shed phase. Three implementations, one comparison experiment, decide which to use downstream. Plan for 8–15 study hours.
Read order
1. theory/00-motivation.md: permutation-equivariance, the problem we're solving.
2. theory/01-sinusoidal.md: Vaswani's original PE. Derive the formula.
3. theory/02-learned-vs-sinusoidal.md: when each wins; why both fail at extrapolation; the gap RoPE/ALiBi fill.
4. theory/03-rope.md: RoPE derivation, the relative-position property, the implementation trick.
5. lab/00-permutation-equivariance.md: prove attention without PE doesn't distinguish word order.
6. lab/01-sinusoidal-pe.md: implement and visualize sinusoidal PE.
7. lab/02-rope-implementation.md: implement RoPE; verify the relative-position property numerically.
8. lab/03-extrapolation-compare.md: head-to-head comparison beyond training length.
solutions/ is empty during pre-write.
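As a preview of the formula that theory/01-sinusoidal.md derives, the original Transformer encoding sets PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The sketch below is a minimal pure-Python rendering of that formula, not the lab solution; the function name `sinusoidal_pe` and the `base` parameter are illustrative assumptions, and a real implementation would presumably be vectorized.

```python
import math


def sinusoidal_pe(seq_len, d_model, base=10000.0):
    """Return a seq_len x d_model table of sinusoidal position encodings."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even so dimensions pair up as (sin, cos)")
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model // 2):
            angle = pos / base ** (2 * i / d_model)
            row.extend((math.sin(angle), math.cos(angle)))  # dims 2i and 2i+1
        table.append(row)
    return table


pe = sinusoidal_pe(seq_len=16, d_model=8)

# The table is a pure function of (pos, i), so rows beyond any training
# length can still be computed. Whether a model handles them well is a
# separate, empirical question.
far_row = sinusoidal_pe(seq_len=1000, d_model=8)[999]
```

Each row is added to the token embedding at the same position. A learned table, by contrast, simply has no row for positions it was never allocated.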
Definition of Done
See PHASE_16_PLAN.md §6. Briefly:
- Three PE schemes implemented.
- Permutation-equivariance counter-example committed.
- RoPE numerical verification passes.
- Extrapolation comparison plot committed; pick a winner for Phase 17.
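The RoPE numerical verification could look roughly like the sketch below. It is a hedged illustration, not the lab solution: it assumes the adjacent-pair convention from the RoFormer paper (dimensions 2i and 2i+1 rotate together; some implementations pair the two halves of the vector instead), uses Python complex numbers in place of explicit 2x2 rotation matrices, and the name `rope` is hypothetical.

```python
import cmath
import math
import random


def rope(vec, pos, base=10000.0):
    """Rotate each pair (vec[2i], vec[2i+1]) by the angle pos * theta_i."""
    d = len(vec)
    out = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)  # per-pair frequency
        z = complex(vec[2 * i], vec[2 * i + 1]) * cmath.exp(1j * pos * theta)
        out.extend((z.real, z.imag))
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


random.seed(0)
D = 8  # toy head dimension (assumption)
q = [random.gauss(0.0, 1.0) for _ in range(D)]
k = [random.gauss(0.0, 1.0) for _ in range(D)]

# Same offset (3) at two different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
# A different offset (2), for contrast.
score_other = dot(rope(q, 5), rope(k, 3))

assert math.isclose(score_near, score_far, abs_tol=1e-9)
print(score_near, score_far, score_other)
```

The first two scores agree because rotating q by position m and k by position n leaves a net rotation of (m - n) * theta_i inside each pair's contribution to the dot product, so only the offset survives.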
What this phase intentionally does NOT cover
- T5-style relative position biases. One-paragraph mention in theory/02. Not implemented.
- ALiBi. One-paragraph mention. Not implemented (optional Borja extension).
- xPos, NoPE, other 2023+ variants. Out of scope.
- 2D positions (vision transformers). Phase 22 territory if at all.
- Position-aware training. We don't train in Phase 16. Forward-pass and pattern-comparison only.
Phase 16's scope is the three canonical positional encodings and their extrapolation behavior. Nothing more.
Further reading
Optional enrichment, not required to pass the phase.
- 📄 RoFormer: Enhanced Transformer with Rotary Position Embedding, Su et al. · 2021. Introduces RoPE, the positional scheme used by many modern LLMs.
- 📄 Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation (ALiBi), Press, Smith, Lewis · 2021. An extrapolation-friendly alternative to RoPE.