
Phase 16 — Positional Encodings

Requires: 15 — Attention from Scratch
Teaches: positional-encoding · rope · sinusoidal · extrapolation
Jump to any chapter from the phase reference index.

Chapter map

Pre-written per A12; English verb grammar scope per A13. This phase entry exists before Borja begins study. Theory and lab problem statements are stable drafts; solutions are written just-in-time at phase open.

Attention without positions is permutation-equivariant: to the model, I work he and he work I are the same set of tokens in a different order, yet the correct answer to ... he ___ (works) depends crucially on what comes before the slot. This phase adds positional information as a separate component on top of the embedding (or, in RoPE, as a rotation of Q and K). We implement three variants and compare how they extrapolate beyond the training length.
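The permutation-equivariance claim can be checked on the three-token fragment with a few lines of plain Python. This is a minimal sketch, not the lab solution: the 2-d embeddings are made up for illustration, and the Q/K/V projections are assumed to be the identity to keep it short (the property also holds with learned projections, since they act on each token independently).

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head self-attention over a list of token vectors.

    Assumption: identity Q/K/V projections and no positional encoding.
    """
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

# Hypothetical 2-d embeddings for the three tokens.
emb = {"I": [1.0, 0.0], "work": [0.0, 1.0], "he": [0.5, -0.5]}

original = ["I", "work", "he"]
perm = [2, 1, 0]                      # reorders the fragment to "he work I"
permuted = [original[i] for i in perm]

out_original = self_attention([emb[t] for t in original])
out_permuted = self_attention([emb[t] for t in permuted])

# Equivariance: permuting the input rows permutes the output rows the same way,
# so every token gets the same output vector regardless of where it sits.
expected = [out_original[i] for i in perm]
max_diff = max(
    abs(a - b)
    for row_a, row_b in zip(out_permuted, expected)
    for a, b in zip(row_a, row_b)
)
print(max_diff)
```

The printed difference should be at floating-point noise level: each token's output is the same in both orderings, which is exactly why the model cannot tell them apart without positional information.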

Goal

Implement three positional encoding schemes — sinusoidal, learned, RoPE — and compare their behavior. By the end of Phase 16:

Borja can demonstrate permutation-equivariance of attention with a 3-token counter-example (on a verb-grammar fragment like I work he).
All three PE schemes exist in src/minimodel/positional/.
The RoPE relative-position property is verified numerically.
Borja has a written argument for why Phase 17's Mini-GPT should use RoPE (preferred default) or sinusoidal (fallback).
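One way the numerical check of the RoPE relative-position property could look is sketched below. It assumes the common formulation in which consecutive dimension pairs are rotated by an angle of `pos * base ** (-2i / d)` with `base = 10000`; the function name `rope`, the test vectors, and the chosen positions are illustrative and not taken from the lab.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate each consecutive pair (x[2i], x[2i+1]) by pos * base**(-2i/d).

    Assumption: the interleaved-pair convention; other layouts exist.
    """
    d = len(x)
    assert d % 2 == 0, "RoPE needs an even dimension"
    out = []
    for i in range(d // 2):
        theta = pos * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out.extend([a * c - b * s, a * s + b * c])
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Arbitrary query and key vectors (illustrative values).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset (3) at two different absolute positions.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))

# Different offset (2) for contrast.
score_c = dot(rope(q, 5), rope(k, 3))

print(abs(score_a - score_b), abs(score_a - score_c))
```

The first difference should be at floating-point noise level, since the attention score depends only on the offset between the two positions. The second should be clearly nonzero, because the offset changed.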

Phase 16 is a tool-shed phase: three implementations, one comparison experiment, and a decision about which scheme to use downstream. Plan for 8–15 study hours.

Read order

theory/00-motivation.md — permutation-equivariance, the problem we're solving.
theory/01-sinusoidal.md — Vaswani's original PE. Derive the formula.
theory/02-learned-vs-sinusoidal.md — when each wins; why both fail at extrapolation; the gap RoPE/ALiBi fill.
theory/03-rope.md — RoPE derivation, the relative-position property, the implementation trick.
lab/00-permutation-equivariance.md — prove attention without PE doesn't distinguish word order.
lab/01-sinusoidal-pe.md — implement and visualize sinusoidal PE.
lab/02-rope-implementation.md — implement RoPE; verify the relative-position property numerically.
lab/03-extrapolation-compare.md — head-to-head comparison beyond training length.
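As orientation before theory/01, here is a minimal sketch of the sinusoidal table from the original Transformer paper, where even dimensions take `sin(pos / 10000^(2i/d_model))` and odd dimensions take the matching `cos`. The function name `sinusoidal_pe` and its signature are assumptions for illustration, not the interface the lab prescribes.

```python
import math

def sinusoidal_pe(num_positions, d_model, base=10000.0):
    """Return a num_positions x d_model table of sinusoidal encodings.

    Dimension 2i holds sin(pos / base**(2i/d_model)); dimension 2i+1 holds the cos.
    """
    assert d_model % 2 == 0, "this sketch assumes an even d_model"
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model // 2):
            angle = pos / base ** (2.0 * i / d_model)
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table

sin_table = sinusoidal_pe(num_positions=8, d_model=4)
for pos, row in enumerate(sin_table):
    print(pos, [round(v, 4) for v in row])
```

Because the table is a closed-form function of `pos`, it can be evaluated at any position, including ones never seen in training. Whether the model behaves well there is a separate question, and it is the one lab/03 investigates.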

solutions/ is empty during pre-write.

Definition of Done

See PHASE_16_PLAN.md §6. Briefly:

Three PE schemes implemented.
Permutation-equivariance counter-example committed.
RoPE numerical verification passes.
Extrapolation comparison plot committed; a winner picked for Phase 17.
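For the learned scheme, the extrapolation limit is structural: a learned positional embedding is a lookup table with one row per position, so positions past the table size have no row at all. The sketch below illustrates this with a hypothetical `LearnedPE` class. The random initialization stands in for trained weights, since nothing is trained in this phase, and the class is not the interface expected in src/minimodel/positional/.

```python
import random

class LearnedPE:
    """Hypothetical minimal learned positional embedding: one row per position."""

    def __init__(self, max_len, d_model, seed=0):
        rng = random.Random(seed)
        self.max_len = max_len
        # Random values stand in for parameters that training would normally set.
        self.table = [
            [rng.gauss(0.0, 0.02) for _ in range(d_model)]
            for _ in range(max_len)
        ]

    def __call__(self, pos):
        if not 0 <= pos < self.max_len:
            raise IndexError(
                f"position {pos} has no learned row (max_len={self.max_len})"
            )
        return self.table[pos]

learned = LearnedPE(max_len=16, d_model=4)
in_range = learned(15)            # last position covered by the table

try:
    learned(16)                   # first position beyond the table
    out_of_range_ok = True
except IndexError:
    out_of_range_ok = False

print(len(in_range), out_of_range_ok)
```

Any comparison beyond the training length therefore has to decide how to treat the learned variant, for example by reporting it as undefined past `max_len`. That choice is left to the lab.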

What this phase intentionally does NOT cover

T5-style relative position biases. One-paragraph mention in theory/02. Not implemented.
ALiBi. One-paragraph mention. Not implemented (optional Borja extension).
xPos, NoPE, other 2023+ variants. Out of scope.
2D positions (vision transformers). Phase 22 territory if at all.
Position-aware training. We don't train in Phase 16. Forward-pass and pattern-comparison only.

Phase 16's scope is the three canonical positional encodings and their extrapolation behavior. Nothing more.