02 - Learned PE, T5 Biases, ALiBi: When Each Wins and Why None is Enough
Three alternatives to sinusoidal PE: learn the PE as a matrix (wins at interpolation, loses at extrapolation); bias the attention scores by distance (T5, ALiBi: cheap and reasonable); and RoPE, which combines the best of several worlds. This page justifies why Phase 17 picks RoPE over the others.
This file answers "how do PE schemes compare". Four sections: learned embeddings, T5-style attention biases, ALiBi, and the extrapolation problem that separates them and motivates RoPE.
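Section 1 - Learned positional embeddings
Allocate a trainable matrix \(E_p \in \mathbb{R}^{T_\text{max} \times d}\). At position \(p\), the PE is \(E_p[p]\), an indexed lookup just like the token embedding. Add it to the token embedding:
\[ x_p = E_\text{tok}[w_p] + E_p[p] \]
where \(E_\text{tok}\) is the token embedding table and \(w_p\) is the token at position \(p\).
What it buys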
- Maximum flexibility. The model can learn any positional structure that helps the task. No hard-coded sinusoidal assumption.
- Empirically about the same as sinusoidal on in-distribution sequences, sometimes slightly better (in the original Vaswani et al. comparison, the two were within 0.1 BLEU).
What it costs
- Fixed maximum length. \(T_\text{max}\) is baked into the architecture. A sequence of length \(T > T_\text{max}\) has to be truncated, because the extra positions have no defined embedding.
- No extrapolation by construction. The table has rows only for positions \(0, \ldots, T_\text{max} - 1\), and the model has seen training data only up to that length. A lookup at position \(T_\text{max}\) or beyond is out of range; if the table is allocated larger than the training length, the extra rows keep their initial values and whatever the model does with them is uncontrolled.
- More parameters. \(T_\text{max} \cdot d\) extra. For \(T_\text{max} = 4096, d = 1024\), that's about 4.2M extra parameters. Not huge in absolute terms, but it scales with context length.
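A minimal NumPy sketch of the lookup and its hard length limit. The toy sizes and the names `E_tok`, `E_pos`, and `embed` are assumptions for illustration (`E_pos` plays the role of \(E_p\) above), and the tables are random stand-ins for trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)

T_max, d, vocab = 16, 8, 50            # toy sizes, assumed for illustration
E_tok = rng.normal(size=(vocab, d))    # token embedding table
E_pos = rng.normal(size=(T_max, d))    # positional table, one row per position

def embed(token_ids):
    """Token embedding plus learned positional embedding for each position."""
    positions = np.arange(len(token_ids))
    return E_tok[token_ids] + E_pos[positions]   # IndexError if len > T_max

x = embed(np.array([3, 1, 4, 1, 5]))
print(x.shape)                          # (5, 8)

# Parameter cost of the positional table at the sizes quoted above.
print(4096 * 1024)                      # 4194304

# One position too many: there is no row T_max to look up.
try:
    embed(np.zeros(T_max + 1, dtype=int))
except IndexError as err:
    print("out of range:", err)
```

In NumPy the out-of-range lookup raises, so the failure is loud. The quieter case is a table allocated larger than the longest training sequence: the lookup succeeds but returns rows that were never trained.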
Where it's used
- BERT and GPT-2 use learned PE.
- Modern LLMs (LLaMA, Mistral, and presumably GPT-4, whose architecture is unpublished) have moved away from it.
The pattern is: learned PE is fine for fixed-context-length models. For models that need to extrapolate to longer contexts than they were trained on, it fails.
When to choose learned PE: when your maximum context is fixed and known at architecture time (BERT, short-context models). When not to: whenever you want to extend the context at inference.
Section 2 - T5-style relative position biases
The T5 paper (Raffel et al., 2020) replaced PE entirely with a learned bias on the attention scores:
\[ \text{score}_{ij} = q_i \cdot k_j + b(|i - j|) \]
where \(b: \mathbb{N} \to \mathbb{R}\) is a learned function of the relative position. In practice, \(b\) is a piecewise constant function with a few dozen "bucket" parameters.
What it buys
- Position information is purely relative: only \(|i - j|\) matters, not absolute position.
- No modification to Q, K, V. Cheap to apply.
- Bucketing gives a small number of trainable parameters (~32 buckets).
What it costs
- The model doesn't know absolute position. For tasks where absolute position matters (e.g., "the first token is [CLS]"), the model loses that signal.
- Bucketing is coarse. Small distances get a bucket each, but larger ones are grouped on a log scale: distances 100 and 101 fall in the same bucket, and every distance past the last bucket boundary shares one bucket. The model gets the same attention bias for every distance within a bucket.
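To make the bucketing concrete, here is a simplified sketch in the spirit of T5's scheme, for the causal case only (distances \(i - j \ge 0\)). The function name `relative_bucket` and the defaults of 32 buckets and a maximum distance of 128 are assumptions chosen to resemble commonly cited T5 settings. Treat it as an illustration, not the reference implementation.

```python
import numpy as np

def relative_bucket(distance, num_buckets=32, max_distance=128):
    """Map non-negative distances i - j to bucket indices in [0, num_buckets)."""
    distance = np.asarray(distance)
    max_exact = num_buckets // 2
    # Log-spaced branch for distance >= max_exact. np.maximum avoids log(0);
    # the entries it touches are overwritten by the exact branch below.
    safe = np.maximum(distance, 1)
    scaled = np.log(safe / max_exact) / np.log(max_distance / max_exact)
    log_bucket = max_exact + (scaled * (num_buckets - max_exact)).astype(int)
    log_bucket = np.minimum(log_bucket, num_buckets - 1)
    # Exact branch: distances below max_exact get one bucket each.
    return np.where(distance < max_exact, distance, log_bucket)

buckets = relative_bucket([1, 2, 100, 101, 500])
print(buckets)  # buckets 1, 2, 30, 30, 31

# Bias matrix for one head: look up one learned scalar per bucket.
rng = np.random.default_rng(0)
b = rng.normal(size=32)              # stand-in for trained bucket parameters
T = 6
i = np.arange(T)[:, None]
j = np.arange(T)[None, :]
dist = np.maximum(i - j, 0)          # causal: entries with j > i are masked anyway
bias = b[relative_bucket(dist)]      # shape (T, T), added to the scores
```

Under these assumed defaults, distances 1 and 2 land in different buckets, 100 and 101 share one, and anything at or beyond the maximum distance falls into the last bucket.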
Where it's used
- T5 family.
- Not widely used in pure decoder LLMs after ~2022.
We mention this and move on. The bucketing is a Phase 22-or-never topic.
Section 3 - ALiBi (Attention with Linear Biases)
Press et al. (2022) proposed: drop PE entirely. Add a linear distance penalty to attention scores:
\[ \text{score}_{ij} = q_i \cdot k_j - m \, |i - j| \]
where \(m\) is a per-head slope (each head has its own, fixed rather than learned, set as \(m_h = 1 / 2^{8 h / H}\) for heads \(h = 1, \ldots, H\)).
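A short NumPy sketch of the slopes and the resulting bias, assuming heads are numbered \(h = 1, \ldots, H\) and the penalty is \(-m_h \, |i - j|\). The function names `alibi_slopes` and `alibi_bias` are made up for this example.

```python
import numpy as np

def alibi_slopes(H):
    """Per-head slopes m_h = 2^(-8h/H) for h = 1..H."""
    h = np.arange(1, H + 1)
    return 2.0 ** (-8.0 * h / H)

def alibi_bias(T, H):
    """Bias of shape (H, T, T) with entries -m_h * |i - j|."""
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    dist = np.abs(i - j)
    return -alibi_slopes(H)[:, None, None] * dist

slopes = alibi_slopes(8)
print(slopes[0], slopes[-1])     # 0.5 0.00390625

bias_short = alibi_bias(4, 8)    # "training" length
bias_long = alibi_bias(16, 8)    # longer context, same function of distance
print(np.array_equal(bias_long[:, :4, :4], bias_short))   # True
```

The last line is the point: the bias depends only on the distance and the head, so nothing has to be resized or retrained when the context grows. In a causal model the entries with \(j > i\) are masked out before the softmax.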
What it buys
- Extrapolation property. The bias is defined for any distance, so ALiBi runs on contexts longer than the training length without retraining. Train on \(T = 1024\), infer on \(T = 8192\): the attention pattern stays well-behaved.
- No PE matrix to maintain.
- Each head gets a different fixed "decay rate": with \(H = 8\), head 1 (slope \(1/2\)) sees mostly nearby tokens, head 8 (slope \(1/256\)) sees far.
What it costs
- The bias is fixed and monotone. No oscillation; just a linear decay. Less expressive than RoPE's rotation.
- No absolute position. Same as T5-style biases.
- Empirically slightly behind RoPE on benchmarks where attention patterns are complex.
Where it's used
- BLOOM, MPT, some smaller open models.
- Not the modal choice in 2024+ LLMs (RoPE won).
We mention ALiBi and don't implement it. If Borja is curious post-Phase 16, it's ~20 lines of NumPy.
Section 4 - The common failure: extrapolation
The schemes covered so far (sinusoidal, learned, T5 biases, ALiBi) behave differently once the context exceeds the training length:
- Learned PE: silent failure. Positions beyond \(T_\text{max}\) have no trained embedding, and nothing in the model flags that.
- Sinusoidal PE: soft failure. The PE values are well-defined for any position, but the model has only been trained on contexts up to the training length. The attention patterns at \(T = 2048\) when trained on \(T = 512\) are unstable.
- T5 biases: soft failure. Bucketing extends to any distance, but the model never attended over distances that long during training.
- ALiBi: doesn't fail. The linear decay extrapolates by construction. In Press et al.'s experiments, a model trained on \(T = 1024\) kept working when evaluated on considerably longer inputs.
RoPE is also well-defined at any position by construction, in the same sense. We derive that in theory/03-rope.md.
The 2022–2024 consensus: for autoregressive LLMs, the two viable choices are RoPE and ALiBi. RoPE has slightly better empirical performance and is the default in LLaMA / Mistral / GPT-4 (inferred); ALiBi is simpler and still in use.
Why Phase 17 picks RoPE
Three reasons:
- Extrapolation by design. Same as ALiBi.
- Multiplicative interaction (not additive bias). RoPE rotates Q and K, modifying the geometry of the score itself rather than adding a term that ignores the content of Q and K. This is empirically more expressive.
- It's the modern default. Borja sees RoPE in every modern LLM codebase. Implementing it in Phase 16 is exactly the right preparation.
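A toy check of the second reason, assuming a single 2-D pair and one rotation frequency (`theta = 0.1` is an arbitrary choice, and `rotate` is a hypothetical helper). The full multi-frequency version is what theory/03-rope.md derives; this only shows that rotating both vectors makes the score depend on the offset \(i - j\) and on the content of q and k together.

```python
import numpy as np

def rotate(v, pos, theta=0.1):
    """Rotate a 2-D vector by the angle pos * theta."""
    a = pos * theta
    R = np.array([[np.cos(a), -np.sin(a)],
                  [np.sin(a),  np.cos(a)]])
    return R @ v

q = np.array([1.0, 2.0])
k = np.array([0.5, -1.0])

# Same offset i - j = 3 at two very different absolute positions.
s_near = rotate(q, 5) @ rotate(k, 2)
s_far = rotate(q, 105) @ rotate(k, 102)
print(np.isclose(s_near, s_far))      # True

# A different offset gives a different score for the same q and k.
s_other = rotate(q, 5) @ rotate(k, 4)
print(np.isclose(s_near, s_other))    # False
```

An additive bias would shift every score at a given distance by the same amount, whatever q and k contain. Here the change with distance depends on the vectors themselves.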
The fallback if RoPE proves too complex: sinusoidal (not learned). Sinusoidal is parameter-free, well-understood, and we've already derived it.
Summary table
| Scheme | Position info | Extrapolates? | Params | Modern usage |
|---|---|---|---|---|
| Sinusoidal | Added pre-attn | Limited | 0 | Some |
| Learned | Added pre-attn | No (silent fail) | \(T_\text{max} \cdot d\) | Older models |
| T5 biases | Added to scores | Bucketed, limited | ~32 / head | T5 family |
| ALiBi | Added to scores | Yes (linear) | 0 (\(H\) fixed slopes) | BLOOM, MPT |
| RoPE | Rotates Q, K | Yes (oscillating) | 0 | LLaMA, Mistral, modern |
Phase 17 picks RoPE. Phase 16's lab compares sinusoidal vs RoPE explicitly; learned PE is implemented but only as a reference baseline.
Decision: Phase 17 uses RoPE, with sinusoidal as the fallback if integrating it with multi-head attention turns out to be too complicated. Learned PE appears only as a baseline in the lab; it does not reach the transformer.
A nuance about extrapolation
"Extrapolates" is a vague claim. More precise: RoPE and ALiBi maintain a well-defined attention pattern at positions beyond training length. This doesn't mean the model's predictions are good at those positions — only that the attention mechanism itself doesn't break.
Models trained on \(T = 1024\) with RoPE generally do degrade in quality past \(T = 2048\) (their internal representations weren't tuned for very long contexts). True long-context performance requires either (a) training on long contexts, or (b) post-hoc adaptations such as position interpolation or NTK-aware scaling. Phase 16 isn't the place for those; they are mentioned so Borja knows the terms.
Next: 03-rope.md.