02 — Learned PE, T5 Biases, ALiBi: When Each Wins and Why None is Enough

In short: three alternatives to sinusoidal. Learn the PE as a matrix (wins at interpolation, loses at extrapolation); bias the attention scores by distance (T5, ALiBi: cheap and reasonable); or use RoPE, which combines the best of several worlds. This page justifies why Phase 17 picks RoPE over the others.

This file is the "how do PE schemes compare" file. Four sections: learned embeddings, T5 relative biases, ALiBi, and the extrapolation failure mode that motivates RoPE.


Section 1 — Learned positional embeddings

Allocate a trainable matrix \(E_p \in \mathbb{R}^{T_\text{max} \times d}\). At position \(p\), the PE is \(E_p[p]\) — an indexed lookup, just like token embedding. Add to the token embedding:

\[ \tilde{x}_p = E[t_p] + E_p[p] \]
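The lookup-and-add above can be sketched in a few lines of NumPy. This is a minimal illustration, not a training setup: the sizes (`vocab_size`, `T_max`, `d`) and the helper name `embed` are assumptions made for the example, and both tables are filled with random values where a real model would learn them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not values from the text).
vocab_size, T_max, d = 50, 16, 8

# Both tables would be trained by gradient descent; here they are random.
E = rng.normal(size=(vocab_size, d))   # token embedding table
E_p = rng.normal(size=(T_max, d))      # learned positional table

def embed(token_ids):
    """Return x_tilde[p] = E[t_p] + E_p[p] for every position p."""
    token_ids = np.asarray(token_ids)
    positions = np.arange(len(token_ids))
    return E[token_ids] + E_p[positions]

tokens = [3, 14, 15, 9, 2]
x_tilde = embed(tokens)
print(x_tilde.shape)  # (5, 8)
```

Both lookups are plain row indexing, which is why the positional table behaves exactly like a second embedding layer.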

What it buys

  • Maximum flexibility. The model can learn any positional structure that helps the task. No hard-coded sinusoidal assumption.
  • Empirically on par with sinusoidal on in-distribution sequences (in the original Vaswani comparison, the two were within 0.1 BLEU).

What it costs

  • Fixed maximum length. \(T_\text{max}\) is baked into the architecture. Generating a sequence of length \(T > T_\text{max}\) requires either truncation or undefined behavior.
  • No extrapolation by construction. The model has seen training data only up to length \(T_\text{max}\). Positions \(T_\text{max}, T_\text{max} + 1, \ldots\) have no embedding rows at all; if the table is simply enlarged, the new rows are untrained. Either way, whatever the model does with them is uncontrolled.
  • More parameters. \(T_\text{max} \cdot d\) extra. For \(T_\text{max} = 4096, d = 1024\), that's about 4.2M extra parameters. Not huge in absolute terms, but it scales with context length.
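The two costs above are easy to check numerically. The sketch below computes the parameter count for the sizes quoted in the list, then shows what happens when a sequence is one token longer than a toy table (`small_T_max = 8` is an assumption for the example). NumPy raises an `IndexError` here; other frameworks may behave differently.

```python
import numpy as np

# Parameter cost for the sizes quoted in the text.
T_max, d = 4096, 1024
extra_params = T_max * d
print(extra_params)  # 4194304

# A toy table to show the hard length limit (sizes are illustrative).
small_T_max, small_d = 8, 4
E_p = np.zeros((small_T_max, small_d))

def positional_rows(seq_len):
    """Look up the positional rows for positions 0 .. seq_len - 1."""
    return E_p[np.arange(seq_len)]

rows = positional_rows(small_T_max)        # fits exactly
try:
    positional_rows(small_T_max + 1)       # one position too many
    failed = False
except IndexError:
    failed = True
print(failed)  # True
```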

Where it's used

  • BERT, GPT-2 used learned PE.
  • Modern LLMs (LLaMA, Mistral, and by inference GPT-4) have moved away from it.

The pattern is: learned PE is fine for fixed-context-length models. For models that need to extrapolate to longer contexts than they were trained on, it fails.

When to choose learned PE: if your maximum context is fixed and known at architecture time (BERT, short models). When not to: whenever you want to extend the context at inference.

Section 2 — T5-style relative position biases

The T5 paper (Raffel et al., 2020) replaced PE entirely with a learned bias on the attention scores:

\[ S'_{ij} = Q_i \cdot K_j + b(|i - j|) \]

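where \(b: \mathbb{N} \to \mathbb{R}\) is a learned function of the relative position. In practice, \(b\) is a piecewise constant function with a few dozen "bucket" parameters per head. The absolute-value form is a simplification: in T5's bidirectional encoder the bias depends on the signed offset, so "before" and "after" get separate buckets.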

What it buys

  • Position information is purely relative: only \(|i - j|\) matters, not absolute position.
  • No modification to Q, K, V. Cheap to apply.
  • Bucketing gives a small number of trainable parameters (~32 buckets).

What it costs

  • The model doesn't know absolute position. For tasks where absolute position matters (e.g., "the first token is [CLS]"), the model loses that signal.
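  • Bucketing is coarse at long range. Small offsets each get their own bucket, but larger offsets share logarithmically spaced buckets: offsets 100 and 101 fall in the same bucket, and every offset past the maximum distance shares the last one. The model gives all offsets in a bucket the same attention bias.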
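To make the bucketing concrete, here is a sketch of a T5-style bucket function for non-negative distances, modeled on the commonly described scheme: the first half of the buckets are exact, the second half are log-spaced up to a maximum distance. The defaults (`num_buckets=32`, `max_distance=128`), the single-head bias vector `b`, and the function name are assumptions for illustration; it follows this page's simplified \(|i - j|\) form rather than T5's signed variant.

```python
import numpy as np

def bucket(distance, num_buckets=32, max_distance=128):
    """Map non-negative distances to bucket indices (T5-style sketch)."""
    distance = np.asarray(distance)
    max_exact = num_buckets // 2
    # Log-spaced buckets for distances >= max_exact.
    safe = np.maximum(distance, 1)  # avoid log(0); small values are overridden below
    scaled = np.log(safe / max_exact) / np.log(max_distance / max_exact)
    log_bucket = max_exact + (scaled * (num_buckets - max_exact)).astype(np.int64)
    log_bucket = np.minimum(log_bucket, num_buckets - 1)
    # Exact buckets for small distances.
    return np.where(distance < max_exact, distance, log_bucket)

near = bucket(np.array([1, 2]))        # distinct buckets
far = bucket(np.array([100, 101]))     # shared bucket
beyond = int(bucket(10_000))           # saturates at the last bucket

# Biased scores S'_ij = Q_i . K_j + b(|i - j|) for one head.
rng = np.random.default_rng(0)
T, d_head = 6, 4
Q = rng.normal(size=(T, d_head))
K = rng.normal(size=(T, d_head))
b = rng.normal(size=32)                # learned in a real model; random here
pos = np.arange(T)
dist = np.abs(pos[:, None] - pos[None, :])
scores = Q @ K.T + b[bucket(dist)]
print(near, far, beyond)
```

Under these assumed defaults, distances below 16 keep full resolution, and everything from 128 upward lands in the same final bucket.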

Where it's used

  • T5 family.
  • Not widely used in pure decoder LLMs after ~2022.

We mention this and move on. The bucketing is a Phase 22-or-never topic.

Section 3 — ALiBi (Attention with Linear Biases)

Press et al. (2022) proposed: drop PE entirely. Add a linear distance penalty to attention scores:

\[ S'_{ij} = Q_i \cdot K_j - m \cdot |i - j| \]

where \(m\) is a per-head slope. The slopes are fixed, not learned: head \(h \in \{1, \ldots, H\}\) uses \(m_h = 1 / 2^{8 h / H}\).
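A minimal NumPy sketch of the slope formula and the resulting bias tensor follows. The function names and the toy shapes are assumptions for the example; the causal mask is omitted to keep the focus on the bias itself.

```python
import numpy as np

def alibi_slopes(H):
    """Slopes m_h = 1 / 2^(8h/H) for heads h = 1 .. H."""
    h = np.arange(1, H + 1)
    return 1.0 / 2.0 ** (8.0 * h / H)

def alibi_bias(T, H):
    """Bias of shape (H, T, T) with entries -m_h * |i - j|."""
    pos = np.arange(T)
    dist = np.abs(pos[:, None] - pos[None, :])          # (T, T)
    return -alibi_slopes(H)[:, None, None] * dist       # (H, T, T)

H, T, d_head = 8, 5, 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(H, T, d_head))
K = rng.normal(size=(H, T, d_head))

slopes = alibi_slopes(H)
bias = alibi_bias(T, H)
scores = Q @ K.transpose(0, 2, 1) + bias                # S'_ij per head
print(slopes[0], slopes[-1])  # 0.5 0.00390625
```

Nothing here is trainable: the whole bias tensor is a function of `T` and `H` alone, so it can be recomputed for any sequence length.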

What it buys

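  • Extrapolation property. The bias is defined for any distance, so a model trained on \(T = 1024\) can run at longer lengths such as \(T = 8192\) without retraining and without the attention pattern breaking. How good the predictions are out there is a separate question (see the nuance at the end of this page).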
  • No PE matrix to maintain.
  • Each head gets a different decay rate. With \(H = 8\), head 1 (slope 1/2) attends mostly to nearby tokens and head 8 (slope 1/256) reaches far.
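The "near versus far" behavior of the heads can be seen by zeroing the content scores and looking at the attention weights that the bias alone produces. The sketch below does this for the last query position of a causal sequence; `T = 64` and the "nearest 4 tokens" window are arbitrary choices for illustration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

H, T = 8, 64
slopes = 1.0 / 2.0 ** (8.0 * np.arange(1, H + 1) / H)

# Distances from the last query (position T - 1) to keys 0 .. T - 1.
dist = (T - 1) - np.arange(T)

# With Q.K = 0 everywhere, the weights come from the ALiBi bias alone.
w_head1 = softmax(-slopes[0] * dist)       # steepest slope
w_head8 = softmax(-slopes[-1] * dist)      # shallowest slope

near_head1 = w_head1[-4:].sum()            # mass on the 4 nearest tokens
near_head8 = w_head8[-4:].sum()
print(near_head1, near_head8)
```

Head 1 puts most of its weight on the last few tokens, while head 8 spreads it almost uniformly over the window. Real content scores shift these weights, but the bias sets the default reach of each head.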

What it costs

  • The bias is fixed and monotone. No oscillation; just a linear decay. Less expressive than RoPE's rotation.
  • No absolute position. Same as T5-style biases.
  • Empirically slightly behind RoPE on benchmarks where attention patterns are complex.

Where it's used

  • BLOOM, MPT, some smaller open models.
  • Not the modal choice in 2024+ LLMs (RoPE won).

We mention ALiBi and don't implement it. If Borja is curious post-Phase 16, it's ~20 lines of NumPy.

Section 4 — The common failure: extrapolation

The schemes covered so far (sinusoidal, learned, T5 biases, ALiBi) behave differently beyond the training length, and all but ALiBi fail in some way:

  1. Learned PE: silent failure. Positions beyond \(T_\text{max}\) have undefined embeddings.
  2. Sinusoidal PE: soft failure. The PE values are well-defined for any position, but the model has only seen training-length contexts during training. The attention patterns at \(T = 2048\) when trained on \(T = 512\) are unstable.
  3. T5 biases: soft failure. Bucketing extends, but the model didn't see the longer-distance buckets during training.
  4. ALiBi: doesn't fail in this sense. The linear decay is the same function of distance at every length, so it extrapolates by construction. Press et al. report that models trained on \(T = 1024\) keep working on substantially longer inputs.
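Two of the claims in this list can be checked directly. The sketch below assumes the standard sinusoidal formula from the original Transformer paper, with base 10000, and a single ALiBi head with slope 0.5. It shows that a sinusoidal PE is well-defined far past any training length, and that the ALiBi bias for a short sequence is exactly the top-left block of the bias for a longer one. Neither check says anything about prediction quality at those lengths.

```python
import numpy as np

def sinusoidal_pe(position, d):
    """Standard sinusoidal PE for one position (base 10000 assumed)."""
    i = np.arange(d // 2)
    angles = position / 10000.0 ** (2 * i / d)
    pe = np.empty(d)
    pe[0::2] = np.sin(angles)
    pe[1::2] = np.cos(angles)
    return pe

def alibi_bias_one_head(T, m):
    """ALiBi bias -m * |i - j| for a single head."""
    pos = np.arange(T)
    return -m * np.abs(pos[:, None] - pos[None, :])

# Sinusoidal: defined and bounded at any position, trained or not.
pe_far = sinusoidal_pe(2048, d=16)
bounded = bool(np.all(np.abs(pe_far) <= 1.0))

# ALiBi: the short-sequence bias is a prefix of the long-sequence bias.
short_bias = alibi_bias_one_head(8, m=0.5)
long_bias = alibi_bias_one_head(32, m=0.5)
same_prefix = bool(np.array_equal(long_bias[:8, :8], short_bias))
print(bounded, same_prefix)  # True True
```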

RoPE is also well-defined at any position by construction, so it does not fail in the learned-PE sense. We derive that in theory/03-rope.md.

The 2022–2024 consensus: for autoregressive LLMs, the two viable choices are RoPE and ALiBi. RoPE has slightly better empirical performance and is the default in LLaMA / Mistral / GPT-4 (inferred); ALiBi is simpler and still in use.

Why Phase 17 picks RoPE

Three reasons:

  1. Extrapolation by design. Same as ALiBi.
  2. Multiplicative interaction (not additive bias). RoPE rotates Q and K, modifying the geometry of the score, not just adding a constant. This is empirically more expressive.
  3. It's the modern default. Borja sees RoPE in every modern LLM codebase. Implementing it in Phase 16 is exactly the right preparation.
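The "rotates Q and K" point can be previewed with a single 2-D pair before the full derivation in 03-rope.md. In this sketch the angle step `theta = 0.1` and the vectors are arbitrary, and `rope_score` is a hypothetical helper name. It shows that the score between a rotated query and key depends only on their relative offset, and that the position effect interacts with the content of `q` and `k` instead of adding a constant.

```python
import numpy as np

def rotate(v, angle):
    """Rotate a 2-D vector counter-clockwise by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([c * v[0] - s * v[1], s * v[0] + c * v[1]])

theta = 0.1                      # angle per position step (illustrative)
q = np.array([1.0, 2.0])
k = np.array([0.5, -1.0])

def rope_score(q, k, p_q, p_k):
    """Dot product after rotating q and k by their positions."""
    return rotate(q, p_q * theta) @ rotate(k, p_k * theta)

score_a = rope_score(q, k, 5, 2)       # offset 3, early in the sequence
score_b = rope_score(q, k, 105, 102)   # offset 3, much later
score_c = rope_score(q, k, 5, 1)       # offset 4
print(np.isclose(score_a, score_b), np.isclose(score_a, score_c))  # True False
```

A full RoPE layer applies this to every consecutive pair of dimensions with a different `theta` per pair; that part is left to the derivation file.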

The fallback if RoPE proves too complex: sinusoidal (not learned). Sinusoidal is parameter-free, well-understood, and we've already derived it.

Summary table

| Scheme | Position info | Extrapolates? | Params | Modern usage |
| --- | --- | --- | --- | --- |
| Sinusoidal | Added pre-attn | Limited | 0 | Some |
| Learned | Added pre-attn | No (silent fail) | \(T_\text{max} \cdot d\) | Older models |
| T5 biases | Added to scores | Bucketed, limited | ~32 / head | T5 family |
| ALiBi | Added to scores | Yes (linear) | 0 (\(H\) fixed slopes) | BLOOM, MPT |
| RoPE | Rotates Q, K | Yes (oscillating) | 0 | LLaMA, Mistral, modern |

Phase 17 picks RoPE. Phase 16's lab compares sinusoidal vs RoPE explicitly; learned PE is implemented but only as a reference baseline.

Decision: Phase 17 uses RoPE, with sinusoidal as the fallback if integrating it with multi-head attention proves too complicated. Learned PE appears only as a baseline in the lab; it does not make it into the transformer.

A nuance about extrapolation

"Extrapolates" is a vague claim. More precise: RoPE and ALiBi maintain a well-defined attention pattern at positions beyond training length. This doesn't mean the model's predictions are good at those positions — only that the attention mechanism itself doesn't break.

Models trained on \(T = 1024\) with RoPE generally degrade in quality once the context runs well past the training length, because their internal representations weren't tuned for very long contexts. True long-context performance requires either (a) training on long contexts or (b) post-hoc adaptations such as position interpolation or NTK-aware scaling. Phase 16 isn't the place for those; they are mentioned so Borja knows the terms.


Next: 03-rope.md.