03 - Rotary Position Embedding (RoPE)

In short: RoPE rotates the dimension pairs \((2k, 2k+1)\) of Q and K by an angle that depends on position. The magic property: the dot product of two rotated vectors \(\langle R_m q, R_n k \rangle\) depends on position only through the difference \(n - m\). The model therefore sees relative position without having to encode it explicitly. That property is the reason RoPE won out.

This file derives RoPE, proves its key property, and shows the vectorized implementation that production code uses.

Setup

We want a positional encoding scheme with two properties:

1. The attention score \(Q_i \cdot K_j\) depends on position only through \(i - j\), so relative-position information emerges naturally.
2. No extra parameters. PE should be a deterministic function of position.

Sinusoidal PE satisfies (2) but only weakly satisfies (1). Learned PE satisfies neither. RoPE satisfies both, by multiplying rather than adding.

The 2D case (intuition)

Take a single 2D pair \((q_0, q_1)\). Rotate by angle \(\theta\) using the standard 2D rotation matrix:

\[ R_\theta \begin{pmatrix} q_0 \\ q_1 \end{pmatrix} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} q_0 \\ q_1 \end{pmatrix} = \begin{pmatrix} q_0 \cos\theta - q_1 \sin\theta \\ q_0 \sin\theta + q_1 \cos\theta \end{pmatrix} \]

This is a rotation by \(\theta\) in the \((q_0, q_1)\) plane. It preserves \(\|q\|\) (orthogonal transformation).

Now apply rotations parameterized by position: at position \(m\), rotate \(q\) by \(m \omega\) (some frequency \(\omega\)). At position \(n\), rotate \(k\) by \(n \omega\).

The inner product of the rotated vectors:

\[ \langle R_{m\omega} q, R_{n\omega} k \rangle \]

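Use a key property of rotations: \(\langle R_\alpha u, R_\beta v \rangle = \langle u, R_{\beta - \alpha} v \rangle\). This follows because \(\langle R_\alpha u, R_\beta v \rangle = \langle u, R_\alpha^\top R_\beta v \rangle\), the transpose of a rotation is its inverse (\(R_\alpha^\top = R_{-\alpha}\)), and 2D rotations compose by adding angles.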

Therefore:

\[ \boxed{\; \langle R_{m\omega} q, R_{n\omega} k \rangle = \langle q, R_{(n - m)\omega} k \rangle \;} \]

The score depends only on the difference \(n - m\), not on \(m\) or \(n\) individually. That's the magic.
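The boxed identity is easy to check numerically. The following is a minimal sketch, assuming an arbitrary frequency `omega = 0.3` and random 2D vectors; the helper names `rot2d` and `score` are illustrative and not part of the course code.

```python
import numpy as np

def rot2d(theta: float) -> np.ndarray:
    """Standard 2D rotation matrix for angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

rng = np.random.default_rng(0)
q, k = rng.standard_normal(2), rng.standard_normal(2)
omega = 0.3  # arbitrary frequency, chosen only for illustration

def score(m: int, n: int) -> float:
    """Inner product of q rotated to position m and k rotated to position n."""
    return float((rot2d(m * omega) @ q) @ (rot2d(n * omega) @ k))

# Same offset n - m = 2 at very different absolute positions.
s_near = score(5, 7)
s_far = score(105, 107)
# Right-hand side of the boxed identity: only k is rotated, by (n - m) * omega.
s_ref = float(q @ (rot2d(2 * omega) @ k))
```

All three values should agree up to floating-point error, since each one depends only on the offset of 2.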

Stacking across dimensions

Real Q and K vectors are not 2D; they are \(d_\text{head}\)-dimensional. Apply the rotation independently to each consecutive 2D pair \((2k, 2k+1)\), each at its own frequency (here \(k\) is the pair index, not the key vector):

\[ \omega_k = \frac{1}{10000^{2k/d_\text{head}}} \quad \text{for } k = 0, 1, \ldots, d_\text{head}/2 - 1 \]

(Same frequency schedule as sinusoidal — geometric.)
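As a quick illustration, the schedule can be computed in one line of NumPy. This is a sketch only; the function name `rope_frequencies_sketch` and the even-`d_head` check are assumptions made for the example.

```python
import numpy as np

def rope_frequencies_sketch(d_head: int, base: float = 10000.0) -> np.ndarray:
    """omega_k = base^(-2k / d_head) for k = 0, ..., d_head/2 - 1."""
    if d_head % 2 != 0:
        raise ValueError("d_head must be even so dimensions can be paired")
    k = np.arange(d_head // 2)
    return base ** (-2.0 * k / d_head)

omega = rope_frequencies_sketch(16)
# Consecutive frequencies differ by a constant factor: the schedule is geometric.
ratios = omega[1:] / omega[:-1]
```

The first pair rotates fastest (\(\omega_0 = 1\) radian per position) and each later pair rotates more slowly by the same constant factor.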

For position \(p\):

\[ R_p = \text{block-diag}(R_{p \omega_0}, R_{p \omega_1}, \ldots, R_{p \omega_{d_\text{head}/2 - 1}}) \]

— a block-diagonal matrix of 2D rotations, each at its own frequency.

The full RoPE-rotated Q (at position \(p\)):

\[ Q^{\text{rope}}_p = R_p Q_p \]

Similarly for K. Apply attention to \((Q^{\text{rope}}, K^{\text{rope}}, V)\) — V is not rotated.

The key property generalizes: each pair contributes a relative-position-dependent term, and the sum across pairs gives a function of \(n - m\) only.
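To see why the property carries over, it can help to build \(R_p\) densely once and check that \(R_m^\top R_n = R_{n-m}\). The sketch below does that for a small \(d_\text{head}\); `rope_matrix` is a hypothetical helper intended for checking only, not something production code would use.

```python
import numpy as np

def rope_matrix(p: float, d_head: int, base: float = 10000.0) -> np.ndarray:
    """Dense block-diagonal R_p built from 2x2 rotations (checking aid only)."""
    omega = base ** (-2.0 * np.arange(d_head // 2) / d_head)
    R = np.zeros((d_head, d_head))
    for k, w in enumerate(omega):
        c, s = np.cos(p * w), np.sin(p * w)
        R[2 * k : 2 * k + 2, 2 * k : 2 * k + 2] = [[c, -s], [s, c]]
    return R

d_head = 8
R5, R7, R2 = (rope_matrix(p, d_head) for p in (5, 7, 2))

is_orthogonal = np.allclose(R5.T @ R5, np.eye(d_head))  # norm-preserving
composes = np.allclose(R5.T @ R7, R2)                   # R_m^T R_n = R_{n-m}
```

Because each block satisfies the 2D identity and the blocks do not interact, the full matrix satisfies it too.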

Why V is not rotated

V carries the values to be returned. Rotating V would mean "the value at position \(p\) depends on \(p\) in a structured way" — but V is supposed to be content, not position. By keeping V un-rotated, we preserve the property that routing depends on position (Q-K interaction) but content delivered is position-independent (V).

This separation is what makes RoPE clean: position is part of the attention mechanism, not part of the data being routed.
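One way to see the separation in action is a tiny single-head attention where only Q and K are rotated. The sketch below assumes unmasked softmax attention and the adjacent-pair convention; `rope` and `attention` are illustrative names. Shifting every position by the same amount leaves the output unchanged, because the scores see only offsets and V never sees position at all.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate adjacent pairs (2k, 2k+1) of x, shape (T, d_head)."""
    d_head = x.shape[-1]
    omega = base ** (-2.0 * np.arange(d_head // 2) / d_head)
    ang = np.asarray(positions, dtype=float)[:, None] * omega[None, :]
    cos, sin = np.cos(ang), np.sin(ang)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out

def attention(q, k, v, positions):
    """Single-head softmax attention; RoPE on q and k only, v untouched."""
    q_r, k_r = rope(q, positions), rope(k, positions)
    scores = q_r @ k_r.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
T, d_head = 6, 8
q, k, v = (rng.standard_normal((T, d_head)) for _ in range(3))

out_a = attention(q, k, v, np.arange(T))
out_b = attention(q, k, v, np.arange(T) + 1000)  # same offsets, shifted positions
```

Here `out_a` and `out_b` should match up to floating-point error.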

Why V is not rotated, in one line: the rotation is for attention (deciding whom to look at), not for content (what to take away). V stays as it is; Q and K are rotated.

The implementation trick

Computing \(R_p Q\) as a block-diagonal matmul is wasteful — most entries are zero. The trick: write it elementwise.

For pair \((2k, 2k+1)\) at position \(p\):

\[ q'_{2k} = q_{2k} \cos(p \omega_k) - q_{2k+1} \sin(p \omega_k) \]

\[ q'_{2k+1} = q_{2k} \sin(p \omega_k) + q_{2k+1} \cos(p \omega_k) \]

Vectorize over the whole tensor. Let cos_pe[p, k] = cos(p * omega_k) and sin_pe[p, k] = sin(p * omega_k), both shape (T, d_head // 2). Then:

# q shape: (T, d_head)
q_even = q[:, 0::2]  # (T, d_head // 2)
q_odd  = q[:, 1::2]  # (T, d_head // 2)
q_rot_even = q_even * cos_pe - q_odd * sin_pe   # (T, d_head // 2)
q_rot_odd  = q_even * sin_pe + q_odd * cos_pe   # (T, d_head // 2)
# Interleave back to (T, d_head)
q_rope = np.empty_like(q)
q_rope[:, 0::2] = q_rot_even
q_rope[:, 1::2] = q_rot_odd
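The tables `cos_pe` and `sin_pe` used above can be built with one outer product of positions and frequencies. The sketch below, with assumed sizes `T = 10` and `d_head = 8`, builds them, runs the elementwise form, and compares one row against an explicit 2x2 rotation of each pair.

```python
import numpy as np

T, d_head, base = 10, 8, 10000.0
omega = base ** (-2.0 * np.arange(d_head // 2) / d_head)  # (d_head // 2,)
angles = np.arange(T)[:, None] * omega[None, :]            # (T, d_head // 2)
cos_pe, sin_pe = np.cos(angles), np.sin(angles)

rng = np.random.default_rng(0)
q = rng.standard_normal((T, d_head))

# Elementwise form from the text.
q_rope = np.empty_like(q)
q_rope[:, 0::2] = q[:, 0::2] * cos_pe - q[:, 1::2] * sin_pe
q_rope[:, 1::2] = q[:, 0::2] * sin_pe + q[:, 1::2] * cos_pe

# Reference: explicit 2x2 rotation of every pair at a single position p.
p = 3
ref = np.empty(d_head)
for k in range(d_head // 2):
    c, s = np.cos(p * omega[k]), np.sin(p * omega[k])
    ref[2 * k : 2 * k + 2] = np.array([[c, -s], [s, c]]) @ q[p, 2 * k : 2 * k + 2]
```

Row `p` of `q_rope` should equal `ref`, and every row should keep its original norm.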

Or, equivalently up to a fixed permutation of the dimensions, using the "rotate by 90°" helper:

def rotate_half(x):
    """Split x = [x_a, x_b] along last dim, return [-x_b, x_a]."""
    d = x.shape[-1]
    x_a = x[..., : d // 2]
    x_b = x[..., d // 2 :]
    return np.concatenate([-x_b, x_a], axis=-1)

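# Apply: q * cos + rotate_half(q) * sin
# Here cos and sin must have shape (T, d_head): tile the half-width tables,
# e.g. cos = np.concatenate([cos_pe, cos_pe], axis=-1), and likewise for sin.

The second form is what most production implementations use. It pairs dimension \(k\) with \(k + d_\text{head}/2\) rather than \(2k\) with \(2k+1\), so it matches the interleaved form only after a fixed permutation of the coordinates. Both conventions satisfy the relative-position property, and because \(W_Q, W_K\) are learned, the choice does not matter as long as Q, K, and the cos/sin tables all use the same one. Lab 02 implements both and verifies they agree.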

Implementation detail: there are two conventions for pairing dimensions, adjacent \((2k, 2k+1)\) or by halves \((k, k + d/2)\). Lab 02 makes clear which one we use. Both are mathematically equivalent; the code and the cos/sin precomputations must be consistent.
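The relationship between the two conventions can be made concrete. The sketch below is an illustration under assumed sizes and hypothetical helper names (`rope_halves`, `rope_adjacent`); it is not Lab 02's code. It shows that the raw outputs differ, but agree once the coordinates are permuted so that even indices come first, and that the resulting attention scores are then identical.

```python
import numpy as np

def rotate_half(x):
    d = x.shape[-1]
    return np.concatenate([-x[..., d // 2 :], x[..., : d // 2]], axis=-1)

def rope_halves(x, cos_pe, sin_pe):
    """Pairs (k, k + d/2); the half-width tables are tiled to width d_head."""
    cos = np.concatenate([cos_pe, cos_pe], axis=-1)
    sin = np.concatenate([sin_pe, sin_pe], axis=-1)
    return x * cos + rotate_half(x) * sin

def rope_adjacent(x, cos_pe, sin_pe):
    """Pairs (2k, 2k+1), as in the interleaved code."""
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos_pe - x[:, 1::2] * sin_pe
    out[:, 1::2] = x[:, 0::2] * sin_pe + x[:, 1::2] * cos_pe
    return out

T, d_head = 5, 8
omega = 10000.0 ** (-2.0 * np.arange(d_head // 2) / d_head)
angles = np.arange(T)[:, None] * omega[None, :]
cos_pe, sin_pe = np.cos(angles), np.sin(angles)

rng = np.random.default_rng(0)
q, k = rng.standard_normal((T, d_head)), rng.standard_normal((T, d_head))

# Permutation that moves even indices to the first half, odd to the second.
perm = np.concatenate([np.arange(0, d_head, 2), np.arange(1, d_head, 2)])

adj_q, adj_k = rope_adjacent(q, cos_pe, sin_pe), rope_adjacent(k, cos_pe, sin_pe)
hal_q = rope_halves(q[:, perm], cos_pe, sin_pe)
hal_k = rope_halves(k[:, perm], cos_pe, sin_pe)

differs_without_perm = not np.allclose(adj_q, rope_halves(q, cos_pe, sin_pe))
same_up_to_perm = np.allclose(adj_q[:, perm], hal_q)
same_scores = np.allclose(adj_q @ adj_k.T, hal_q @ hal_k.T)
```

The practical consequence is that weights trained under one convention cannot be loaded into code that uses the other without permuting the Q and K projection outputs.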

Verifying the relative-position property

The property states: \(\langle R_m q, R_n k \rangle = \langle q, R_{n-m} k \rangle\).

Numerical check (lab 02 implements this):

Pick random \(q, k \in \mathbb{R}^{d_\text{head}}\).
Pick positions \(m = 5, n = 7\) (so \(n - m = 2\)).
Compute \(\text{LHS} = \langle R_5 q, R_7 k \rangle\).
Compute \(\text{RHS} = \langle q, R_2 k \rangle\).
Assert \(|\text{LHS} - \text{RHS}| < 10^{-5}\).

If your implementation is correct, this holds for any choice of \(m, n, q, k\). If it doesn't, the rotation matrices are wrong.
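The steps above can be sketched as follows. This is one possible version of the check, assuming the adjacent-pair convention and \(d_\text{head} = 16\); `rope_vec` is a hypothetical helper and the lab's own code may be organized differently.

```python
import numpy as np

def rope_vec(x, p, base=10000.0):
    """Apply R_p to one vector x of shape (d_head,), adjacent-pair convention."""
    d_head = x.shape[0]
    omega = base ** (-2.0 * np.arange(d_head // 2) / d_head)
    cos, sin = np.cos(p * omega), np.sin(p * omega)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

rng = np.random.default_rng(0)
d_head = 16
q, k = rng.standard_normal(d_head), rng.standard_normal(d_head)

m, n = 5, 7
lhs = rope_vec(q, m) @ rope_vec(k, n)   # <R_5 q, R_7 k>
rhs = q @ rope_vec(k, n - m)            # <q, R_2 k>
assert abs(lhs - rhs) < 1e-5
```

Repeating the check over several random `(m, n)` pairs is a cheap way to catch a sign error in the sine terms, which a single pair can occasionally hide.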

What about the multi-head dimension?

Every head has the same dimension \(d_\text{head}\). RoPE is applied to each head separately, using the frequency schedule defined on \(d_\text{head}\) (not on \(d_\text{model}\)).

If \(d_\text{model} = 64\) and \(n_\text{heads} = 4\), then \(d_\text{head} = 16\), so each head gets 8 pairs, one frequency per pair: \(\omega_k = 1 / 10000^{2k/16}\) for \(k = 0, \ldots, 7\).

Different heads have the same frequency schedule. Specialization between heads comes from learned \(W_Q, W_K\) — RoPE just adds positional structure on top.
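In code, sharing the schedule across heads falls out of broadcasting. The sketch below uses the numbers from the example and assumes a `(n_heads, T, d_head)` layout for Q, which is an assumption for illustration rather than the layout Phase 17 necessarily uses.

```python
import numpy as np

d_model, n_heads, T = 64, 4, 10
d_head = d_model // n_heads                                  # 16
omega = 10000.0 ** (-2.0 * np.arange(d_head // 2) / d_head)  # 8 frequencies
angles = np.arange(T)[:, None] * omega[None, :]              # (T, 8)
cos_pe, sin_pe = np.cos(angles), np.sin(angles)

rng = np.random.default_rng(0)
q = rng.standard_normal((n_heads, T, d_head))  # assumed layout: (heads, time, dim)

# The (T, 8) tables broadcast against (n_heads, T, 8) slices,
# so every head receives exactly the same rotation at each position.
q_rope = np.empty_like(q)
q_rope[..., 0::2] = q[..., 0::2] * cos_pe - q[..., 1::2] * sin_pe
q_rope[..., 1::2] = q[..., 0::2] * sin_pe + q[..., 1::2] * cos_pe
```

No per-head tables are needed; a single precomputed pair of `(T, d_head // 2)` arrays serves all heads.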

Extrapolation argument

Why does RoPE extrapolate? Two threads:

The relative-position property holds at any \(m, n\). Whether \(m = 5, n = 7\) (training) or \(m = 5000, n = 5002\) (extrapolation), the difference \(n - m = 2\) controls the score. The learned \(W_Q W_K^\top\) was trained to handle relative-position-2 patterns, and that knowledge transfers.
The PE values stay bounded. Sinusoidal PE values are in \([-1, 1]\) for all positions. RoPE rotations preserve norm. There's no "drift to infinity" as \(p\) increases. Models can be evaluated at unseen positions with stable activations.

This is better than sinusoidal PE, which adds position to the embedding and leaves the model to learn how to separate position from content. RoPE keeps position in the attention mechanism and content in V, so the separation is enforced by construction.

Extrapolation summary: RoPE separates "what to say" (V) from "whom to attend to" (Q, K). Position affects only the attention, multiplicatively, via rotation. This generalizes to unseen positions better than any scheme that mixes position and content additively.

Practical caveats

Long-context fine-tuning. Models trained at \(T = 1024\) with RoPE still run at \(T = 4096\), but quality degrades. Long-context fine-tuning (a Phase 19+ topic) helps.
Position interpolation. A common trick: at inference, scale positions by \(T_\text{train} / T_\text{infer}\). This keeps the rotation angles within the range seen during training. Lab 03 mentions this but does not implement it.
NTK-aware scaling. A more sophisticated version of position interpolation that respects the high-frequency components. Out of scope.
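For orientation, position interpolation amounts to a one-line change in how the angle table is built. The sketch below is a minimal illustration, not the course's implementation: `interpolated_angles` is a hypothetical helper, and clamping the scale to at most 1 (so shorter sequences are left untouched) is an assumption added here.

```python
import numpy as np

def interpolated_angles(T_infer: int, T_train: int, d_head: int,
                        base: float = 10000.0) -> np.ndarray:
    """Angles p' * omega_k with p' = p * min(1, T_train / T_infer)."""
    omega = base ** (-2.0 * np.arange(d_head // 2) / d_head)
    scale = min(1.0, T_train / T_infer)  # assumption: never stretch positions
    positions = np.arange(T_infer) * scale
    return positions[:, None] * omega[None, :]

angles = interpolated_angles(T_infer=4096, T_train=1024, d_head=16)
cos_pe, sin_pe = np.cos(angles), np.sin(angles)

# omega_0 = 1, so the first column is just the scaled position.
max_pos = angles[:, 0].max()
```

With these numbers the largest effective position is 1023.75, which stays inside the trained range of 0 to 1023 plus a fraction; the cost is that neighbouring tokens are now only a quarter of a position apart.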

API (locked for `src/minimodel/positional/rope.py`)

def rope_frequencies(d_head: int, base: float = 10000.0) -> np.ndarray:
    """Return ω_k for k = 0, ..., d_head/2 - 1. Shape (d_head/2,)."""
    ...

def precompute_rope(T: int, d_head: int) -> tuple[np.ndarray, np.ndarray]:
    """Return cos_pe, sin_pe of shape (T, d_head/2) each."""
    ...

def apply_rope(
    q: np.ndarray,        # (T, d_head)
    k: np.ndarray,        # (T, d_head)
    cos_pe: np.ndarray,   # (T, d_head/2)
    sin_pe: np.ndarray,   # (T, d_head/2)
) -> tuple[np.ndarray, np.ndarray]:
    """Apply RoPE to q and k. Return (q_rope, k_rope) of same shape."""
    ...

Phase 17 integrates this into MultiHeadAttention.forward as an optional positional encoding mode.
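To make the signatures concrete, here is one possible way to fill them in, followed by a usage example. This is a sketch under assumptions and not the repository's implementation: it assumes the adjacent-pair convention and the default base, and `_rotate` is a hypothetical private helper.

```python
import numpy as np

def rope_frequencies(d_head: int, base: float = 10000.0) -> np.ndarray:
    """Return omega_k for k = 0, ..., d_head/2 - 1. Shape (d_head/2,)."""
    return base ** (-2.0 * np.arange(d_head // 2) / d_head)

def precompute_rope(T: int, d_head: int) -> tuple[np.ndarray, np.ndarray]:
    """Return cos_pe, sin_pe of shape (T, d_head/2) each."""
    angles = np.arange(T)[:, None] * rope_frequencies(d_head)[None, :]
    return np.cos(angles), np.sin(angles)

def _rotate(x: np.ndarray, cos_pe: np.ndarray, sin_pe: np.ndarray) -> np.ndarray:
    """Hypothetical helper: rotate adjacent pairs (2k, 2k+1) of x."""
    out = np.empty_like(x)
    out[..., 0::2] = x[..., 0::2] * cos_pe - x[..., 1::2] * sin_pe
    out[..., 1::2] = x[..., 0::2] * sin_pe + x[..., 1::2] * cos_pe
    return out

def apply_rope(q, k, cos_pe, sin_pe):
    """Apply RoPE to q and k. Return (q_rope, k_rope) of same shape."""
    return _rotate(q, cos_pe, sin_pe), _rotate(k, cos_pe, sin_pe)

# Usage: put the same q0 and k0 at every position, so the score matrix
# can depend only on the offset n - m and is constant along each diagonal.
T, d_head = 12, 16
rng = np.random.default_rng(0)
q = np.tile(rng.standard_normal(d_head), (T, 1))
k = np.tile(rng.standard_normal(d_head), (T, 1))

cos_pe, sin_pe = precompute_rope(T, d_head)
q_rope, k_rope = apply_rope(q, k, cos_pe, sin_pe)
scores = q_rope @ k_rope.T   # scores[m, n] = <R_m q0, R_n k0>
```

With identical content at every position, `scores` should be a Toeplitz matrix: for example, `scores[0, 2]` and `scores[5, 7]` coincide.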

Recap

RoPE rotates Q and K by position-dependent angles (frequencies geometrically spaced, same schedule as sinusoidal).
The key property: \(\langle R_m q, R_n k \rangle\) depends only on \(n - m\).
V is not rotated — position affects routing only, not content.
Implementation is element-wise: \(q \cdot \cos + \text{rot}(q) \cdot \sin\). Vectorized.
Extrapolates because the relative-position dependency holds at any position pair.
Lab 02 verifies the relative-position property numerically.

You've now read all four Phase 16 theory files. Before opening the lab:

Write the RoPE rotation formula from memory (2D case).
State the key property in one sentence.
Explain why V is not rotated.

If any of these feel wobbly, re-read the relevant section.

Next: end of theory. Proceed to ../lab/00-permutation-equivariance.md.