01 — Sinusoidal Positional Encoding¶

Vaswani et al. (2017) used a fixed positional encoding built from sines and cosines at geometrically spaced frequencies. The idea: each pair of embedding dimensions corresponds to a different frequency, so the PE of position \(p+\Delta\) is a linear rotation of the PE of position \(p\). This lets the model learn to use relative positions without being given them explicitly.

This file derives the sinusoidal PE formula, explains why each piece (sin/cos pairing, geometric frequencies, base \(10000\)) was chosen, and shows the linear-shift property that makes it work.

The formula¶

For position \(p \in \{0, 1, \ldots, T-1\}\) and an even embedding dimension \(d\), the dimensions are filled in pairs \((2k, 2k+1)\):

\[ \text{PE}(p, 2k) = \sin(p \cdot \omega_k), \qquad \text{PE}(p, 2k+1) = \cos(p \cdot \omega_k) \]

where \(\omega_k = \frac{1}{10000^{2k/d}}\) for \(k = 0, 1, \ldots, d/2 - 1\).

The result is a \(T \times d\) matrix. The token embedding becomes:

\[ \tilde{x}_p = E[t_p] + \text{PE}(p) \]
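A minimal sketch of the addition above, assuming a randomly initialised embedding table and a toy token sequence. The names `pe_table`, `vocab_size`, `E`, and `tokens` are illustrative and not part of the project code.

```python
import numpy as np

def pe_table(T: int, d: int, base: float = 10000.0) -> np.ndarray:
    """PE(p, 2k) = sin(p * w_k), PE(p, 2k+1) = cos(p * w_k), shape (T, d)."""
    p = np.arange(T, dtype=np.float64)[:, None]   # (T, 1)
    k = np.arange(d // 2, dtype=np.float64)       # (d/2,)
    omega = base ** (-2.0 * k / d)                # (d/2,)
    pe = np.empty((T, d))
    pe[:, 0::2] = np.sin(p * omega)               # even dims: sine
    pe[:, 1::2] = np.cos(p * omega)               # odd dims: cosine
    return pe

rng = np.random.default_rng(0)
vocab_size, T, d = 50, 6, 8                # toy sizes, chosen for illustration
E = rng.normal(size=(vocab_size, d))       # hypothetical embedding table
tokens = np.array([3, 14, 15, 9, 2, 6])    # hypothetical token ids t_p

x_tilde = E[tokens] + pe_table(T, d)       # (T, d): one row per position
```

At position 0 the added vector is \((0, 1, 0, 1, \ldots)\), since \(\sin(0) = 0\) and \(\cos(0) = 1\) at every frequency.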

Three design choices, each justified below: sin/cos pairing, geometric frequency spacing, base \(10000\).

Design choice 1 — sin/cos pairing¶

Dimensions are paired: \((2k, 2k+1)\) together form a sine-cosine duo at the same frequency \(\omega_k\).

Why? The linear-shift property. Given \(\text{PE}(p)\), what is \(\text{PE}(p + \Delta p)\)? Using angle-addition:

\[ \sin((p + \Delta p) \omega_k) = \sin(p \omega_k) \cos(\Delta p \omega_k) + \cos(p \omega_k) \sin(\Delta p \omega_k) \]

\[ \cos((p + \Delta p) \omega_k) = \cos(p \omega_k) \cos(\Delta p \omega_k) - \sin(p \omega_k) \sin(\Delta p \omega_k) \]

In matrix form, for the dimension pair \((2k, 2k+1)\):

\[ \begin{pmatrix} \text{PE}(p + \Delta p, 2k) \\ \text{PE}(p + \Delta p, 2k+1) \end{pmatrix} = \begin{pmatrix} \cos(\Delta p \omega_k) & \sin(\Delta p \omega_k) \\ -\sin(\Delta p \omega_k) & \cos(\Delta p \omega_k) \end{pmatrix} \begin{pmatrix} \text{PE}(p, 2k) \\ \text{PE}(p, 2k+1) \end{pmatrix} \]

So \(\text{PE}(p + \Delta p)\) is a linear function of \(\text{PE}(p)\), with the rotation matrix depending only on \(\Delta p\) (not on \(p\) itself).

Consequence: the model can, in principle, learn a linear projection that extracts "shift by \(\Delta\)" from any positional encoding. Relative position is encoded.
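The shift property can be checked numerically. The sketch below builds the block-diagonal matrix from the \(2 \times 2\) blocks above and compares \(M(\Delta p)\,\text{PE}(p)\) with \(\text{PE}(p + \Delta p)\) at a few positions. The helper names `pe_vector` and `shift_matrix` are hypothetical, introduced only for this check.

```python
import numpy as np

def pe_vector(p: float, d: int, base: float = 10000.0) -> np.ndarray:
    """Single-position encoding PE(p), shape (d,)."""
    k = np.arange(d // 2)
    omega = base ** (-2.0 * k / d)
    v = np.empty(d)
    v[0::2] = np.sin(p * omega)
    v[1::2] = np.cos(p * omega)
    return v

def shift_matrix(delta: float, d: int, base: float = 10000.0) -> np.ndarray:
    """Block-diagonal M(delta) such that PE(p + delta) = M(delta) @ PE(p)."""
    k = np.arange(d // 2)
    omega = base ** (-2.0 * k / d)
    M = np.zeros((d, d))
    for j, w in enumerate(omega):
        c, s = np.cos(delta * w), np.sin(delta * w)
        # Same 2x2 block as in the matrix form: rows act on (sin, cos).
        M[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, s], [-s, c]]
    return M

d, delta = 8, 5
M = shift_matrix(delta, d)              # built once, without knowing p
errors = [
    np.abs(M @ pe_vector(p, d) - pe_vector(p + delta, d)).max()
    for p in (0, 7, 100)
]
```

The same `M` works for every `p`, which is the point: the matrix depends only on the offset.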

Without the sin/cos pairing — say, using only sines — you'd lose this property (a sine alone doesn't transform linearly under shift; you need both sin and cos to span the 2D rotation).
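One way to see this concretely is a least-squares fit at a single frequency: try to predict the shifted sine channel from sine-only features, then from the sin/cos pair. This is a sketch with arbitrarily chosen values (\(\omega = 1\), \(\Delta p = 1\), 50 positions).

```python
import numpy as np

omega, delta = 1.0, 1.0
p = np.arange(50, dtype=np.float64)
target = np.sin((p + delta) * omega)     # sine channel at the shifted positions

# Sine-only features: a single column sin(p * omega).
X_sin = np.sin(p * omega)[:, None]
coef_sin, *_ = np.linalg.lstsq(X_sin, target, rcond=None)
resid_sin = np.linalg.norm(X_sin @ coef_sin - target)

# Sin/cos pair: two columns, enough to span the rotation.
X_pair = np.stack([np.sin(p * omega), np.cos(p * omega)], axis=1)
coef_pair, *_ = np.linalg.lstsq(X_pair, target, rcond=None)
resid_pair = np.linalg.norm(X_pair @ coef_pair - target)
```

With the pair, the fit is exact up to floating-point error and the recovered coefficients are \((\cos(\Delta p\,\omega), \sin(\Delta p\,\omega))\), the first row of the rotation matrix. With sine alone, a large residual remains because the \(\cos(p\omega)\) component cannot be expressed.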

Key point: the sin/cos pair at each frequency is what lets "position \(p+\Delta\)" be expressed as a linear rotation of "position \(p\)". The model can learn to use that rotation to reason about relative positions.

Design choice 2 — geometric frequency spacing¶

The frequencies are \(\omega_k = 1 / 10000^{2k/d}\). So \(\omega_0 = 1\) and \(\omega_{d/2 - 1} = 10000^{-(1 - 2/d)}\), which approaches \(1/10000\) as \(d\) grows. The frequencies span close to four orders of magnitude, spaced geometrically across the dimension pairs.

Two reasons:

Why geometric, not arithmetic?¶

Arithmetic spacing (\(\omega_k = k \cdot \text{step}\)) gives a small dynamic range. With \(d = 64\) there are 32 frequencies; an arithmetic step of \(1/32\) gives \(\omega\) in \(\{1/32, 2/32, \ldots, 1\}\), so the slowest channel is only 32 times slower than the fastest. Every channel completes a full cycle within about 200 positions, so none of them varies slowly enough to encode coarse, long-range position.

Geometric spacing covers many scales. At the highest frequency (\(\omega_0 = 1\)), a position change of 1 causes a phase change of 1 radian, which is very sensitive to local position. At the lowest (\(\omega_{d/2-1} \approx 1/10000\)), a position change of 1 causes a phase change of about \(10^{-4}\) radians, so that channel varies slowly enough to encode very long-range position.

This multi-scale property is what lets a single PE represent both "the very next token" and "approximately 1000 tokens ago" in the same vector.
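To make the dynamic-range argument concrete, the sketch below compares the geometric frequencies with an arithmetic alternative for \(d = 64\). The arithmetic scheme here (32 frequencies evenly spaced on \((0, 1]\)) is an illustrative assumption, not something from the original paper.

```python
import numpy as np

d = 64
k = np.arange(d // 2)

omega_geo = 10000.0 ** (-2.0 * k / d)   # geometric, as in the formula
omega_ari = (k + 1) / (d // 2)          # assumed arithmetic alternative: 1/32, ..., 1

# Ratio of fastest to slowest frequency in each scheme.
range_geo = omega_geo.max() / omega_geo.min()
range_ari = omega_ari.max() / omega_ari.min()

# How many channels are "slow" (here: omega below 0.05, an arbitrary cutoff)?
slow_geo = int((omega_geo < 0.05).sum())
slow_ari = int((omega_ari < 0.05).sum())
```

The geometric scheme devotes most of its channels to slow frequencies spread over several decades, while the arithmetic scheme has a single channel below the cutoff and a fastest-to-slowest ratio of only 32.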

Connection to Fourier basis¶

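This isn't accidental: a bank of sinusoids at multiple frequencies is closely related to a (truncated) Fourier basis. Strictly, a Fourier series uses harmonically spaced frequencies, whereas these are log-spaced, so the analogy is loose. The pair \((\sin(p\omega_k), \cos(p\omega_k))\) is, up to sign convention and ordering, the Fourier transform of an impulse at \(p\) evaluated at frequency \(\omega_k\). The model can select any band of these frequencies by linear projection, i.e., it can pick out features at any spatial scale.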

If you've worked with image processing, this is the same logic as multi-resolution wavelets / Fourier decompositions: information at different scales encoded in different basis components.

Design choice 3 — base \(10000\)¶

Why \(10000\), specifically?

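The choice sets the wavelength range. The shortest wavelength (at \(k = 0\)) is \(2\pi \approx 6.3\). The longest (at \(k = d/2 - 1\)) is \(2\pi \cdot 10000^{1 - 2/d}\), which approaches \(2\pi \cdot 10000 \approx 63000\) as \(d\) grows.

For sequences of length \(T \leq 10000\) and all but very small \(d\), the longest wavelength exceeds the sequence length. The lowest-frequency sin/cos pair then completes less than one full cycle, so that pair alone already separates every position in the sequence.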

The choice is somewhat arbitrary — papers using base 5000 or base 50000 exist. For our Mini-GPT with \(T = 128\), base \(10000\) is overkill (we don't need to distinguish positions 5000 and 9999), but it doesn't hurt.

If you really wanted to "tune" the base, you could set it so that the wavelength range matches your max sequence length: \(\theta_\text{base} \approx T_\text{max} / (2\pi)\). Don't bother for Phase 17.
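A short sketch of what the base does to the wavelength range, and of the tuning rule above for \(T_\text{max} = 128\). The helper `wavelength_range` is hypothetical. Note that for finite \(d\) the longest wavelength is \(2\pi \cdot \text{base}^{1 - 2/d}\), slightly below \(2\pi \cdot \text{base}\).

```python
import numpy as np

def wavelength_range(d: int, base: float) -> tuple[float, float]:
    """Shortest and longest wavelength 2*pi/omega_k over k = 0 .. d/2 - 1."""
    k = np.arange(d // 2)
    omega = base ** (-2.0 * k / d)
    lam = 2.0 * np.pi / omega
    return float(lam.min()), float(lam.max())

# Default base: shortest wavelength is 2*pi, longest is below 2*pi*10000.
shortest, longest = wavelength_range(d=64, base=10000.0)
limit = 2.0 * np.pi * 10000.0

# Tuned base for a short context, following base ~ T_max / (2*pi).
T_max = 128
tuned_base = T_max / (2.0 * np.pi)
_, tuned_longest = wavelength_range(d=64, base=tuned_base)
```

With the tuned base the longest wavelength lands just under `T_max`, so the slowest channel sweeps most of one cycle across the context instead of staying nearly constant.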

The dot-product-decay-with-distance property¶

A subtle but important property: the dot product \(\text{PE}(0) \cdot \text{PE}(p)\) decays as \(p\) increases (with oscillation). This means embeddings of nearby positions look more similar than embeddings of distant positions — the natural "nearby = related" inductive bias.

Sketch of why: \(\text{PE}(0) \cdot \text{PE}(p) = \sum_k [\sin(0) \sin(p \omega_k) + \cos(0) \cos(p \omega_k)] = \sum_k \cos(p \omega_k)\), a sum of \(d/2\) cosines with different periods. For small \(p\), all cosines are near \(1\), so the sum is close to \(d/2\). As \(p\) grows, a term stays near \(1\) only while \(p \omega_k \ll 1\); the faster channels have spread over all phases and contribute roughly \(0\) on average. The sum therefore shrinks as more channels drop out, and it hovers around \(0\) only once \(p\) is well beyond the longest wavelength.

The decay is not monotone — there are oscillations as different frequencies come into phase. Lab 01 plots this.
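The closed form \(\sum_k \cos(p\,\omega_k)\) is cheap to evaluate directly. The sketch below computes it for a handful of offsets with \(d = 64\); the function name `pe_dot_with_origin` and the chosen offsets are illustrative.

```python
import numpy as np

def pe_dot_with_origin(p: np.ndarray, d: int, base: float = 10000.0) -> np.ndarray:
    """PE(0) . PE(p) = sum_k cos(p * omega_k), evaluated for each p."""
    k = np.arange(d // 2)
    omega = base ** (-2.0 * k / d)
    angles = np.asarray(p, dtype=np.float64)[:, None] * omega   # (len(p), d/2)
    return np.cos(angles).sum(axis=1)

d = 64
p = np.array([0, 1, 4, 16, 64, 256, 1024])
dots = pe_dot_with_origin(p, d)

for offset, value in zip(p, dots):
    print(f"p = {offset:5d}   PE(0).PE(p) = {value:6.2f}")
```

At \(p = 0\) the value is exactly \(d/2 = 32\). Evaluating on a dense grid of \(p\) instead of these few offsets shows the oscillations on top of the overall decay.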

Bottom line on sinusoidal PE: it is a defensible design decision. It is multi-scale, parameter-free, and has the linear-shift-under-translation property that lets the model learn attention by relative position. It is not the best option for extrapolation (that is RoPE), but it is historically the first serious attempt and remains reasonable.

Worked example: \(T = 4, d = 8\)¶

Frequencies: \(\omega_0 = 1, \omega_1 = 1/10, \omega_2 = 1/100, \omega_3 = 1/1000\). (At \(d = 8\): \(\omega_k = 1/10000^{2k/8} = 1/10000^{k/4} = 1/10^{k}\), so these frequencies are exact. The table entries below are rounded to two decimals.)

\(p\)	sin(\(p\omega_0\))	cos(\(p\omega_0\))	sin(\(p\omega_1\))	cos(\(p\omega_1\))	sin(\(p\omega_2\))	cos(\(p\omega_2\))	sin(\(p\omega_3\))	cos(\(p\omega_3\))
0	0.00	1.00	0.00	1.00	0.00	1.00	0.00	1.00
1	0.84	0.54	0.10	1.00	0.01	1.00	0.00	1.00
2	0.91	-0.42	0.20	0.98	0.02	1.00	0.00	1.00
3	0.14	-0.99	0.30	0.96	0.03	1.00	0.00	1.00

The high-frequency dimensions (left) oscillate rapidly. The low-frequency dimensions (right) barely change over \(p = 0..3\). Each row is the 8-dim PE vector for that position.

Lab 01 has you generate this for \(T = 64, d = 32\) and visualize as a heatmap.
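The worked example for \(T = 4, d = 8\) can be reproduced in a few lines. This is a standalone sketch that recomputes the encoding inline and rounds to two decimals; the variable names are chosen for illustration.

```python
import numpy as np

T, d = 4, 8
p = np.arange(T, dtype=np.float64)[:, None]   # (T, 1)
k = np.arange(d // 2)
omega = 10000.0 ** (-2.0 * k / d)             # 1, 0.1, 0.01, 0.001

pe = np.empty((T, d))
pe[:, 0::2] = np.sin(p * omega)               # columns 0, 2, 4, 6
pe[:, 1::2] = np.cos(p * omega)               # columns 1, 3, 5, 7

table = np.round(pe, 2)
for pos, row in enumerate(table):
    print(pos, "  ".join(f"{v:5.2f}" for v in row))
```

Each printed row is the full 8-dimensional PE vector for that position, with columns ordered as sin/cos pairs from the highest frequency to the lowest.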

Implementation (locked for `src/minimodel/positional/sinusoidal.py`)¶

import numpy as np


def sinusoidal_pe(T: int, d: int) -> np.ndarray:
    """Sinusoidal positional encoding, shape (T, d).

    Following Vaswani et al. 2017.
    """
    assert d % 2 == 0, "d must be even for sin/cos pairing"
    positions = np.arange(T, dtype=np.float32)[:, None]    # (T, 1)
    k = np.arange(d // 2, dtype=np.float32)                # (d/2,)
    omega = 1.0 / (10000.0 ** (2 * k / d))                 # (d/2,)
    angles = positions * omega                              # (T, d/2)
    pe = np.zeros((T, d), dtype=np.float32)
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

Fewer than ten lines of logic. Borja writes this in lab 01.

What this file does NOT cover¶

Learned PE. Next file (02-learned-vs-sinusoidal.md).
Rotary (RoPE). 03-rope.md.
ALiBi, T5-bias. Mentioned in passing in file 02; not derived.
Concatenation vs addition. We add. Concatenation reserves dimensions; addition shares them. The choice is "convention" — derivation belongs in a thesis, not here.
Tuning the base \(\theta = 10000\). Phase 17 leaves it at the default. Tuning is a research topic.

Recap¶

Sinusoidal PE: \((2k, 2k+1)\) paired, sin/cos at frequency \(\omega_k = 1 / 10000^{2k/d}\).
Three design choices: sin/cos pair (linear-shift property), geometric frequencies (multi-scale), base \(10000\) (wavelength range).
Linear-shift: \(\text{PE}(p + \Delta)\) is a fixed rotation of \(\text{PE}(p)\), enabling implicit relative-position reasoning.
Multi-scale frequencies: loosely, a (truncated) Fourier-style basis for the position.
Dot-product decay with distance: nearby positions look more similar.

Next: 02-learned-vs-sinusoidal.md.