
00 — Motivation: Attention is Permutation-Equivariant¶

If you permute the input to attention, you get the same output, with its rows permuted in the same way. This means that attention, on its own, cannot distinguish he works from works he. We need to add positional information to the model. How to do that (adding, concatenating, rotating) is what this phase resolves.

This file proves attention is permutation-equivariant, makes the consequence concrete with a 3-token example, and sets up the three solutions (sinusoidal, learned, RoPE) that follow.

The claim¶

Let \(\pi\) be any permutation of \(\{1, 2, \ldots, T\}\). Let \(X\) be a sequence of \(T\) token embeddings, and \(X_\pi\) be the same tokens reordered according to \(\pi\) (i.e., \((X_\pi)_i = X_{\pi(i)}\)).

Claim: the attention output on \(X_\pi\) is the same as the attention output on \(X\), with the rows permuted by \(\pi\).

In symbols: \(\text{Attention}(X_\pi) = (\text{Attention}(X))_\pi\).

This is permutation-equivariance: permuting the input permutes the output in exactly the same way. The function commutes with permutations.

Proof¶

Let \(A = \text{Attention}(X) \in \mathbb{R}^{T \times d}\). We want to show \(\text{Attention}(X_\pi) = A_\pi\).

Write attention with the permuted input:

\[ Q_\pi = X_\pi W_Q, \quad K_\pi = X_\pi W_K, \quad V_\pi = X_\pi W_V \]

Since \(W_Q, W_K, W_V\) are the same for every position, row \(i\) of \(X_\pi W_Q\) depends only on row \(i\) of \(X_\pi\), so \((Q_\pi)_i = Q_{\pi(i)}\), and similarly for \(K_\pi, V_\pi\). Permuting the input rows just permutes the Q, K, V rows.

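The score matrix: \((Q_\pi K_\pi^\top)_{ij} = Q_{\pi(i)} \cdot K_{\pi(j)} = S_{\pi(i), \pi(j)}\), where \(S = QK^\top\) is the score matrix of the unpermuted input. (The usual \(1/\sqrt{d}\) scaling multiplies every score by the same constant, so it does not affect the argument.)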

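Softmax is applied row-wise. Write \(S_\pi = Q_\pi K_\pi^\top\) for the permuted scores and \(S = QK^\top\) for the original ones: row \(i\) of \(S_\pi\) is row \(\pi(i)\) of \(S\) with its columns permuted by \(\pi\). Softmax exponentiates each entry and divides by the row sum, and the row sum does not change when the columns are reordered, so the attention weights are permuted the same way: \(\big(A^{(\text{intermediate})}_\pi\big)_{ij} = A^{(\text{intermediate})}_{\pi(i), \pi(j)}\), where \(A^{(\text{intermediate})} = \text{softmax}(S)\) denotes the attention weights (not the output \(A\)).

The output: \(\big(A^{(\text{intermediate})}_\pi V_\pi\big)_i = \sum_j A^{(\text{intermediate})}_{\pi(i), \pi(j)} V_{\pi(j)} = \sum_{j'} A^{(\text{intermediate})}_{\pi(i), j'} V_{j'} = A_{\pi(i)}\), re-indexing with \(j' = \pi(j)\), which is valid because \(\pi\) is a bijection.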

So the \(i\)-th row of the output is the \(\pi(i)\)-th row of \(A\). The permuted-input output equals the permutation of the original output. Done.
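The proof can also be checked numerically. The following is a minimal sketch, assuming single-head attention without masking or \(1/\sqrt{d}\) scaling, random weights, and a random permutation; the helper names (`matmul`, `attention`, `permute_rows`) and the sizes are illustrative choices, not part of the course code.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax of one row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x, w_q, w_k, w_v):
    """Single-head self-attention (assumption: no masking, no scaling)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    k_t = [list(col) for col in zip(*k)]          # K^T
    weights = [softmax(row) for row in matmul(q, k_t)]
    return matmul(weights, v)

def permute_rows(m, perm):
    """Row i of the result is row perm[i] of m, i.e. (M_pi)_i = M_{pi(i)}."""
    return [m[p] for p in perm]

rng = random.Random(0)
T, d = 5, 4  # illustrative sizes

def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

x = rand_matrix(T, d)
w_q, w_k, w_v = rand_matrix(d, d), rand_matrix(d, d), rand_matrix(d, d)
perm = list(range(T))
rng.shuffle(perm)

lhs = attention(permute_rows(x, perm), w_q, w_k, w_v)   # Attention(X_pi)
rhs = permute_rows(attention(x, w_q, w_k, w_v), perm)   # Attention(X)_pi
max_gap = max(abs(a - b) for ra, rb in zip(lhs, rhs) for a, b in zip(ra, rb))
print("max |Attention(X_pi) - Attention(X)_pi| =", max_gap)
```

The gap should be at floating-point rounding level rather than exactly zero, because permuting the input changes the order in which the sums are accumulated.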

A concrete 3-token example¶

Take \(T = 3\), \(d = 2\). Tokens at positions 1, 2, 3 with embeddings:

\[ x_1 = (1, 0), \quad x_2 = (0, 1), \quad x_3 = (1, 1) \]

Let \(W_Q = W_K = W_V = I\) (identity).

Then \(Q = K = V = X\). Scores \(S = QK^\top\):

\[ S = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 2 \end{pmatrix} \]

After softmax (row-wise; we omit the usual \(1/\sqrt{d}\) scaling to keep the arithmetic simple):

\[ A_{ij} = \text{softmax}(S_i)_j \]

Output \(Y = A V\).

Now swap positions 1 and 3 in the input: \(X' = [x_3, x_2, x_1] = [(1,1), (0,1), (1,0)]\). Compute \(Y'\).

By the permutation-equivariance proof: \(Y' = [y_3, y_2, y_1]\) — same values, reordered.
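The example above leaves the numbers uncomputed. A small sketch that evaluates both \(Y\) and \(Y'\) directly, under the same assumptions as the text (identity weights, no scaling); the function name `attention_identity` is a hypothetical helper introduced only for this illustration.

```python
import math

def softmax(row):
    """Softmax of one row of scores."""
    exps = [math.exp(v) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_identity(x):
    """Attention with W_Q = W_K = W_V = I, so Q = K = V = X (no scaling)."""
    scores = [[sum(a * b for a, b in zip(xi, xj)) for xj in x] for xi in x]
    weights = [softmax(row) for row in scores]
    d = len(x[0])
    return [[sum(w * xj[c] for w, xj in zip(row, x)) for c in range(d)]
            for row in weights]

x = [(1, 0), (0, 1), (1, 1)]        # x_1, x_2, x_3
x_swapped = [x[2], x[1], x[0]]      # positions 1 and 3 swapped

y = attention_identity(x)
y_swapped = attention_identity(x_swapped)

for name, rows in (("Y ", y), ("Y'", y_swapped)):
    print(name, [[round(v, 4) for v in row] for row in rows])
```

Each token's output vector is identical in the two runs; only the row in which it appears changes.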

The model cannot tell which sequence it was given. If we want he works to be different from works he, we have to break this equivariance. That is the entire content of positional encoding.

Summary: without positions, the model treats the sequence as a set. For it to distinguish he works from works he, we have to inject positional information. How to do that without breaking attention's other good properties is the problem this phase solves.

Why this matters for our verb-grammar task¶

Word order matters in language. Dog bites man and Man bites dog have the same words, different meaning. Without positional info, the model couldn't distinguish them.

For our specific microscopic scope (§A13: 20 verbs, 5 tenses, 3 persons), word order is the grammatical signal. Consider:

- he works: subject + third-person-singular verb. Correct.
- works he: same tokens, but the order signals an "imperative + addressee" reading. Different meaning.
- I work, you work, he ___: slot 7 must agree with the token at position 6 (he). The slot's neighbor on the left is the relevant subject; a token at the same offset on the right would not be.

The model has to know which token came before which. Without positional encoding, attention treats the sequence as a bag of tokens — and a bag of tokens cannot distinguish "subject precedes verb" from "verb precedes subject". Person agreement, tense ordering (auxiliary + main verb: will work), and bilingual alignment (English↔Spanish pairings) all require position information.

Three solutions¶

The rest of Phase 16 is three answers to "how do we add position information?":

Solution 1 — Add to the embedding¶

The original Transformer (Vaswani et al., 2017) adds a positional vector to each token embedding before the first attention layer:

\[ \tilde{x}_p = E[t_p] + \text{PE}(p) \]

where \(E\) is the token-embedding table, \(t_p\) is the token at position \(p\), and \(\text{PE}(p) \in \mathbb{R}^d\) is some function of position.

Two flavors:

- Sinusoidal: \(\text{PE}\) is a fixed (untrained) function using sin/cos at varying frequencies. Derived in theory/01-sinusoidal.md.
- Learned: \(\text{PE}\) is a lookup into a trainable \(T_\text{max} \times d\) matrix. Derived in theory/02-learned-vs-sinusoidal.md.

Both modify the input. Attention sees position information mixed into the embedding.
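To make the additive scheme concrete, here is a minimal sketch using the standard sinusoidal formula from Vaswani et al. (sine on even dimensions, cosine on odd dimensions, base 10000). The embedding table `E`, the dimension 4, and 0-indexed positions are assumptions made for illustration; the derivation in theory/01-sinusoidal.md may use different conventions.

```python
import math

def sinusoidal_pe(position, d):
    """Sinusoidal encoding: PE[2k] = sin(p / 10000^(2k/d)), PE[2k+1] = cos(same angle)."""
    pe = []
    for i in range(0, d, 2):
        angle = position / (10000 ** (i / d))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d]

# Hypothetical 4-dimensional embedding table (not from the course code).
E = {"he": [1.0, 0.0, 0.0, 0.0], "works": [0.0, 1.0, 0.0, 0.0]}

def embed(tokens):
    """x~_p = E[t_p] + PE(p), with positions counted from 0 (an assumption)."""
    return [[e + p for e, p in zip(E[tok], sinusoidal_pe(pos, len(E[tok])))]
            for pos, tok in enumerate(tokens)]

a = embed(["he", "works"])
b = embed(["works", "he"])

# Without PE the two inputs contain the same rows in a different order.
# With PE the rows themselves differ, so the inputs are no longer permutations of each other.
same_rows = sorted(a) == sorted(b)
print("same multiset of input rows:", same_rows)
```

Because the two inputs are no longer row permutations of each other, the equivariance argument no longer forces the outputs to match.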

Solution 2 — Inject inside attention (RoPE)¶

Rotary Position Embedding (Su et al., 2021) doesn't touch the embedding. Instead, it rotates Q and K inside the attention layer:

\[ Q'_p = \text{rotate}(Q_p, \omega \cdot p), \qquad K'_p = \text{rotate}(K_p, \omega \cdot p) \]

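The rotation depends on position \(p\). After rotation, the dot product \(\langle Q'_m, K'_n \rangle\) depends on the positions only through the offset \(m - n\) (it still depends on the contents of \(Q_m\) and \(K_n\)), so relative position emerges naturally. Derived in theory/03-rope.md.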

V is untouched. The position is purely a property of how attention weights are computed.
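The relative-position property is easy to check for a single 2-D pair. This is a sketch under simplifying assumptions: one rotation frequency `omega` (a hypothetical value) and 2-dimensional Q and K, whereas full RoPE applies a different frequency to each of the \(d/2\) coordinate pairs.

```python
import math

def rotate(vec, angle):
    """Rotate a 2-D vector counter-clockwise by `angle` radians."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

def rope_score(q, k, m, n, omega):
    """Dot product of q rotated by omega*m with k rotated by omega*n."""
    q_rot = rotate(q, omega * m)
    k_rot = rotate(k, omega * n)
    return q_rot[0] * k_rot[0] + q_rot[1] * k_rot[1]

q, k = (0.3, -1.2), (0.8, 0.5)   # arbitrary example vectors
omega = 0.1                      # hypothetical frequency

s_a = rope_score(q, k, 5, 3, omega)     # offset m - n = 2
s_b = rope_score(q, k, 12, 10, omega)   # same offset, shifted positions
s_c = rope_score(q, k, 5, 4, omega)     # offset m - n = 1

print("offset 2 at (5, 3):  ", s_a)
print("offset 2 at (12, 10):", s_b)
print("offset 1 at (5, 4):  ", s_c)
```

The first two scores agree up to rounding because rotating both vectors by the same extra angle leaves their dot product unchanged; the third differs because the offset differs.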

Solution 3 — Bias the attention score (ALiBi, T5, ...)¶

Add a position-dependent bias to the attention score:

\[ S'_{ij} = Q_i \cdot K_j + b(i, j) \]

where \(b(i, j)\) is a fixed function of the distance \(|i - j|\) (ALiBi) or a learned value per offset (T5-style). With a bias that decreases as the distance grows, as in ALiBi, nearby positions receive more attention, and Q, K, V are not modified at all.
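A minimal sketch of a distance-based bias, applied to the score matrix \(S\) from the 3-token example. The linear penalty and the `slope` value are illustrative assumptions; ALiBi as published uses head-specific slopes in a causal setting, which this symmetric version does not reproduce.

```python
def distance_bias(i, j, slope):
    """ALiBi-style penalty (assumed form): grows linearly with the distance |i - j|."""
    return -slope * abs(i - j)

def biased_scores(scores, slope):
    """S'_ij = S_ij + b(i, j); Q, K and V themselves are left untouched."""
    return [[s + distance_bias(i, j, slope) for j, s in enumerate(row)]
            for i, row in enumerate(scores)]

scores = [[1, 0, 1],
          [0, 1, 1],
          [1, 1, 2]]      # S = Q K^T from the 3-token example
slope = 0.5               # hypothetical slope

biased = biased_scores(scores, slope)
for row in biased:
    print(row)
```

The diagonal is unchanged, while entries further from the diagonal are pushed down, so the scores now depend on where each token sits in the sequence.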

We mention these in theory/02 but don't implement them. RoPE has cleaner extrapolation; learned T5 biases are more limited.

Which one does Phase 17 use?¶

Spoiler from the spec (LYNX_CORTEX.md §4 PHASE 16): "Why modern LLMs use RoPE."

The decision is RoPE, contingent on:

- The relative-position property being verified numerically in lab 02.
- The extrapolation comparison in lab 03 showing RoPE generalizing better than sinusoidal beyond the training length.

If RoPE proves to be too complex to integrate cleanly into Phase 17's transformer block, fall back to sinusoidal (still extrapolates better than learned). Either way, learned positional embeddings are not the choice — they fail at extrapolation by construction.
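The "by construction" point about learned embeddings can be sketched in a few lines. The table size `T_MAX`, the dimension `D`, the random initialization, and the `lookup` helper are all hypothetical; a real implementation would typically raise an index error or require resizing and retraining the table instead of returning `None`.

```python
import random

T_MAX, D = 4, 2   # hypothetical maximum training length and embedding size
rng = random.Random(0)

# A learned table has exactly T_MAX rows; nothing exists for later positions.
learned_pe = [[rng.gauss(0.0, 0.02) for _ in range(D)] for _ in range(T_MAX)]

def lookup(position):
    """Return the learned vector for a position, or None if no row was ever allocated."""
    if 0 <= position < T_MAX:
        return learned_pe[position]
    return None

for p in (0, T_MAX - 1, T_MAX):
    print(p, lookup(p))
```

A formula-based scheme such as sinusoidal or RoPE can at least be evaluated at any position; whether the model then behaves well at unseen positions is what the extrapolation lab is meant to measure.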

What's coming¶

- theory/01-sinusoidal.md: derive the sinusoidal formula, prove the linear-shift property.
- theory/02-learned-vs-sinusoidal.md: when each works, why both fail.
- theory/03-rope.md: the rotation derivation, the relative-position property.
- Three labs (one per scheme) plus a fourth comparison lab.

Permutation-equivariance is the problem. The next three theory files are three solutions. Pick wisely; the Mini-GPT depends on the choice.

Next: 01-sinusoidal.md.