01 — Query, Key, Value: Why Three Projections¶

In short: three matrices, not one. Q says "what I'm looking for", K says "what I have", V says "what I hand over if I'm chosen". Separating the three allows (a) asymmetric attention ($i$ attending to $j$ ≠ $j$ attending to $i$) and (b) decoupling the representation that gets matched from the information that gets returned. Without that decoupling, the model could only return the same thing it uses for matching, which breaks the generality of the mechanism.

This file derives Q, K, V as projections of the input, and answers the question "why three matrices and not one or two?". This is one of the most-asked questions about attention; the answer rewards the time it takes to land.

Setup¶

We have an input sequence of $T$ tokens, each represented as a $d$-dimensional vector. Stack them as a matrix $X \in \mathbb{R}^{T \times d}$. (How $X$ was produced — token embedding + position encoding — is a Phase 16 question; here we take it as given.)

We want to produce an output sequence of $T$ vectors, where each output is a weighted combination of "value" vectors derived from $X$, with weights computed from some similarity between "query" and "key" vectors also derived from $X$.

The simplest version would be:

\[ \text{output}_i = \sum_{j=1}^{T} \text{sim}(x_i, x_j) \cdot x_j \]

This is untrainable: there are no learned parameters anywhere. The similarity and the value retrieval are both fixed at "raw input dot product". The model has no degrees of freedom.

To make it learnable, we project $X$ through three matrices:

\[ \boxed{\; Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V \;} \]

with $W_Q \in \mathbb{R}^{d \times d_k}$, $W_K \in \mathbb{R}^{d \times d_k}$, $W_V \in \mathbb{R}^{d \times d_v}$. Then

\[ \text{output}_i = \sum_{j=1}^{T} \text{sim}(Q_i, K_j) \cdot V_j \]

The three matrices give the model independent control over three things: what to ask for (via $W_Q$), what to advertise (via $W_K$), what to deliver (via $W_V$).
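The boxed projections and the weighted sum can be sketched in a few lines of NumPy. This is a minimal illustration, not the lab implementation: the sizes are arbitrary, the weights are random rather than learned, and $\text{sim}$ is assumed to be an unscaled softmax over dot products (scaling is deferred to the next file).

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, d_k, d_v = 4, 8, 5, 3          # illustrative sizes, not from the text

X = rng.normal(size=(T, d))          # one row per token
W_Q = rng.normal(size=(d, d_k))      # random stand-ins for learned weights
W_K = rng.normal(size=(d, d_k))
W_V = rng.normal(size=(d, d_v))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V  # the three projections


def softmax_rows(S):
    """Row-wise softmax with the usual max-subtraction for stability."""
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)


weights = softmax_rows(Q @ K.T)      # assumed sim(Q_i, K_j) for every (i, j)
output = weights @ V                 # output_i = sum_j weights[i, j] * V_j
print(output.shape)                  # (4, 3)
```

Each row of `weights` sums to 1, and the output width is set by $d_v$ alone.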

The dictionary-lookup analogy¶

A Python dict lookup d[key] is:

You provide a query (the key you're searching for).
The dict stores (key, value) pairs.
It returns the value whose key matches the query exactly.

Attention is the soft generalization of this:

The current position emits a query $Q_i$ — a vector encoding "what am I looking for here?".
Every position $j$ has a key $K_j$ — a vector encoding "this is what I am, for matching purposes".
Every position $j$ has a value $V_j$ — a vector encoding "this is what I will return if matched".
The dot product $Q_i \cdot K_j$ measures how well the query matches each key. Softmax normalizes to weights summing to 1. The output is the weighted sum of values.
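To make the contrast concrete, here is a small sketch that puts a hard dict lookup next to its soft counterpart. The keys, values, and query are made-up numbers chosen only for illustration.

```python
import numpy as np

# Hard lookup: exactly one key matches, and its value comes back.
table = {"he": "-s", "I": "", "you": ""}
hard = table["he"]

# Soft lookup: keys and values are vectors; every key matches to some degree.
keys = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
values = np.array([[10.0], [20.0], [30.0]])
query = np.array([4.0, 0.0])

scores = keys @ query                     # dot product of the query with each key
weights = np.exp(scores - scores.max())   # softmax numerator (stabilized)
weights /= weights.sum()                  # weights now sum to 1
soft = weights @ values                   # weighted sum of values

print(hard, weights.round(3), soft.round(2))
```

The best-matching key (row 0) receives most of the weight, but the other values still leak into the result. That leak is what makes the lookup differentiable.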

Concretely, on the canonical 8-token sequence I work, you work, he ___, position 7 (the slot to fill) needs to know:

"What is the subject pronoun in this clause?" — its query asks for the subject's grammatical features.
Position 6 (he) has a key that advertises "I am a 3rd-singular pronoun".
Position 6 has a value that delivers "agreement feature: -s suffix needed".

The query at position 7 dot-products with all keys; the key at position 6 matches best; the softmax puts most weight on $j=6$; the output at position 7 inherits the value from position 6 — "use the -s form of the verb".

The matching feature (what's in $K_6$: "3rd-singular pronoun") is different from the delivered information (what's in $V_6$: "verb agreement = -s"). That's the whole point of having $K$ and $V$ be different projections.

Why three? Four arguments, increasing in subtlety¶

Argument 1 — Asymmetric attention¶

Self-attention is directional: in general, position $i$'s attention to position $j$ does not equal position $j$'s attention to position $i$.

If we used only one projection (call it $W$, so that $Q = K = XW$), the score would be $\text{sim}(x_i^\top W, x_j^\top W) = x_i^\top W W^\top x_j$. This is symmetric in $i \leftrightarrow j$: before the softmax, position $i$ scores position $j$ exactly as highly as position $j$ scores position $i$.

With separate $W_Q, W_K$:

\[ Q_i \cdot K_j = x_i^\top W_Q W_K^\top x_j \]

The matrix $W_Q W_K^\top$ is not symmetric in general, so the $i$-to-$j$ score differs from the $j$-to-$i$ score.

This matters linguistically: the relationship between a verb and its subject is asymmetric — the verb depends on the subject's person/number (so the verb's query asks "who's my subject?"), but the subject does not depend on the verb's form. In he works, position 7 (works) attends strongly to position 6 (he) — but position 6 (he) does not need to attend to position 7 to be itself. Attention has to model that asymmetry, and it can only do so with separate Q and K.

Key point: without separate Q and K, attention scores would be symmetric. The asymmetry of language (verb→subject, modifier→modified) calls for separate projections.
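The symmetry claim is easy to check numerically. The sketch below uses random matrices with arbitrary sizes as stand-ins for learned weights and compares the pre-softmax score matrix under a shared projection against the one under separate projections.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, d_k = 5, 6, 4                        # arbitrary illustrative sizes
X = rng.normal(size=(T, d))

# One shared projection: scores = (XW)(XW)^T, symmetric by construction.
W = rng.normal(size=(d, d_k))
S_shared = (X @ W) @ (X @ W).T

# Separate projections: scores = (X W_Q)(X W_K)^T, generally not symmetric.
W_Q = rng.normal(size=(d, d_k))
W_K = rng.normal(size=(d, d_k))
S_separate = (X @ W_Q) @ (X @ W_K).T

print(np.allclose(S_shared, S_shared.T))       # True
print(np.allclose(S_separate, S_separate.T))   # False for generic weights
```

Note that this compares scores before the softmax. Row-wise normalization can make the final weights differ even when the scores are symmetric, but it cannot make $i$ score $j$ higher than $j$ scores $i$.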

Argument 2 — V decouples content matched from information returned¶

If we used $V = K$ (i.e., values are keys), every position would deliver the same vector it advertises. This is fine if the value you want is identical to your matching feature — but usually it isn't.

Verb-grammar example: in he ___, the token he needs to be matched on its grammatical role ("I am a 3rd-singular pronoun"), but the value it should deliver is the agreement signal it imposes on the verb ("you need the -s suffix"). The matching criterion (a feature of the pronoun) is different from the content delivered (an instruction to the verb).

Separate $K$ and $V$ project the same input $x_6$ into the two different spaces:

$K_6 = x_6 W_K$ encodes "3rd-singular pronoun, here I am".
$V_6 = x_6 W_V$ encodes "if you matched me, apply -s to your verb".

The model learns both projections during training. $W_K$ learns to put 3rd-singular pronouns in a region of key-space that 3rd-singular-verb-queries can find. $W_V$ learns to put "-s" in the value-space they deliver.

This decoupling is what makes attention an information-routing primitive rather than a similarity-based clustering: the model can match on one feature and deliver on a completely different one.

Without a separate V, the attention layer could only return the same features it matches on: each position would hand back its own key vector. That is useless for cross-feature routing.
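A tiny hand-built case shows the difference between $V = K$ and a separate $W_V$. The two-feature encoding and every matrix here are hypothetical, chosen so the routing is visible by eye. They are not learned weights.

```python
import numpy as np

# Hypothetical 2-feature inputs: [is_3rd_singular_pronoun, is_something_else]
X = np.array([[1.0, 0.0],     # "he"
              [0.0, 1.0]])    # some other token

W_K = np.eye(2)               # keys advertise the input features as-is
W_V = np.array([[0.0, 1.0],   # "he" delivers output feature 1 ("needs -s")
                [0.0, 0.0]])  # the other token delivers nothing
query = np.array([5.0, 0.0])  # "looking for a 3rd-singular pronoun"

K = X @ W_K
V = X @ W_V

weights = np.exp(K @ query)
weights /= weights.sum()      # softmax over the two positions

tied = weights @ K            # V = K: hands back the pronoun feature itself
routed = weights @ V          # separate V: hands back the agreement signal

print(tied.round(3), routed.round(3))
```

With tied values the output is essentially the matched key. With a separate $W_V$ the same match delivers a different feature.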

Argument 3 — Dimensionality flexibility¶

$Q$ and $K$ must have the same dimension (you need to dot-product them). $V$ can have a different dimension $d_v$. In multi-head attention (03-multi-head.md), having $d_v$ independent from $d_k$ is convenient: you can vary "how detailed the similarity computation is" independently of "how much information each position delivers".

In practice, $d_k = d_v$ is the default, but nothing in the math forces it. PyTorch's nn.MultiheadAttention, for example, exposes kdim and vdim arguments, although these set the input feature sizes of the keys and values rather than separate per-head $d_k$ and $d_v$.

Argument 4 — Learned attention patterns require expressive enough $W_Q W_K^\top$¶

Here's the deepest reason. Attention weights are $\text{softmax}(QK^\top / \sqrt{d_k})$. Pre-softmax, the scores are $Q K^\top = X W_Q W_K^\top X^\top$. The matrix $W_Q W_K^\top \in \mathbb{R}^{d \times d}$ — call it $M$ — is the bilinear form the model uses to score $(x_i, x_j)$ pairs.

If $W_Q, W_K \in \mathbb{R}^{d \times d_k}$ with $d_k = d$, then $M$ has full rank in general, and the model can express any bilinear scoring function. With $d_k < d$, $M$ has rank at most $d_k$: restricted, but often sufficient.

If you collapsed $W_Q$ and $W_K$ into a single matrix $W$, you'd get $M = W W^\top$, which is PSD (positive semi-definite). That's a major restriction: a PSD matrix is symmetric with nonnegative eigenvalues, so $x^\top M x \ge 0$ for every $x$ and no token can ever give itself a negative score. Many useful attention patterns require non-PSD $M$. Separate $W_Q, W_K$ lift this restriction.

This argument is the most rigorous; for a learner it can be hand-waved as "separate matrices = more expressive scoring".
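For readers who prefer a numerical check over the hand-wave, the sketch below verifies the PSD restriction on a random shared $W$. It then builds one deliberately extreme pair of separate projections ($W_K = -W_Q$, a hand-picked choice, not something a model would necessarily learn) whose self-scores are negative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_k = 6, 4                                # arbitrary illustrative sizes

# Shared projection: M = W W^T is symmetric PSD with rank at most d_k.
W = rng.normal(size=(d, d_k))
M_shared = W @ W.T
eig_shared = np.linalg.eigvalsh(M_shared)    # real eigenvalues, ascending

# Separate projections, hand-picked so that M = -W_Q W_Q^T (not PSD).
W_Q = rng.normal(size=(d, d_k))
W_K = -W_Q
M_separate = W_Q @ W_K.T

x = rng.normal(size=d)                       # an arbitrary token vector
self_score_shared = x @ M_shared @ x         # always >= 0
self_score_separate = x @ M_separate @ x     # negative here

print(eig_shared.round(3))
print(self_score_shared > 0, self_score_separate < 0)
```

Two of the six eigenvalues of `M_shared` are zero up to rounding, which reflects the rank-$d_k$ limit when $d_k < d$.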

Worked example with $T = 2, d = 2, d_k = 2$¶

Let

\[ X = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \]

(two tokens; their embeddings are unit vectors in different directions — think of row 0 as "subject pronoun" and row 1 as "verb stem").

Let

\[ W_Q = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \quad W_K = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \quad W_V = \begin{pmatrix} 2 & 0 \\ 0 & 3 \end{pmatrix} \]

Then

\[ Q = X W_Q = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad K = X W_K = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad V = X W_V = \begin{pmatrix} 2 & 0 \\ 0 & 3 \end{pmatrix} \]

Scores: $Q K^\top = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$. Note this is not symmetric in general — here it happens to be, because of the toy choice of $W_K$.

Without scaling: $\text{softmax}(\text{row 0}) = \text{softmax}(0, 1) = (0.27, 0.73)$.

Output row 0 = $0.27 \cdot V_0 + 0.73 \cdot V_1 = 0.27 \cdot (2, 0) + 0.73 \cdot (0, 3) = (0.54, 2.19)$.

This is the entire computation. Lab 00 has Borja do exactly this by hand on a slightly larger example, then match against NumPy.
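As a cross-check on the arithmetic above, the same toy example can be run in NumPy. This is only a verification sketch of the numbers in this section, using the unscaled softmax from the hand computation. It is not the Lab 00 solution.

```python
import numpy as np

X = np.eye(2)                                   # two one-hot tokens
W_Q = np.eye(2)
W_K = np.array([[0.0, 1.0], [1.0, 0.0]])
W_V = np.array([[2.0, 0.0], [0.0, 3.0]])

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
scores = Q @ K.T                                # [[0, 1], [1, 0]]

exp_scores = np.exp(scores)
weights = exp_scores / exp_scores.sum(axis=1, keepdims=True)  # no scaling
output = weights @ V

print(np.round(weights[0], 2))                  # [0.27 0.73]
print(np.round(output[0], 2))                   # [0.54 2.19]
```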

Thought exercise: notice how Q and K decide where row 0 looks (here, mostly at the second token) and V decides what comes back (the second token has $V_1 = (0, 3)$). Had we set $W_V = I$, row 0 would have returned $0.27 \cdot (1, 0) + 0.73 \cdot (0, 1) = (0.27, 0.73)$, a plain mixture of the raw inputs. Separating V from K is what lets the output be a transformation rather than a copy.

Connecting to the Mini-GPT scale¶

In Phase 17's Mini-GPT we'll fix $d_\text{model} = 64, n_\text{heads} = 4$. The default convention sets $d_k = d_v = d_\text{model} / n_\text{heads} = 16$ per head. Each head therefore owns three matrices:

$W_Q \in \mathbb{R}^{64 \times 16}$
$W_K \in \mathbb{R}^{64 \times 16}$
$W_V \in \mathbb{R}^{64 \times 16}$

That's $3 \cdot 64 \cdot 16 = 3072$ parameters per head, $\times 4$ heads $= 12{,}288$ parameters total for the Q/K/V projections of one attention layer. Add the output projection $W_O \in \mathbb{R}^{64 \times 64}$ (4096 params) and you have $12{,}288 + 4{,}096 = 16{,}384$ parameters per attention layer. Phase 17's parameter inventory will revisit this.

In practice all twelve per-head matrices (three projections $\times$ four heads) are concatenated and the projection is done in one matmul with a $W_{QKV} \in \mathbb{R}^{64 \times 192}$ tensor: same math, faster. We'll implement the concatenated form in Phase 15 lab 01.
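The parameter count and the fused matmul can both be checked with a short sketch. Three projections times four heads gives twelve $64 \times 16$ blocks. The column layout used here (all Q heads, then all K heads, then all V heads) is an assumption for illustration, and the lab may order the blocks differently.

```python
import numpy as np

d_model, n_heads = 64, 4
d_head = d_model // n_heads                       # 16

qkv_params = 3 * d_model * d_head * n_heads       # 12288
out_params = d_model * d_model                    # 4096
total = qkv_params + out_params                   # 16384

rng = np.random.default_rng(0)
X = rng.normal(size=(10, d_model))                # 10 tokens, arbitrary
# per_head[p, h] is the (64, 16) matrix for projection p in (Q, K, V), head h
per_head = rng.normal(size=(3, n_heads, d_model, d_head))

# Assumed layout: concatenate the twelve blocks along the output axis.
W_QKV = np.concatenate(
    [per_head[p, h] for p in range(3) for h in range(n_heads)], axis=1
)
fused = X @ W_QKV                                 # one matmul, shape (10, 192)

# Slicing the fused result recovers any single head's projection.
Q_head2 = fused[:, 2 * d_head : 3 * d_head]
print(total, W_QKV.shape, np.allclose(Q_head2, X @ per_head[0, 2]))
```

The fused form changes only how the weights are stored. Every per-head projection is still available as a column slice.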

What if you really wanted to share matrices?¶

A natural simplification, "tied weights", does exist in some architectures:

$W_Q = W_K$: turns attention into a similarity-based clustering. Limits expressiveness (as Argument 4 showed). Used in some efficient-attention papers. Not standard.
$W_K = W_V$: ties the matching feature to the delivered information. Used in memory networks (pre-transformer). Not standard in modern transformers.
$W_Q = W_K = W_V = W$: turns attention into "weighted mean of $XW$ with weights from $X W W^\top X^\top$". Mostly useless: a single matrix can only route information to where it already looks like the input.

The default in every modern transformer is three independent matrices. Phase 15 follows this default.

Where this lands in the API¶

In src/minimodel/attention/BLUEPRINT.md, the MultiHeadAttention class owns three weight matrices ($W_Q, W_K, W_V$) plus an output projection $W_O$ (covered in 03-multi-head.md). The constructor signature is:

class MultiHeadAttention:
    def __init__(self, d_model: int, n_heads: int, seed: int) -> None:
        # internally allocates W_Q, W_K, W_V (each d_model × d_model,
        # i.e. all heads concatenated) and W_O (d_model × d_model)
        ...

This API choice is locked by Phase 17 (Mini-GPT) — changing it cascades to every downstream phase. Read the blueprint carefully before implementing.
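For orientation only, here is one way a constructor with that signature could allocate its weights. The blueprint is not reproduced in this file, so the class name, the attributes, the `n_params` helper, and the $\sigma = 0.02$ Gaussian init are assumptions of this sketch rather than the locked API.

```python
import numpy as np


class MultiHeadAttentionSketch:
    """Hypothetical stand-in; the real contract lives in BLUEPRINT.md."""

    def __init__(self, d_model: int, n_heads: int, seed: int) -> None:
        if d_model % n_heads != 0:
            raise ValueError("d_model must be divisible by n_heads")
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        rng = np.random.default_rng(seed)       # seed makes init reproducible
        shape = (d_model, d_model)
        # Assumed init: Gaussian with sigma = 0.02, no bias terms.
        self.W_Q = rng.normal(0.0, 0.02, size=shape)
        self.W_K = rng.normal(0.0, 0.02, size=shape)
        self.W_V = rng.normal(0.0, 0.02, size=shape)
        self.W_O = rng.normal(0.0, 0.02, size=shape)

    def n_params(self) -> int:
        """Total number of weights across the four matrices."""
        return sum(w.size for w in (self.W_Q, self.W_K, self.W_V, self.W_O))


mha = MultiHeadAttentionSketch(d_model=64, n_heads=4, seed=0)
print(mha.d_head, mha.n_params())               # 16 16384
```

Passing the seed through a local generator keeps two instances built with the same seed bit-identical, which is convenient for the forward-only checks in the labs.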

What this file does NOT cover¶

The softmax and the scaling factor $\sqrt{d_k}$. Next file (02-scaled-dot-product.md).
Why heads instead of one big projection. 03-multi-head.md.
Causal masking. 04-masking.md.
Initialization scale of $W_Q, W_K, W_V$. Phase 18 (training). For now, the labs use small Gaussian init (e.g. $\sigma = 0.02$) for forward-only verification.
Bias terms on the projections. Several recent transformers (PaLM- and LLaMA-style) drop the biases on the Q/K/V projections, although GPT-2 itself keeps them; we follow the no-bias convention. Justification is parameter-count / negligible-difference, not derived from first principles.

Recap¶

Three matrices exist for four reasons: asymmetry, content-vs-match decoupling, dimensional flexibility, unconstrained (non-PSD) scoring.
The most pedagogically important one is #2: separate V means attention is information routing, not similarity-based clustering. On he ___, the match is on "3rd-singular pronoun" but the delivery is "apply -s to the verb": two different functions of the same input.
The toy example shows the mechanism in 6 numbers. Lab 00 expands this.
API surface lives in src/minimodel/attention/BLUEPRINT.md. The three-matrix structure is locked.

Next: 02-scaled-dot-product.md.
