05 — Full attention computation: length-4 sequence with numerical values

We work through full single-head attention for a length-4 sequence with real numbers. Q, K, V come from a §A13 embedding. Every step (QK^T, scaling, mask, softmax, multiplication by V) is written out with concrete values. We then compute ∂L/∂Q with the chain rule. This turns section §02 (the general derivation) into something numerical that fits on a single sheet.

Anchors: LYNX_CORTEX.md §4 / PHASE 15; theory §02 scaled dot-product; lab §00 attention by hand; Phase 27 (Flash attention will reuse this exact arithmetic).

Setup — concrete tensors

Sequence: ["I", "will", "work", "."]. Embedded to d_model = 4. We'll use a single-head, single-layer attention with d_k = d_v = 4.

import numpy as np
np.set_printoptions(precision=3, suppress=True)

X = np.array([
    [ 0.50,  0.30, -0.20,  0.10],   # "I"
    [-0.10,  0.40,  0.20, -0.30],   # "will"
    [ 0.20, -0.10,  0.50,  0.10],   # "work"
    [ 0.00,  0.00,  0.00,  0.20],   # "."
])

# Tiny W_Q, W_K, W_V (identity-ish for clarity)
W_Q = np.eye(4) * 0.5
W_K = np.eye(4) * 0.5
W_V = np.eye(4) * 1.0

Compute Q, K, V:

Q = X @ W_Q = 0.5 · X
K = X @ W_K = 0.5 · X
V = X @ W_V = X

So Q = K = 0.5 · X, V = X. (Choosing identity weights makes the arithmetic legible.)
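The projection step can be checked directly. The snippet below is a minimal sketch that repeats the setup tensors so it runs on its own, then confirms that scaled-identity weights only rescale X.

```python
import numpy as np

# Same X and weights as in the setup above.
X = np.array([
    [ 0.50,  0.30, -0.20,  0.10],   # "I"
    [-0.10,  0.40,  0.20, -0.30],   # "will"
    [ 0.20, -0.10,  0.50,  0.10],   # "work"
    [ 0.00,  0.00,  0.00,  0.20],   # "."
])
W_Q = np.eye(4) * 0.5
W_K = np.eye(4) * 0.5
W_V = np.eye(4) * 1.0

# Linear projections: each row of X is mapped to a query, key and value.
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# With scaled-identity weights the projections are rescaled copies of X.
print(np.allclose(Q, 0.5 * X), np.allclose(K, 0.5 * X), np.allclose(V, X))
```

With real (non-diagonal) weights the three projections would differ from each other; the identity choice here only keeps the hand arithmetic short.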

Step 1 — `Q K^T`

Q @ K.T = (0.5 · X)(0.5 · X).T = 0.25 · X X.T

Compute X X.T (a 4×4 matrix of dot products):

       I       will    work     .
I   |  0.39    0.00   -0.02    0.02  |
will|  0.00    0.30    0.01   -0.06  |
work| -0.02    0.01    0.31    0.02  |
 .  |  0.02   -0.06    0.02    0.04  |

(Sanity checks: X[0] · X[0] = 0.25 + 0.09 + 0.04 + 0.01 = 0.39 ✓; X[0] · X[1] = -0.05 + 0.12 - 0.04 - 0.03 = 0.00 ✓.)

Q K^T = 0.25 · X X.T:

       I        will     work      .
I   |  0.0975   0.0000  -0.0050   0.0050  |
will|  0.0000   0.0750   0.0025  -0.0150  |
work| -0.0050   0.0025   0.0775   0.0050  |
 .  |  0.0050  -0.0150   0.0050   0.0100  |
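Hand-computed dot products are easy to get wrong, so it helps to recompute both matrices from X. The sketch below assumes the same X as the setup; `G` is just an illustrative name for X X.T.

```python
import numpy as np

X = np.array([
    [ 0.50,  0.30, -0.20,  0.10],   # "I"
    [-0.10,  0.40,  0.20, -0.30],   # "will"
    [ 0.20, -0.10,  0.50,  0.10],   # "work"
    [ 0.00,  0.00,  0.00,  0.20],   # "."
])

G = X @ X.T          # Gram matrix: G[i, j] = X[i] · X[j]
scores = 0.25 * G    # equals Q @ K.T when Q = K = 0.5 * X

print(np.round(G, 2))
print(np.round(scores, 4))

# Cross-check against the explicit projection route.
Q, K = 0.5 * X, 0.5 * X
print(np.allclose(scores, Q @ K.T))
```

Because Q = K here, the score matrix is symmetric; with distinct W_Q and W_K it generally would not be.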

Step 2 — Scale by `sqrt(d_k) = 2`

S = (Q K^T) / sqrt(d_k) = (Q K^T) / 2

       I         will      work       .
I   |  0.04875   0.00000  -0.00250   0.00250  |
will|  0.00000   0.03750   0.00125  -0.00750  |
work| -0.00250   0.00125   0.03875   0.00250  |
 .  |  0.00250  -0.00750   0.00250   0.00500  |

These are the pre-softmax logits. Because we used identity-ish weights and small embeddings, they're all near 0. (Phase 17's real model will have larger spread.)
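As a quick check on the scaling step, the following sketch computes S from the same X and reports the largest logit magnitude. Reading `d_k` from the array shape is a convenience chosen here, not something the text prescribes.

```python
import numpy as np

X = np.array([
    [ 0.50,  0.30, -0.20,  0.10],
    [-0.10,  0.40,  0.20, -0.30],
    [ 0.20, -0.10,  0.50,  0.10],
    [ 0.00,  0.00,  0.00,  0.20],
])
Q = 0.5 * X
K = 0.5 * X
d_k = Q.shape[-1]                 # 4, so sqrt(d_k) = 2

S = (Q @ K.T) / np.sqrt(d_k)      # scaled scores, shape (T, T) = (4, 4)

print(np.round(S, 5))
print("largest |logit|:", np.abs(S).max())
```

Every logit stays below 0.05 in magnitude, which is why the softmax later in the section comes out close to uniform.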

Step 3 — Causal mask

Set the strict upper triangle (every entry above the diagonal, i.e. each future position) to -∞:

       I         will      work       .
I   |  0.04875   -inf      -inf      -inf     |
will|  0.00000   0.03750   -inf      -inf     |
work| -0.00250   0.00125   0.03875   -inf     |
 .  |  0.00250  -0.00750   0.00250   0.00500  |

This makes the softmax in step 4 yield 0 for masked positions.
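One common way to build this mask in NumPy is `np.triu` with `k=1`, which selects the entries strictly above the diagonal. The sketch below is one possible implementation; the variable name `future` is illustrative.

```python
import numpy as np

X = np.array([
    [ 0.50,  0.30, -0.20,  0.10],
    [-0.10,  0.40,  0.20, -0.30],
    [ 0.20, -0.10,  0.50,  0.10],
    [ 0.00,  0.00,  0.00,  0.20],
])
S = (0.5 * X) @ (0.5 * X).T / np.sqrt(4)

T = S.shape[0]
# True where the key index is greater than the query index (a future token).
future = np.triu(np.ones((T, T), dtype=bool), k=1)
S_masked = np.where(future, -np.inf, S)

print(S_masked)
print(np.exp(S_masked)[0])   # masked entries exponentiate to exactly 0
```

The diagonal is never masked, so every row keeps at least one finite logit and the softmax denominator cannot be zero.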

Step 4 — Row-wise softmax

For each row, subtract the row max (for stability), exponentiate, normalize.

Row 0 ("I"): Only one unmasked entry, 0.049. Softmax = [1, 0, 0, 0].

Row 1 ("will"):

logits = [0.00000, 0.03750, -inf, -inf]
shifted = logits - 0.03750 = [-0.03750, 0, -inf, -inf]
exp     = [exp(-0.03750), 1, 0, 0] = [0.9632, 1.0, 0, 0]
sum     = 1.9632
softmax = [0.491, 0.509, 0, 0]

Row 2 ("work"):

logits = [-0.00250, 0.00125, 0.03875, -inf]
shifted = logits - 0.03875 = [-0.04125, -0.03750, 0, -inf]
exp     = [0.9596, 0.9632, 1.0, 0]
sum     = 2.9228
softmax = [0.328, 0.330, 0.342, 0]

Row 3 ("."):

logits = [0.003, -0.008, 0.003, 0.005]
shifted = logits - 0.005 = [-0.002, -0.013, -0.002, 0]
exp     = [0.998, 0.987, 0.998, 1.0]
sum     = 3.983
softmax = [0.250, 0.248, 0.250, 0.251]

(Approximately uniform — the differences in pre-softmax logits are too small to make the distribution sharp. Phase 15 §02 derives why: at init, attention is roughly uniform; learning sharpens it.)

Attention matrix A:

A = | 1.000  0      0      0     |
    | 0.491  0.509  0      0     |
    | 0.328  0.330  0.342  0     |
    | 0.250  0.248  0.250  0.251 |
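The subtract-max, exponentiate, normalize recipe translates almost line for line into NumPy. The helper below, `causal_softmax`, is a hypothetical name used only for this sketch; it recomputes A from X so the rows can be compared with the hand calculation.

```python
import numpy as np

def causal_softmax(S):
    """Row-wise softmax under a causal mask, using the max-subtraction trick."""
    T = S.shape[0]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    S = np.where(future, -np.inf, S)
    # The row max is always finite because the diagonal is never masked.
    shifted = S - S.max(axis=1, keepdims=True)
    E = np.exp(shifted)                      # exp(-inf) = 0 at masked positions
    return E / E.sum(axis=1, keepdims=True)

X = np.array([
    [ 0.50,  0.30, -0.20,  0.10],
    [-0.10,  0.40,  0.20, -0.30],
    [ 0.20, -0.10,  0.50,  0.10],
    [ 0.00,  0.00,  0.00,  0.20],
])
S = (0.5 * X) @ (0.5 * X).T / np.sqrt(4)
A = causal_softmax(S)

print(np.round(A, 3))
print(A.sum(axis=1))   # each row sums to 1
```

Subtracting the row max does not change the result, since softmax is invariant to adding a constant per row; it only keeps the exponentials from overflowing when logits are large.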

Step 5 — `A @ V`

V = X. Compute Y = A V:

Row 0 ("I"): Y[0] = 1.0 · X[0] = X[0] = [0.50, 0.30, -0.20, 0.10]. Just itself.

Row 1 ("will"): weighted sum of X[0] and X[1]:

Y[1] = 0.4906·X[0] + 0.5094·X[1]   (row 1 of A, kept to four decimals)
     = 0.4906·[0.5, 0.3, -0.2, 0.1] + 0.5094·[-0.1, 0.4, 0.2, -0.3]
     = [0.2453 - 0.0509, 0.1472 + 0.2038, -0.0981 + 0.1019, 0.0491 - 0.1528]
     ≈ [0.194, 0.351, 0.004, -0.104]

Row 2 ("work"):

Y[2] ≈ 0.3283·X[0] + 0.3295·X[1] + 0.3421·X[2]
     ≈ [0.1642, 0.0985, -0.0657, 0.0328] + [-0.0330, 0.1318, 0.0659, -0.0989] + [0.0684, -0.0342, 0.1711, 0.0342]
     ≈ [0.200, 0.196, 0.171, -0.032]

Row 3 ("."): approximately the plain average of all four X rows, which is [0.150, 0.150, 0.125, 0.025].

Y[3] ≈ 0.2505·X[0] + 0.2480·X[1] + 0.2505·X[2] + 0.2511·X[3]
     ≈ [0.151, 0.149, 0.125, 0.026]
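The weighted sums can be verified in one matrix product. This sketch rebuilds A from X (same recipe as the earlier steps) and then compares the last output row with the unweighted mean of the X rows.

```python
import numpy as np

X = np.array([
    [ 0.50,  0.30, -0.20,  0.10],
    [-0.10,  0.40,  0.20, -0.30],
    [ 0.20, -0.10,  0.50,  0.10],
    [ 0.00,  0.00,  0.00,  0.20],
])
Q, K, V = 0.5 * X, 0.5 * X, X

# Forward pass: scaled scores, causal mask, row-wise softmax.
S = Q @ K.T / np.sqrt(Q.shape[1])
S = np.where(np.triu(np.ones_like(S, dtype=bool), k=1), -np.inf, S)
E = np.exp(S - S.max(axis=1, keepdims=True))
A = E / E.sum(axis=1, keepdims=True)

Y = A @ V            # each output row is a convex combination of value rows

print(np.round(Y, 3))
print(np.round(X.mean(axis=0), 3))   # compare with Y[3]
```

Y[3] differs from the plain mean only in the third decimal place, which is another way of seeing how flat the last attention row is.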

Step 6 — Output projection and loss

Suppose the goal is to predict the next token (LM). The LM head gives logits over vocab; loss is cross-entropy. We won't expand it here — Phase 17 does — but assume ∂L/∂Y[3] = g_3 (the gradient at the last position).

Backward gradient — through softmax

We want ∂L/∂Q. The full chain is:

Y = A V, so ∂L/∂A = (∂L/∂Y) V^T and ∂L/∂V = A^T (∂L/∂Y).
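A = softmax(S) (row-wise). The Jacobian of softmax per row i is diag(A_i) - A_i A_i^T. So ∂L/∂S = A ⊙ (∂L/∂A - rowsum(∂L/∂A ⊙ A)), where rowsum(·) sums each row and broadcasts the result back across that row. This is the standard softmax backward formula. Masked entries have A = 0, so their ∂L/∂S is 0 as well.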
S = (Q K^T) / sqrt(d_k), so ∂L/∂Q = (∂L/∂S) K / sqrt(d_k).

The headline shape arithmetic for our T = 4, d_k = 4 case:

∂L/∂Y : (4, 4)        [from cross-entropy]
∂L/∂A : (4, 4) = (∂L/∂Y) V.T = (4,4) @ (4,4)
∂L/∂V : (4, 4) = A.T @ (∂L/∂Y) = (4,4) @ (4,4)
∂L/∂S : (4, 4) = softmax_backward(A, ∂L/∂A)
∂L/∂Q : (4, 4) = (∂L/∂S) @ K / sqrt(d_k)
∂L/∂K : (4, 4) = (∂L/∂S).T @ Q / sqrt(d_k)

Critical: the /sqrt(d_k) survives backward — it appears in ∂L/∂Q and ∂L/∂K. If you forget to scale in forward, you lose this factor in backward too — Phase 15's /break exposes this.
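The backward chain above can be sketched in a few lines of NumPy and checked against central finite differences. Everything specific here is an assumption for illustration: the function names `forward` and `backward`, the made-up upstream gradient `dY` (nonzero only at the last position, standing in for g_3), and the surrogate loss L = sum(dY ⊙ Y), whose gradient with respect to Y is exactly dY.

```python
import numpy as np

def forward(Q, K, V):
    """Causal single-head attention; returns the weights A and the output Y."""
    T, d_k = Q.shape
    S = Q @ K.T / np.sqrt(d_k)
    S = np.where(np.triu(np.ones((T, T), dtype=bool), k=1), -np.inf, S)
    E = np.exp(S - S.max(axis=1, keepdims=True))
    A = E / E.sum(axis=1, keepdims=True)
    return A, A @ V

def backward(Q, K, V, A, dY):
    """Gradients of L with respect to Q, K, V given dY = dL/dY."""
    d_k = Q.shape[1]
    dA = dY @ V.T
    dV = A.T @ dY
    # Softmax backward: subtract the per-row sum of dA * A, then scale by A.
    dS = A * (dA - (dA * A).sum(axis=1, keepdims=True))
    dQ = dS @ K / np.sqrt(d_k)
    dK = dS.T @ Q / np.sqrt(d_k)
    return dQ, dK, dV

X = np.array([
    [ 0.50,  0.30, -0.20,  0.10],
    [-0.10,  0.40,  0.20, -0.30],
    [ 0.20, -0.10,  0.50,  0.10],
    [ 0.00,  0.00,  0.00,  0.20],
])
Q, K, V = 0.5 * X, 0.5 * X, X.copy()

# Hypothetical upstream gradient: only the last position receives a signal.
dY = np.zeros((4, 4))
dY[3] = [1.0, -0.5, 0.25, 0.0]

A, Y = forward(Q, K, V)
dQ, dK, dV = backward(Q, K, V, A, dY)

def loss(Q_):
    # Linear surrogate loss whose gradient with respect to Y is dY.
    return (dY * forward(Q_, K, V)[1]).sum()

# Central finite differences on every entry of Q.
eps = 1e-6
dQ_num = np.zeros_like(Q)
for i in range(Q.shape[0]):
    for j in range(Q.shape[1]):
        Qp, Qm = Q.copy(), Q.copy()
        Qp[i, j] += eps
        Qm[i, j] -= eps
        dQ_num[i, j] = (loss(Qp) - loss(Qm)) / (2 * eps)

print("max |analytic - numeric|:", np.abs(dQ - dQ_num).max())
```

Because only row 3 of dY is nonzero, only row 3 of dQ is nonzero: a query affects its own output row and nothing else. Dropping the `/ np.sqrt(d_k)` in `backward` while keeping it in `forward` would make the finite-difference check fail by exactly that factor.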

What the numerical example shows

At init, attention is approximately uniform over the allowed positions. Apart from row 0, which has a single allowed position and therefore weight 1, every nonzero entry of A lies in [0.24, 0.51]. The "look at me" pattern doesn't exist yet.
The causal mask zeros future contributions. Row 0 attends only to itself; row 1 to positions 0–1; etc.
For a given X, the attention pattern is set by W_Q and W_K. With random init, attention is approximately uniform; training will sharpen it into the patterns we see in pre-trained models (Phase 17's lab visualizes them).
Softmax preserves shape (T, T). The dimensions don't change inside attention — only at the A V step where we contract on the T (key) axis.
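To see how strongly W_Q and W_K control sharpness, one can rescale them by hand. This is only an illustration under an assumed setup, W_Q = W_K = c·I with a hand-picked c, and says nothing about what training would actually learn; `causal_attention_weights` is a hypothetical helper name.

```python
import numpy as np

def causal_attention_weights(X, c):
    """Attention matrix for W_Q = W_K = c * I (hand-picked scale, not trained)."""
    T, d_k = X.shape
    Q, K = c * X, c * X
    S = Q @ K.T / np.sqrt(d_k)
    S = np.where(np.triu(np.ones((T, T), dtype=bool), k=1), -np.inf, S)
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

X = np.array([
    [ 0.50,  0.30, -0.20,  0.10],
    [-0.10,  0.40,  0.20, -0.30],
    [ 0.20, -0.10,  0.50,  0.10],
    [ 0.00,  0.00,  0.00,  0.20],
])

A_flat = causal_attention_weights(X, c=0.5)    # the setting used in this section
A_sharp = causal_attention_weights(X, c=5.0)   # 10x larger projections, 100x larger logits

print(np.round(A_flat[2], 3))    # row "work": close to uniform over 3 positions
print(np.round(A_sharp[2], 3))   # row "work": mass concentrates on itself
```

The logits grow with c², so a tenfold increase in the projection scale is enough to move the "work" row from roughly one third per position to almost all weight on the diagonal.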

Compute budget for length-4 attention

Q K^T            : 4 · 4 · 4 = 64 muls
scale            : 16 ops
softmax (per row): 4 exp + 4 adds + 4 div = 12 ops per row × 4 = 48 (not counting the max subtraction)
mask             : 6 sets to -inf
A V              : 4 · 4 · 4 = 64 muls

Total            : 128 muls + ~70 small ops (16 of them exponentials, 4 per row)

For length-32 (a realistic §A13 sentence): T^2 d_k = 1024 · 4 = 4096 muls per matmul, which is manageable on CPU. For length-2048 (a small story) with d_k = 64: 2048^2 · 64 ≈ 268M muls per matmul, which needs Phase 23 GPU. This is where Phase 22's KV cache and Phase 27's Flash attention become critical.
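The scaling figures can be reproduced with a small counting helper. This is a rough sketch: `attention_matmul_muls` is a hypothetical name, and it counts only the multiplications in the two matmuls, ignoring softmax, scaling, and any savings from skipping masked entries.

```python
def attention_matmul_muls(T, d_k, d_v=None):
    """Multiplications in Q K^T and A V for one head, without exploiting the mask."""
    d_v = d_k if d_v is None else d_v
    return {"QK^T": T * T * d_k, "AV": T * T * d_v}

for T, d_k in [(4, 4), (32, 4), (2048, 64)]:
    muls = attention_matmul_muls(T, d_k)
    print(f"T={T:5d} d_k={d_k:3d}  QK^T={muls['QK^T']:>12,}  AV={muls['AV']:>12,}")
```

Doubling T quadruples both counts, while doubling d_k only doubles them, which is why sequence length is the term that dominates at scale.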

Citations

Vaswani, A. et al. 2017. "Attention is All You Need." arXiv:1706.03762. Section 3.2 derives attention; §3.2.1 explains the sqrt(d_k) scaling.
The numerical worked-out example follows the pattern in Sasha Rush's "The Annotated Transformer" (https://nlp.seas.harvard.edu/2018/04/03/attention.html) but is independently computed for §A13.

One-paragraph recap

A length-4, single-head attention pass over §A13 embeddings computes Q K^T / sqrt(d_k), applies a causal mask, takes row-wise softmax to yield A, then Y = A V. With identity-ish init the attention matrix is approximately uniform across allowed positions: row 0 = self; row 1 = 50/50; row 3 = ~uniform over 4. The backward gradient ∂L/∂Q = (∂L/∂S) K / sqrt(d_k) preserves the scaling factor — forgetting it in forward also breaks backward. Compute cost is O(T² d) for the attention block; at T = 4 it's a few hundred ops, at T = 2048 it's the bottleneck Phase 22 and Phase 27 address.

Prev: 04-masking.md Next: Phase 16 (positional encodings).