05 — Full attention computation: length-4 sequence with numerical values¶
We work through complete single-head attention for a length-4 sequence with real numbers. Q, K, V come from a §A13 embedding. Each step (Q K^T, scaling, mask, softmax, multiplication by V) is written out with concrete values. Then we compute ∂L/∂Q by the chain rule. This turns §02 (the general derivation) into something numerical that fits on a single sheet.
Anchors: LYNX_CORTEX.md §4 / PHASE 15; theory §02 scaled dot-product; lab §00 attention by hand; Phase 27 (Flash attention will reuse this exact arithmetic).
Setup — concrete tensors¶
Sequence: ["I", "will", "work", "."]. Embedded to d_model = 4. We'll use a single-head, single-layer attention with d_k = d_v = 4.
import numpy as np
np.set_printoptions(precision=3, suppress=True)
X = np.array([
    [ 0.50,  0.30, -0.20,  0.10],  # "I"
    [-0.10,  0.40,  0.20, -0.30],  # "will"
    [ 0.20, -0.10,  0.50,  0.10],  # "work"
    [ 0.00,  0.00,  0.00,  0.20],  # "."
])
# Tiny W_Q, W_K, W_V (identity-ish for clarity)
W_Q = np.eye(4) * 0.5
W_K = np.eye(4) * 0.5
W_V = np.eye(4) * 1.0
Compute Q = X W_Q, K = X W_K, V = X W_V.
With these weights, Q = K = 0.5 · X and V = X. (Choosing scaled-identity weights makes the arithmetic legible.)
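The projections can be checked in a few lines of NumPy. This is a minimal sketch that reuses X, W_Q, W_K, W_V from the setup; the name `scores` is introduced here for illustration and is not from the original.

```python
import numpy as np

X = np.array([
    [ 0.50,  0.30, -0.20,  0.10],  # "I"
    [-0.10,  0.40,  0.20, -0.30],  # "will"
    [ 0.20, -0.10,  0.50,  0.10],  # "work"
    [ 0.00,  0.00,  0.00,  0.20],  # "."
])
W_Q = np.eye(4) * 0.5
W_K = np.eye(4) * 0.5
W_V = np.eye(4) * 1.0

# Project the embeddings into queries, keys, and values.
Q = X @ W_Q
K = X @ W_K
V = X @ W_V

# With scaled-identity weights the projections are scaled copies of X.
print(np.allclose(Q, 0.5 * X), np.allclose(K, 0.5 * X), np.allclose(V, X))

# Raw scores (illustrative name): Q K^T equals 0.25 * X X^T for these weights.
scores = Q @ K.T
print(np.round(X @ X.T, 2))
print(np.round(scores, 4))
```

Printing X X.T alongside the scores makes it easy to compare against the hand-computed tables in Step 1.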
Step 1 — Q K^T¶
Because Q = K = 0.5 · X, Q K^T = 0.25 · X X.T. Start with X X.T (a 4×4 matrix of dot products):
        I       will    work    .
I    |  0.39    0.00   -0.02    0.02 |
will |  0.00    0.30    0.01   -0.06 |
work | -0.02    0.01    0.31    0.02 |
.    |  0.02   -0.06    0.02    0.04 |
(Sanity checks: X[0] · X[0] = 0.25 + 0.09 + 0.04 + 0.01 = 0.39 ✓; X[0] · X[1] = -0.05 + 0.12 - 0.04 - 0.03 = 0.00 ✓.)
Q K^T = 0.25 · X X.T:
        I       will    work    .
I    |  0.098   0.000  -0.005   0.005 |
will |  0.000   0.075   0.003  -0.015 |
work | -0.005   0.003   0.078   0.005 |
.    |  0.005  -0.015   0.005   0.010 |
Step 2 — Scale by sqrt(d_k) = 2¶
Divide every entry of Q K^T by sqrt(d_k) = sqrt(4) = 2:
        I       will    work    .
I    |  0.049   0.000  -0.003   0.003 |
will |  0.000   0.038   0.001  -0.008 |
work | -0.003   0.001   0.039   0.003 |
.    |  0.003  -0.008   0.003   0.005 |
(Entries are rounded to three decimals from the unrounded values, so a displayed entry is not always exactly half of the one above.)
These are the pre-softmax logits. Because we used identity-ish weights and small embeddings, they're all near 0. (Phase 17's real model will have larger spread.)
Step 3 — Causal mask¶
Set every entry above the diagonal (the future positions) to -∞:
        I       will    work    .
I    |  0.049   -inf    -inf    -inf  |
will |  0.000   0.038   -inf    -inf  |
work | -0.003   0.001   0.039   -inf  |
.    |  0.003  -0.008   0.003   0.005 |
This makes the softmax in step 4 yield 0 for masked positions.
Step 4 — Row-wise softmax¶
For each row, subtract the row max (for stability), exponentiate, normalize.
Row 0 ("I"): Only one unmasked entry, 0.049. Softmax = [1, 0, 0, 0].
Row 1 ("will"):
logits = [0.000, 0.038, -inf, -inf]
shifted = logits - 0.038 = [-0.038, 0, -inf, -inf]
exp = [exp(-0.038), 1, 0, 0] = [0.963, 1.0, 0, 0]
sum = 1.963
softmax = [0.491, 0.509, 0, 0]
Row 2 ("work"):
logits = [-0.003, 0.001, 0.039, -inf]
shifted = logits - 0.039 = [-0.042, -0.038, 0, -inf]
exp = [0.959, 0.963, 1.0, 0]
sum = 2.922
softmax = [0.328, 0.330, 0.342, 0]
Row 3 ("."):
logits = [0.003, -0.008, 0.003, 0.005]
shifted = logits - 0.005 = [-0.002, -0.013, -0.002, 0]
exp = [0.998, 0.987, 0.998, 1.0]
sum = 3.983
softmax = [0.250, 0.248, 0.250, 0.251]
(Approximately uniform, because the differences in pre-softmax logits are too small to make the distribution sharp. Phase 15 §02 derives why: at init, attention is roughly uniform; learning sharpens it.)
Attention matrix A:
        I       will    work    .
I    |  1.000   0.000   0.000   0.000 |
will |  0.491   0.509   0.000   0.000 |
work |  0.328   0.330   0.342   0.000 |
.    |  0.250   0.248   0.250   0.251 |
Step 5 — A @ V¶
V = X. Compute Y = A V:
Row 0 ("I"): Y[0] = 1.0 · X[0] = X[0] = [0.50, 0.30, -0.20, 0.10]. Just itself.
Row 1 ("will"): weighted sum of X[0] and X[1]:
Y[1] = 0.491·X[0] + 0.509·X[1]
     = 0.491·[0.5, 0.3, -0.2, 0.1] + 0.509·[-0.1, 0.4, 0.2, -0.3]
     = [0.2455 - 0.0509, 0.1473 + 0.2036, -0.0982 + 0.1018, 0.0491 - 0.1527]
     ≈ [0.195, 0.351, 0.004, -0.104]
(This uses the rounded weights; at full precision the first entry is 0.194.)
Row 2 ("work"):
Y[2] ≈ 0.328·X[0] + 0.330·X[1] + 0.342·X[2]
     ≈ [0.164, 0.098, -0.066, 0.033] + [-0.033, 0.132, 0.066, -0.099] + [0.068, -0.034, 0.171, 0.034]
     ≈ [0.199, 0.196, 0.171, -0.032]
Row 3 ("."): approximate average of all four X rows.
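Steps 1 to 5 can be reproduced end to end with a short NumPy sketch. It assumes the setup above (Q = K = 0.5 · X, V = X); the variable names `future`, `S_masked`, `shifted`, and `expd` are illustrative choices, not names from the original. Because the code keeps full precision, its last printed digit can differ from hand arithmetic done with rounded weights.

```python
import numpy as np

X = np.array([
    [ 0.50,  0.30, -0.20,  0.10],  # "I"
    [-0.10,  0.40,  0.20, -0.30],  # "will"
    [ 0.20, -0.10,  0.50,  0.10],  # "work"
    [ 0.00,  0.00,  0.00,  0.20],  # "."
])
Q, K, V = 0.5 * X, 0.5 * X, X          # W_Q = W_K = 0.5 * I, W_V = I
d_k = Q.shape[-1]

# Steps 1-2: raw scores, then divide by sqrt(d_k).
S = Q @ K.T / np.sqrt(d_k)

# Step 3: causal mask, True strictly above the diagonal.
future = np.triu(np.ones_like(S, dtype=bool), k=1)
S_masked = np.where(future, -np.inf, S)

# Step 4: row-wise softmax with max-subtraction; exp(-inf) evaluates to 0.
shifted = S_masked - S_masked.max(axis=-1, keepdims=True)
expd = np.exp(shifted)
A = expd / expd.sum(axis=-1, keepdims=True)

# Step 5: weighted sum of value rows.
Y = A @ V

np.set_printoptions(precision=3, suppress=True)
print(A)
print(Y)
```

The diagonal is never masked, so every row has a finite maximum and the max-subtraction never produces `-inf - (-inf)`.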
Step 6 — Output projection and loss¶
Suppose the goal is to predict the next token (LM). The LM head gives logits over vocab; loss is cross-entropy. We won't expand it here — Phase 17 does — but assume ∂L/∂Y[3] = g_3 (the gradient at the last position).
Backward gradient — through softmax¶
We want ∂L/∂Q. The full chain is:
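- Y = A V, so ∂L/∂A = (∂L/∂Y) V^T and ∂L/∂V = A^T (∂L/∂Y).
- A = softmax(S) (row-wise). The Jacobian of softmax for row i is diag(A_i) - A_i A_i^T, so ∂L/∂S_i = A_i ⊙ (∂L/∂A_i - (∂L/∂A_i · A_i)), where the dot product is a scalar subtracted from every entry of the row. In matrix form, ∂L/∂S = A ⊙ (∂L/∂A - rowsum(∂L/∂A ⊙ A)), the standard softmax backward formula.
- S = (Q K^T) / sqrt(d_k), so ∂L/∂Q = (∂L/∂S) K / sqrt(d_k) and ∂L/∂K = (∂L/∂S)^T Q / sqrt(d_k).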
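To make the softmax step concrete, the sketch below compares the compact row-wise formula against the explicit per-row Jacobian diag(A_i) - A_i A_i^T. The inputs are random test matrices (an assumption for illustration, not values from this example), and `softmax_rows` and `softmax_backward_jacobian` are hypothetical helper names; `softmax_backward` implements the compact form.

```python
import numpy as np

def softmax_rows(S):
    """Row-wise softmax with max-subtraction for stability."""
    shifted = S - S.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def softmax_backward(A, dA):
    """Compact form: dS = A * (dA - rowsum(dA * A))."""
    return A * (dA - (dA * A).sum(axis=-1, keepdims=True))

def softmax_backward_jacobian(A, dA):
    """Reference: apply diag(A_i) - A_i A_i^T to each row of dA."""
    dS = np.empty_like(A)
    for i in range(A.shape[0]):
        J = np.diag(A[i]) - np.outer(A[i], A[i])
        dS[i] = J @ dA[i]
    return dS

# Random inputs used only to exercise the two implementations.
rng = np.random.default_rng(0)
S = rng.normal(size=(4, 4))
dA = rng.normal(size=(4, 4))
A = softmax_rows(S)

dS_fast = softmax_backward(A, dA)
dS_ref = softmax_backward_jacobian(A, dA)
print(np.allclose(dS_fast, dS_ref))
```

The compact form avoids building a (T, T) Jacobian per row, which is why it is the one usually implemented.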
The headline shape arithmetic for our T = 4, d_k = 4 case:
∂L/∂Y : (4, 4) [from cross-entropy]
∂L/∂A : (4, 4) = (∂L/∂Y) V.T = (4,4) @ (4,4)
∂L/∂V : (4, 4) = A.T @ (∂L/∂Y) = (4,4) @ (4,4)
∂L/∂S : (4, 4) = softmax_backward(A, ∂L/∂A)
∂L/∂Q : (4, 4) = (∂L/∂S) @ K / sqrt(d_k)
∂L/∂K : (4, 4) = (∂L/∂S).T @ Q / sqrt(d_k)
Critical: the /sqrt(d_k) survives backward — it appears in ∂L/∂Q and ∂L/∂K. If you forget to scale in forward, you lose this factor in backward too — Phase 15's /break exposes this.
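The backward chain can be checked numerically. The sketch below assumes a stand-in upstream gradient `dY` drawn at random (the real one would come from the cross-entropy loss, which is not expanded here) and defines a surrogate loss L = sum(Y ⊙ dY) so that ∂L/∂Y = dY. It then compares the analytic ∂L/∂Q against central finite differences. The names `forward`, `loss`, and `dQ_num` are illustrative.

```python
import numpy as np

X = np.array([
    [ 0.50,  0.30, -0.20,  0.10],
    [-0.10,  0.40,  0.20, -0.30],
    [ 0.20, -0.10,  0.50,  0.10],
    [ 0.00,  0.00,  0.00,  0.20],
])
Q, K, V = 0.5 * X, 0.5 * X, X
T, d_k = Q.shape
future = np.triu(np.ones((T, T), dtype=bool), k=1)

def forward(Q_, K_, V_):
    """Causal single-head attention; returns (A, Y)."""
    S = Q_ @ K_.T / np.sqrt(d_k)
    S = np.where(future, -np.inf, S)
    E = np.exp(S - S.max(axis=-1, keepdims=True))
    A = E / E.sum(axis=-1, keepdims=True)
    return A, A @ V_

# Stand-in for dL/dY (assumption: random, not a real cross-entropy gradient).
rng = np.random.default_rng(0)
dY = rng.normal(size=(T, V.shape[1]))

# Analytic backward pass, following the chain in the text.
A, Y = forward(Q, K, V)
dA = dY @ V.T
dV = A.T @ dY
dS = A * (dA - (dA * A).sum(axis=-1, keepdims=True))
dQ = dS @ K / np.sqrt(d_k)
dK = dS.T @ Q / np.sqrt(d_k)

def loss(Q_):
    """Surrogate scalar loss whose gradient with respect to Y is dY."""
    return np.sum(forward(Q_, K, V)[1] * dY)

# Central finite differences over every entry of Q.
eps = 1e-6
dQ_num = np.zeros_like(Q)
for i in range(T):
    for j in range(d_k):
        step = np.zeros_like(Q)
        step[i, j] = eps
        dQ_num[i, j] = (loss(Q + step) - loss(Q - step)) / (2 * eps)

print(np.abs(dQ - dQ_num).max())
```

Row 0 of ∂L/∂Q comes out as zero: with a single allowed key, the softmax output is fixed at 1 regardless of Q[0]. Dropping the `/ np.sqrt(d_k)` from the `dQ` line while keeping it in `forward` makes the check fail by exactly that factor.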
What the numerical example shows¶
- At init, attention is approximately uniform over the allowed positions. Row 1 is about [0.49, 0.51], row 2 about [0.33, 0.33, 0.34], and row 3 about 0.25 everywhere. The "look at me" pattern doesn't exist yet.
- The causal mask zeros future contributions. Row 0 attends only to itself; row 1 to positions 0–1; etc.
- The attention pattern is set by W_Q and W_K. At init the logits sit near 0, so attention is near-uniform; training will sharpen it into the patterns we see in pre-trained models (Phase 17's lab visualizes them).
- Softmax preserves the (T, T) shape of the scores. The shape changes only at the A V step, which contracts the T (key) axis and yields (T, d_v).
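As a rough illustration of how W_Q and W_K control sharpness, the sketch below rescales both weight matrices by a hypothetical factor `w_scale` and prints the last row of A. This is not training, only a hand-picked scale, and `causal_attention` and `last_rows` are names introduced for this example.

```python
import numpy as np

X = np.array([
    [ 0.50,  0.30, -0.20,  0.10],
    [-0.10,  0.40,  0.20, -0.30],
    [ 0.20, -0.10,  0.50,  0.10],
    [ 0.00,  0.00,  0.00,  0.20],
])

def causal_attention(X, w_scale):
    """Causal attention weights with W_Q = W_K = w_scale * I (illustrative)."""
    T, d_k = X.shape
    Q = X @ (np.eye(d_k) * w_scale)
    K = X @ (np.eye(d_k) * w_scale)
    S = Q @ K.T / np.sqrt(d_k)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    S = np.where(future, -np.inf, S)
    E = np.exp(S - S.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

np.set_printoptions(precision=3, suppress=True)
last_rows = {}
for w_scale in (0.5, 2.0, 8.0):
    # Row 3 (".") sees all four positions, so it shows the effect most clearly.
    last_rows[w_scale] = causal_attention(X, w_scale)[3]
    print(w_scale, last_rows[w_scale])
```

Larger weights stretch the logits, so the same softmax produces a more peaked row.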
Compute budget for length-4 attention¶
Q K^T             : 4 · 4 · 4 = 64 muls
scale             : 16 ops
mask              : 6 sets to -inf
softmax (per row) : 4 exp + 4 add + 4 div = ~12 ops per row × 4 = ~48 (plus the max-subtraction)
A V               : 4 · 4 · 4 = 64 muls
Total             : 128 matmul muls + ~70 small ops, of which 16 are exponentials
For length 32 (a realistic §A13 sentence): T² · d_k = 1024 · 4 = 4096 muls per matrix product, manageable on CPU. For length 2048 (a small story) with d_k = 64: 2048² · 64 ≈ 268M muls per matrix product, which needs the Phase 23 GPU. This is where Phase 22's KV cache and Phase 27's Flash attention become critical.
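The scaling above can be tabulated with a small helper. `attention_mults` is a hypothetical function written for this example; it counts only multiplications in the two matrix products plus the per-entry scale and mask operations, and ignores additions and the softmax.

```python
def attention_mults(T, d_k, d_v=None):
    """Rough operation counts for one causal single-head attention pass."""
    d_v = d_k if d_v is None else d_v
    return {
        "QK^T": T * T * d_k,         # one dot product of length d_k per (query, key) pair
        "scale": T * T,              # one division per score
        "mask": T * (T - 1) // 2,    # entries strictly above the diagonal
        "AV": T * T * d_v,           # one weighted sum over T keys per output entry
    }

for T, d in [(4, 4), (32, 4), (2048, 64)]:
    print(T, d, attention_mults(T, d))
```

The two matrix products dominate and both grow as T², which is the quadratic cost referred to above.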
Citations¶
- Vaswani, A. et al. 2017. "Attention is All You Need." arXiv:1706.03762. Section 3.2 derives attention; §3.2.1 explains the sqrt(d_k) scaling.
- The numerical worked-out example follows the pattern in Sasha Rush's "The Annotated Transformer" (https://nlp.seas.harvard.edu/2018/04/03/attention.html) but is independently computed for §A13.
One-paragraph recap¶
A length-4, single-head attention pass over §A13 embeddings computes Q K^T / sqrt(d_k), applies a causal mask, takes row-wise softmax to yield A, then Y = A V. With identity-ish init the attention matrix is approximately uniform across allowed positions: row 0 = self; row 1 = 50/50; row 3 = ~uniform over 4. The backward gradient ∂L/∂Q = (∂L/∂S) K / sqrt(d_k) preserves the scaling factor — forgetting it in forward also breaks backward. Compute cost is O(T² d) for the attention block; at T = 4 it's a few hundred ops, at T = 2048 it's the bottleneck Phase 22 and Phase 27 address.
Prev: 04-masking.md
Next: Phase 16 (positional encodings).