04 — Masking: Causal and Padding¶
The mask is what keeps the model from "cheating" by looking at the future during training (causal mask), and what keeps it from looking at padding when the sequences in a batch have different lengths (padding mask). The key point: the mask is added as \(-\infty\) before the softmax, not multiplied by zero afterwards. Multiplying afterwards lets the gradient "leak", a classic bug that still shows up in production code.
This file derives the two standard attention masks and shows the correct way to apply them.
Why mask anything¶
By default, every position in attention attends to every other position. For some tasks that's fine. For two important cases, it's not.
Case 1 — Causal language modeling (training)¶
We train a language model to predict token \(t+1\) given tokens \(1, \ldots, t\). If position \(t\)'s attention could read from positions \(t+1, t+2, \ldots\), the model would have access to the answer during training. It would learn to copy from the future: near-perfect accuracy during training, and failure at inference, where no future tokens exist.
Solution: prevent position \(i\) from attending to positions \(j > i\). Causal mask.
Case 2 — Variable-length batched sequences (padding)¶
In Phase 18's training loop, we'll batch sequences of different lengths. Shorter sequences are padded with a <PAD> token to match the longest in the batch. We don't want any position to attend to padding, because those positions carry no information.
Solution: prevent any position from attending to padding positions. Padding mask.
Both masks combine into a single \((T \times T)\) matrix added pre-softmax.
The causal mask¶
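For a sequence of length \(T\), the causal mask \(M^{\text{causal}}\) is a \(T \times T\) matrix:
\[
M^{\text{causal}}_{ij} =
\begin{cases}
0 & \text{if } j \leq i \\
-\infty & \text{if } j > i
\end{cases}
\]
So the zero (allowed) entries of \(M^{\text{causal}}\) form a lower triangle: zeros on and below the diagonal, \(-\infty\) above.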
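Apply pre-softmax:
\[
\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}} + M^{\text{causal}}\right) V
\]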
The \(-\infty\) entries push \(e^{s + (-\infty)} = 0\), so positions \(j > i\) get zero attention weight from query \(i\). They cannot influence position \(i\)'s output.
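As a small numerical check of the statement above, the following sketch applies an additive causal mask to a random score matrix for \(T = 4\). The `softmax` helper and the random `scores` (a stand-in for \(QK^\top / \sqrt{d_k}\)) are assumptions made for illustration, not the lab's API.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    # Row-wise softmax, shifted by the row max for numerical stability.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

T = 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(T, T)).astype(np.float32)  # stand-in for Q K^T / sqrt(d_k)

# Additive causal mask: 0 where j <= i, -1e9 where j > i.
rows = np.arange(T)[:, None]
cols = np.arange(T)[None, :]
mask = np.where(cols > rows, -1e9, 0.0).astype(np.float32)

weights = softmax(scores + mask)

# Entries above the diagonal are exactly zero; each row still sums to 1.
print(np.round(weights, 3))
```

Row 0 can only see itself, so its single allowed weight is 1; later rows spread their weight over more positions.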
Why \(j \leq i\), not \(j < i\)?¶
Position \(i\) attends to positions \(0, 1, \ldots, i\). Including itself. This is essential — a token's representation must depend on itself; otherwise the layer just routes information from elsewhere and discards the token's own content.
The off-by-one is a very common bug. Verify in lab 02 with a perturbation test.
Numerical implementation¶
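Don't literally use np.inf. Use a large negative number; -1e9 is a common choice in float32. With a true \(-\infty\), any computation that subtracts two masked entries before the exponentiation yields \(-\infty - (-\infty)\), which is NaN. A max-subtracted softmax does exactly that on a row in which every entry is masked. A row like that shouldn't occur with a plain causal mask, but this is defensive coding. As long as a row has at least one unmasked entry, a finite -1e9 still underflows to exactly zero after exponentiation, so the mask has the same effect.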
import numpy as np

def causal_mask(T: int) -> np.ndarray:
    mask = np.triu(np.ones((T, T), dtype=np.float32), k=1) * -1e9
    return mask  # zeros on/below diag, -1e9 above
(np.triu with k=1 produces an upper-triangular matrix with zero below and on the diagonal, one above. Multiply by \(-10^9\) and we have the additive mask.)
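To make the NaN concern concrete, here is a minimal sketch comparing a fully masked row built from `-np.inf` with one built from `-1e9`. The max-subtracting `softmax` helper is an assumption for illustration. A fully masked row is also an assumed scenario: it cannot occur with the causal mask alone, but it can with some mask combinations, for example a left-padded sequence under a causal mask.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    # Row-wise softmax, shifted by the row max for numerical stability.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

row_inf = np.full((1, 3), -np.inf, dtype=np.float32)  # every entry masked with -inf
row_big = np.full((1, 3), -1e9, dtype=np.float32)     # every entry masked with -1e9

# (-inf) - (-inf) is NaN, so the whole row becomes NaN.
with np.errstate(invalid="ignore"):
    out_inf = softmax(row_inf)

# (-1e9) - (-1e9) is 0, so the row degrades to uniform weights instead of NaN.
out_big = softmax(row_big)

print(out_inf, out_big)
```

Uniform weights over masked positions are still not meaningful, but they keep NaN from propagating through the rest of the computation.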
The padding mask¶
Suppose your batch has sequences of length \(T_1, T_2, \ldots, T_B\), padded to \(T_{\max}\). For sequence \(b\), positions \(T_b, T_b + 1, \ldots, T_{\max} - 1\) are padding.
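For each sequence \(b\) in the batch, define the padding mask:
\[
M^{\text{pad},(b)}_{ij} =
\begin{cases}
0 & \text{if } j < T_b \\
-\infty & \text{if } j \geq T_b
\end{cases}
\]
Combine with causal:
\[
M^{(b)} = M^{\text{causal}} + M^{\text{pad},(b)}
\]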
(Addition of two additive masks: a position is masked if either mask wants to mask it.)
Phase 15 doesn't use batches — we attend to single sequences — so we don't implement the padding mask here. It's documented for completeness. Phase 18's training loop will add it.
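For readers who want to see the shape of it ahead of Phase 18, here is a rough sketch of the padding mask and its combination with the causal mask. The function name `padding_mask`, the `lengths` argument, and the `(B, T, T)` layout are assumptions for illustration. Phase 18's actual implementation may differ.

```python
import numpy as np

NEG = -1e9  # finite stand-in for -inf

def causal_mask(T: int) -> np.ndarray:
    # Zeros on/below the diagonal, NEG above.
    return np.triu(np.ones((T, T), dtype=np.float32), k=1) * NEG

def padding_mask(lengths, T_max: int) -> np.ndarray:
    # Hypothetical helper: additive key-padding mask of shape (B, 1, T_max).
    # Column j of sequence b is masked when j >= lengths[b].
    lengths = np.asarray(lengths)
    is_pad = np.arange(T_max)[None, :] >= lengths[:, None]  # (B, T_max)
    return np.where(is_pad, NEG, 0.0).astype(np.float32)[:, None, :]

lengths = [4, 2]  # two sequences, the second padded from 2 to 4
T_max = 4

# Broadcasting (1, T, T) + (B, 1, T) gives one combined mask per sequence.
combined = causal_mask(T_max)[None, :, :] + padding_mask(lengths, T_max)
print(combined.shape)  # (2, 4, 4)
```

Entries masked by both masks end up at `2 * NEG`, which is harmless: anything that negative still exponentiates to zero.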
The critical mistake: multiplicative masking after softmax¶
A naive implementation might compute attention without a mask, then multiply by a 0/1 mask afterwards:
# WRONG
attn = softmax(scores) # full attention, all positions
attn = attn * mask_01 # zero out forbidden positions
out = attn @ V
This is broken in two ways:
- The remaining attention weights don't sum to 1. After zeroing, the rows of attn sum to something \(< 1\). The output is the weighted sum with a "missing mass". The model's output is implicitly scaled down for positions that have many forbidden neighbors. Numerically not catastrophic, semantically wrong.
- Gradient leak. Even after zeroing the attention weights, the gradient \(\partial L / \partial \text{scores}_{ij}\) for forbidden positions is not zero. The softmax sees those positions during forward; their scores affect the normalization of the unmasked positions. Information from forbidden positions leaks through the normalization. For a causal LM, this is information about future tokens leaking back to past predictions: the exact failure we wanted to prevent.
The mathematically correct mask is additive \(-\infty\) pre-softmax. After the softmax, the forbidden entries are exactly zero, and they contribute zero gradient (because \(e^{-\infty} = 0\) has zero derivative w.r.t. anything).
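Both failure modes can be observed with a tiny finite-difference experiment. The sketch below uses hand-picked `scores` and `V` (assumed values, chosen so the arithmetic is easy to follow) and bumps a single forbidden score, query 0 looking at key 3. If the output for query 0 moves, that score has a nonzero gradient.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    # Row-wise softmax, shifted by the row max for numerical stability.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

T = 4
scores = np.zeros((T, T))                      # uniform attention before masking
V = np.arange(T * 3, dtype=float).reshape(T, 3) + 1.0

allowed = np.tril(np.ones((T, T)))             # 1 where j <= i, 0 where j > i
additive = (1.0 - allowed) * -1e9              # 0 where allowed, -1e9 where forbidden

def out_multiplicative(s):
    # The WRONG variant: softmax first, zero out afterwards.
    return (softmax(s) * allowed) @ V

def out_additive(s):
    # The correct variant: add the mask before the softmax.
    return softmax(s + additive) @ V

bumped = scores.copy()
bumped[0, 3] += 5.0                            # perturb a forbidden (future) score

leak_mult = np.abs(out_multiplicative(bumped)[0] - out_multiplicative(scores)[0]).max()
leak_add = np.abs(out_additive(bumped)[0] - out_additive(scores)[0]).max()
row_sums = (softmax(scores) * allowed).sum(axis=-1)

print(row_sums)    # [0.25 0.5 0.75 1.0]: rows no longer sum to 1
print(leak_mult)   # roughly 0.73: the future score changed query 0's output
print(leak_add)    # 0.0: no dependence on the forbidden score
```

With the multiplicative variant, the bump changes the softmax denominator and therefore the surviving weight on key 0. With the additive variant, the forbidden score never reaches the normalization.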
Rule: masks are always additive, pre-softmax, with \(-\infty\) (or \(-10^9\)). Never multiplicative post-softmax. This bug shows up in real code with worrying frequency.
Lab 02 — perturbation test for causal mask¶
The way to prove your causal mask works:
- Generate a random input \(X\) of length \(T\).
- Compute attention output \(Y = \text{Attention}(X)\).
- Make \(X' = X\) but change position \(T - 1\) (the last position) to something completely different.
- Compute \(Y' = \text{Attention}(X')\).
- Verify \(Y[0:T-1] = Y'[0:T-1]\) to within numerical precision. (The last position's change must not propagate back to earlier outputs.)
If positions before \(T - 1\) in \(Y\) differ from \(Y'\), your mask is wrong. Off-by-one is the typical culprit.
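The steps above can be sketched as follows. This is a minimal single-head self-attention with random projection matrices. The names `attention`, `Wq`, `Wk`, `Wv` and the chosen sizes are assumptions for illustration, not the lab's API. The sketch also runs the same check against a deliberately off-by-one mask to show what a failure looks like.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    # Row-wise softmax, shifted by the row max for numerical stability.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def causal_mask(T: int) -> np.ndarray:
    return np.triu(np.ones((T, T), dtype=np.float32), k=1) * -1e9

def attention(X, Wq, Wk, Wv, mask):
    # Hypothetical single-head self-attention used only for this test.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores + mask) @ V

rng = np.random.default_rng(0)
T, d = 6, 8
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

X_perturbed = X.copy()
X_perturbed[T - 1] = rng.normal(size=d)        # change only the last position

Y = attention(X, Wq, Wk, Wv, causal_mask(T))
Y_perturbed = attention(X_perturbed, Wq, Wk, Wv, causal_mask(T))

# Earlier outputs must be unaffected; the last output is expected to change.
assert np.allclose(Y[: T - 1], Y_perturbed[: T - 1])
assert not np.allclose(Y[T - 1], Y_perturbed[T - 1])

# Off-by-one mask (k=2 lets position i see i+1): the same check should now fail.
leaky = np.triu(np.ones((T, T)), k=2) * -1e9
Y_leaky = attention(X, Wq, Wk, Wv, leaky)
Y_leaky_perturbed = attention(X_perturbed, Wq, Wk, Wv, leaky)
leak_detected = not np.allclose(Y_leaky[: T - 1], Y_leaky_perturbed[: T - 1])
print(leak_detected)
```

With the leaky mask, only position \(T - 2\) can see the perturbed token, so that is the row where the outputs diverge.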
Special case: bidirectional attention (BERT-style)¶
For some models — encoders, BERT — every position attends to every position. No causal mask. We only mask padding.
The Mini-GPT in Phase 17 is decoder-only and causal. We use the causal mask. Bidirectional attention is documented here for vocabulary; we don't build it.
Sliding-window attention (one paragraph)¶
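Some modern models replace the full causal mask with a windowed one: position \(i\) attends only to the \(w\) most recent positions \(\max(0, i - w + 1), \ldots, i\) for some window size \(w\). Mistral 7B uses this causal form; Longformer uses a symmetric window around each position. When the implementation skips the masked entries instead of materializing them, this drops complexity from \(O(T^2)\) to \(O(T w)\).
The mask construction is the same (additive, pre-softmax), just with more entries masked out: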
def windowed_causal_mask(T: int, window: int) -> np.ndarray:
    mask = np.full((T, T), -1e9, dtype=np.float32)
    for i in range(T):
        mask[i, max(0, i - window + 1):i + 1] = 0.0
    return mask
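As a quick sanity check on the loop above, here is a vectorized variant that should produce the same mask, along with a count of allowed positions per row. The name `windowed_causal_mask_vec` is hypothetical; it is one possible formulation, not a required one.

```python
import numpy as np

def windowed_causal_mask_vec(T: int, window: int) -> np.ndarray:
    # Hypothetical vectorized equivalent of the loop version.
    i = np.arange(T)[:, None]                  # query index
    j = np.arange(T)[None, :]                  # key index
    allowed = (j <= i) & (j > i - window)      # the `window` most recent positions, self included
    return np.where(allowed, 0.0, -1e9).astype(np.float32)

m = windowed_causal_mask_vec(5, 2)
allowed_per_row = (m == 0).sum(axis=-1)
print(allowed_per_row)  # [1 2 2 2 2]
```

Once `window >= T`, the window covers the whole prefix and the result coincides with the plain causal mask.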
Not used in Phase 15. Mentioned because the term will come up.
What this file does NOT cover¶
- Padding mask implementation. Phase 18 (training, where batches arrive).
- Sliding-window / local attention. Phase 27 or out of scope. Sketched only.
- Block-sparse masks (BigBird, etc.). Out of scope.
- Mask construction for KV-cache inference. Phase 22+. The KV-cache changes the mask shape.
- Bidirectional / BERT-style attention. Out of scope. Decoder-only curriculum.
Recap¶
- Causal mask: additive lower-triangular zeros, \(-\infty\) above the diagonal. Position \(i\) attends to \(0, \ldots, i\) inclusive.
- Padding mask: \(-\infty\) on padding positions. Added to the causal mask.
- Always additive, always pre-softmax. Multiplicative post-softmax leaks gradient.
- Lab 02 verifies with a perturbation test.
- Phase 15 implements only the causal mask. Padding waits for Phase 18.
You've now read all five Phase 15 theory files. Before opening the lab:
- Write the full attention equation from memory.
- Reproduce the variance argument for \(\sqrt{d_k}\).
- Draw the API surface from BLUEPRINT.md.
- Sketch the causal mask matrix for \(T = 4\).
If any of these feel wobbly, re-read the relevant file.
Next: end of theory. Proceed to ../lab/00-attention-by-hand.md.