Break 00 — Forget to apply the causal mask in mini-GPT's attention¶
We remove the causal mask from the attention block. The model now sees the future during training: to predict token t, it looks at tokens t+1, t+2, .... Prediction: the loss falls to almost 0 in 5 steps. Validation cross-entropy also drops (the val split has the same leak). At token-by-token inference (with no future available), the model is lost: it predicts garbage or collapses to the BoS token. This is one of the most recurrent transformer bugs in the real world.
Anchors: LYNX_CORTEX.md §4 / PHASE 17; theory §01 transformer block; Phase 15 §04 masking; .claude/commands/break.md.
The break¶
In src/minimodel/nn/transformer_block.py:
class TransformerBlock(Module):
    def forward(self, x: Tensor, causal_mask: Tensor) -> Tensor:
        h = self.norm1(x)
        # BUG: passing None instead of the causal_mask.
        attn_out = self.attention(h, h, h, mask=None)  # was: mask=causal_mask
        x = x + attn_out
        x = x + self.ffn(self.norm2(x))
        return x
Single-line edit. The mask is still constructed and passed into the block, but the block never forwards it to the attention module, so attention runs unmasked.
Predict, then run¶
What happens during training¶
Causal masking prevents token t from attending to tokens t+1, ..., T-1. Without it:
- At position t, the model attends to the entire sequence, including the answer.
- For LM, the answer at position t IS the next token (t+1). So position t can literally just attend to position t+1's embedding.
- Training loss drops to near 0 in <10 steps.
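The leak described above can be made concrete with a minimal NumPy sketch. It is an illustration only, not the `minimodel` attention code: `causal_mask` and `attention_weights` are hypothetical helpers, the scores are random stand-ins for `q @ k.T / sqrt(d)`, and the mask convention (True means "may attend") is an assumption.

```python
import numpy as np

def causal_mask(T: int) -> np.ndarray:
    """Boolean (T, T) mask: True where query row i may attend to key column j (j <= i)."""
    return np.tril(np.ones((T, T), dtype=bool))

def attention_weights(scores: np.ndarray, mask=None) -> np.ndarray:
    """Row-wise softmax over keys; masked-out entries receive weight exactly 0."""
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
T = 5
scores = rng.normal(size=(T, T))  # stand-in for real attention scores

leaky = attention_weights(scores, mask=None)              # what the broken block computes
causal = attention_weights(scores, mask=causal_mask(T))   # what it should compute

# Total weight each query position places on strictly-future key positions.
future_leaky = np.triu(leaky, k=1).sum(axis=-1)
future_causal = np.triu(causal, k=1).sum(axis=-1)
print("future mass, mask=None:  ", future_leaky)
print("future mass, causal mask:", future_causal)
```

With `mask=None`, every position except the last places some weight on the future; with the causal mask that weight is exactly zero.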
What happens at inference¶
At inference time, you can only show the model tokens you've already generated. The model has learned a "trick" — "look at the right neighbor" — that doesn't apply when there is no right neighbor (the future hasn't been generated yet).
The model collapses. Specifically:
- It typically predicts the start-of-sequence (BoS) token for every position, because at training time when the model couldn't peek (i.e., the first few tokens), BoS was a frequent answer. Phase 30 calls this "BoS collapse".
- Or it produces gibberish (random tokens) — the attention is doing nothing useful.
Predictions¶
- Train loss: drops to ~1e-4 in <10 steps. (Looks too good to be true; that's the diagnostic.)
- Val loss (computed in train mode, with future visible): also drops to near 0.
- Val loss (computed in true autoregressive inference mode): stays at log(V) ≈ 6.24 or worse.
- Sampling output: same token repeated, or BoS-collapse.
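The log(V) figure is the cross-entropy of a model that guesses uniformly over the vocabulary. The short check below assumes V = 512, a value inferred only from the quoted 6.24 and not stated in the source; substitute the real vocabulary size if it differs.

```python
import math

def uniform_cross_entropy(vocab_size: int) -> float:
    """Cross-entropy in nats of a uniform guess over vocab_size tokens."""
    return -math.log(1.0 / vocab_size)

V = 512  # assumed vocabulary size (not given in the source)
uniform_loss = uniform_cross_entropy(V)
print(f"{uniform_loss:.2f}")  # 6.24
```

An autoregressive val loss stuck at or above this value means the model is doing no better than chance once the future is hidden.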
Write predictions in learners/borja/phase-17/notes/breaks.md before running.
Observe¶
just exp 17-train-mini --tag broken-no-mask
just exp 17-sample --tag broken-no-mask # autoregressive sampling
Diagnostics:
- Plot train loss: if it drops below the irreducible entropy floor H(corpus) (which Phase 5 §02 derives), suspect data leakage / mask issues immediately.
- At inference, generate 10 sentences. They should be coherent §A13 conjugations; they will be garbage.
- Visualize attention matrices on the trained model: with rows as query positions, the upper triangle (future positions) should be all zeros. If you see attention to future positions, the mask is missing.
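The attention-matrix check can also be automated rather than eyeballed. The sketch below assumes the attention weights are available as an array of shape (..., T, T) with rows as queries and columns as keys; `future_attention_mass` is a hypothetical helper, and how the weights are extracted from the model is left open.

```python
import numpy as np

def future_attention_mass(attn: np.ndarray) -> float:
    """Largest total weight any query position puts on strictly-future keys.

    attn has shape (..., T, T); np.triu acts on the last two axes.
    A correctly masked model should give 0.0 (up to float error).
    """
    return float(np.triu(attn, k=1).sum(axis=-1).max())

T = 4
full = np.full((T, T), 1.0 / T)              # unmasked, uniform attention
lower = np.tril(np.ones((T, T)))
lower /= lower.sum(axis=-1, keepdims=True)   # causal, uniform over the prefix

print(future_attention_mass(full))   # 0.75
print(future_attention_mass(lower))  # 0.0
```

Any value clearly above zero on a trained model points at a missing or ignored mask.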
Symptom Borja will see¶
- Train loss < 1e-3 (suspiciously low, below the irreducible entropy).
- Sampling produces <BoS> <BoS> <BoS> ... or random gibberish.
- Attention matrix at any layer is full (the upper triangle is not zeroed out).
Hidden cause (one sentence)¶
The TransformerBlock.forward passes mask=None to attention instead of mask=causal_mask, so every position can attend to every other position (including future ones).
Hint cascade¶
- Print the attention matrix attn from layer 0 on a fixed input. What's its sparsity pattern? Should the upper triangle be all zeros?
- The train loss is suspiciously low. What is the irreducible entropy of the §A13 corpus (Phase 5 §02 derives it)? Is the model performing better than that?
- Generate text with model.sample(prompt="I will"). If it outputs BoS repeatedly, the model is collapsing, a common sign of training-vs-inference distribution mismatch.
Fix diff¶
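Going by the `# was:` comment in the break above, the fix is presumably the one-line revert. It is sketched here as a diff against the snippet shown earlier, not copied from the repository:

```diff
     def forward(self, x: Tensor, causal_mask: Tensor) -> Tensor:
         h = self.norm1(x)
-        attn_out = self.attention(h, h, h, mask=None)
+        attn_out = self.attention(h, h, h, mask=causal_mask)
         x = x + attn_out
```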
Why this teaches the concept¶
The causal mask is the distinguishing feature of decoder-only transformers like GPT: without it, you have a bidirectional encoder in the style of BERT. Forgetting the mask is one of the most common "this looks like it's working in training but breaks in production" bugs in transformer engineering. The break is intentionally subtle in train-mode diagnostics (the loss is just too low) but flagrant in sample mode (the model produces garbage). This is exactly the kind of bug Phase 19's "training looks fine" dashboard is meant to catch, and it is why Phase 20's eval harness must check autoregressive generation, not just teacher-forced loss. This break ties Phase 17 to Phases 19–20.
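One cheap way for an eval harness to catch this class of bug without sampling is a perturbation test: change only the inputs after position t and check that outputs at positions up to t stay put. The sketch below is a hedged illustration, not the Phase 20 harness. `leaks_future` and the two toy "models" are hypothetical stand-ins, and a real check would call the model's forward pass instead.

```python
import numpy as np

def leaks_future(model_fn, x: np.ndarray, t: int, atol: float = 1e-6) -> bool:
    """True if outputs at positions <= t change when inputs after t are altered."""
    x_perturbed = x.copy()
    x_perturbed[t + 1:] += 1.0  # modify only the "future" positions
    before = model_fn(x)[: t + 1]
    after = model_fn(x_perturbed)[: t + 1]
    return not np.allclose(before, after, atol=atol)

def causal_toy(x: np.ndarray) -> np.ndarray:
    """Running mean over the prefix: position i depends only on x[: i + 1]."""
    return np.cumsum(x, axis=0) / np.arange(1, len(x) + 1)[:, None]

def bidirectional_toy(x: np.ndarray) -> np.ndarray:
    """Global mean broadcast to every position: each position depends on all of x."""
    return np.broadcast_to(x.mean(axis=0), x.shape)

x = np.random.default_rng(0).normal(size=(6, 3))  # (sequence length, features)
print(leaks_future(causal_toy, x, t=2))         # False
print(leaks_future(bidirectional_toy, x, t=2))  # True
```

The same check applied to a block with `mask=None` would be expected to report a leak, since every output then depends on the whole sequence.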
Series complete: Phase 09 → Phase 17 break exercises.