Break 00 — Forget to apply the causal mask in mini-GPT's attention

We remove the causal mask from the attention block. The model now sees the future during training: when predicting at position t, it can look at tokens t+1, t+2, .... Prediction: the loss falls to nearly 0 within 5 steps. Validation cross-entropy also drops (the val split has the same leak). At token-by-token inference (with no future available), the model is lost: it predicts garbage or collapses to the BoS token. This is one of the most recurrent transformer bugs in the real world.

Anchors: LYNX_CORTEX.md §4 / PHASE 17; theory §01 transformer block; Phase 15 §04 masking; .claude/commands/break.md.


The break

In src/minimodel/nn/transformer_block.py:

class TransformerBlock(Module):
    def forward(self, x: Tensor, causal_mask: Tensor) -> Tensor:
        h = self.norm1(x)
        # BUG: passing None instead of the causal_mask.
        attn_out = self.attention(h, h, h, mask=None)   # was: mask=causal_mask
        x = x + attn_out
        x = x + self.ffn(self.norm2(x))
        return x

It is a single-line edit. The mask is still constructed and passed to the block, but the block never forwards it to the attention module, so attention runs unmasked.

Predict, then run

What happens during training

Causal masking prevents token t from attending to tokens t+1, ..., T-1. Without it:

  • At position t, the model attends to the entire sequence including the answer.
  • For LM, the answer at position t IS the next token (t+1). So position t can literally just attend to position t+1's embedding.
  • Training loss drops to near 0 in <10 steps.
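The effect of the mask on the attention weights can be sketched in a few lines of plain Python. This is a minimal illustration, not the minimodel implementation: the helper names `causal_mask` and `softmax_rows` and the convention "True means allowed" are assumptions made for this sketch, and the scores are set to a constant so that only the masking pattern is visible.

```python
import math

def causal_mask(T):
    """mask[i][j] is True where query position i may attend to key position j.

    Assumed convention for this sketch: True = allowed, i.e. j <= i.
    """
    return [[j <= i for j in range(T)] for i in range(T)]

def softmax_rows(scores, mask=None):
    """Row-wise softmax; entries disallowed by the mask get weight exactly 0."""
    weights = []
    for i, row in enumerate(scores):
        allowed = [mask is None or mask[i][j] for j in range(len(row))]
        # Subtract the max over allowed entries for numerical stability.
        m = max(s for s, ok in zip(row, allowed) if ok)
        exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(row, allowed)]
        z = sum(exps)
        weights.append([e / z for e in exps])
    return weights

T = 4
scores = [[0.0] * T for _ in range(T)]  # constant scores, for illustration only

unmasked = softmax_rows(scores)               # what mask=None produces
masked = softmax_rows(scores, causal_mask(T))  # what mask=causal_mask produces

print(unmasked[0])  # position 0 spreads weight over all 4 positions, future included
print(masked[0])    # position 0 can only attend to itself
```

With the mask, every weight above the diagonal is exactly zero; without it, position 0 already puts weight on positions 1 to 3, which is where the target token lives.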

What happens at inference

At inference time, you can only show the model tokens you've already generated. The model has learned a "trick" — "look at the right neighbor" — that doesn't apply when there is no right neighbor (the future hasn't been generated yet).

The model collapses. Specifically:

  • It typically predicts the start-of-sequence (BoS) token for every position, because at training time when the model couldn't peek (i.e., the first few tokens), BoS was a frequent answer. Phase 30 calls this "BoS collapse".
  • Or it produces gibberish (random tokens) — the attention is doing nothing useful.
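The train-versus-inference mismatch can be caricatured with a toy "model" that has learned nothing except the shortcut. Everything here is hypothetical: `cheating_predict` stands in for the trained network, and its fallback to `BOS` is hard-coded to mirror the collapse described above, whereas a real model's fallback comes out of its training statistics.

```python
BOS = 0  # hypothetical token id for the beginning-of-sequence marker

def cheating_predict(context, t):
    """Caricature of the learned shortcut: copy the right neighbor if it is visible."""
    if t + 1 < len(context):
        return context[t + 1]
    return BOS  # nothing to copy from, so fall back to a fixed token

def teacher_forced_accuracy(seq):
    """Score next-token predictions with the whole sequence visible (no mask)."""
    hits = sum(cheating_predict(seq, t) == seq[t + 1] for t in range(len(seq) - 1))
    return hits / (len(seq) - 1)

def generate(prompt, n_new):
    """Autoregressive decoding: the context holds only tokens produced so far."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(cheating_predict(out, len(out) - 1))
    return out

seq = [BOS, 7, 3, 9, 4]
print(teacher_forced_accuracy(seq))  # 1.0: perfect while the future is visible
print(generate([BOS, 7], 4))         # [0, 7, 0, 0, 0, 0]: the shortcut has nothing to copy
```

The same function scores perfectly when the full sequence is in view and degenerates the moment the right neighbor does not exist yet, which is the situation at every generation step.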

Predictions

  1. Train loss: drops to ~1e-4 in <10 steps. (Looks too good to be true — that's the diagnostic.)
  2. Val loss (computed in train mode, with future visible): also drops to near 0.
  3. Val loss (computed in true autoregressive inference mode): stays at log(V) ≈ 6.24 or worse.
  4. Sampling output: same token repeated, or BoS-collapse.
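The log(V) figure in prediction 3 is the cross-entropy of a model that guesses uniformly over the vocabulary. The check below assumes natural logarithms and a vocabulary size of V = 512; that value is an inference from log(V) ≈ 6.24, not something stated here, so substitute the real vocabulary size.

```python
import math

def uniform_cross_entropy(vocab_size):
    """Cross-entropy (in nats) of a uniform guess over vocab_size tokens."""
    return math.log(vocab_size)

V = 512  # assumption: chosen because ln(512) matches the quoted 6.24
baseline = uniform_cross_entropy(V)
print(round(baseline, 2))  # 6.24
```

An autoregressive val loss at or above this baseline means the model is doing no better than chance once the future is hidden.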

Write predictions in learners/borja/phase-17/notes/breaks.md before running.

Observe

just exp 17-train-mini --tag broken-no-mask
just exp 17-sample --tag broken-no-mask  # autoregressive sampling

Diagnostics:

  1. Plot train loss — if it drops below the irreducible entropy floor H(corpus) (which Phase 5 §02 derives), suspect data leakage / mask issues immediately.
  2. At inference, generate 10 sentences. A correct model would produce coherent §A13 conjugations; the broken one produces garbage.
  3. Visualize attention matrices on the trained model: each matrix should be lower-triangular, with zeros strictly above the diagonal (causal). If you see attention to future positions, the mask is missing.
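Diagnostic 3 can also be automated instead of eyeballed. The sketch below assumes the attention weights are available as a T x T nested list with rows as query positions and columns as key positions; how minimodel exposes them is not covered here, and the names `future_attention_mass` and `is_causal` are hypothetical.

```python
def future_attention_mass(attn):
    """Largest weight that any query position i places on a key position j > i."""
    worst = 0.0
    for i, row in enumerate(attn):
        for j in range(i + 1, len(row)):
            worst = max(worst, abs(row[j]))
    return worst

def is_causal(attn, tol=1e-6):
    """True when nothing above the diagonal exceeds the tolerance."""
    return future_attention_mass(attn) <= tol

# Hand-written example matrices, for illustration only.
causal_example = [[1.0, 0.0, 0.0],
                  [0.5, 0.5, 0.0],
                  [0.2, 0.3, 0.5]]
leaky_example = [[0.2, 0.3, 0.5],
                 [0.1, 0.1, 0.8],
                 [0.3, 0.3, 0.4]]

print(is_causal(causal_example))  # True
print(is_causal(leaky_example))   # False
```

A check like this can run on a fixed input for every layer, so a missing mask shows up as a failed assertion rather than a plot someone has to inspect.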

Symptom Borja will see

  • Train loss < 1e-3 (suspiciously low — below the irreducible entropy).
  • Sampling produces <BoS> <BoS> <BoS> ... or random gibberish.
  • Attention matrix at any layer is full (nonzero entries above the diagonal, where a causal matrix has zeros).

Hidden cause (one sentence)

The TransformerBlock.forward passes mask=None to attention instead of mask=causal_mask, so every position can attend to every other position (including future ones).

Hint cascade

  1. Print the attention matrix attn from layer 0 on a fixed input. What's its sparsity pattern? Should any entry above the diagonal be nonzero?
  2. The train loss is suspiciously low. What is the irreducible entropy of the §A13 corpus (Phase 5 §02 derives it)? Is the model performing better than that?
  3. Generate text with model.sample(prompt="I will"). If it outputs BoS repeatedly, the model is collapsing, a common sign of training-vs-inference distribution mismatch.

Fix diff

-        attn_out = self.attention(h, h, h, mask=None)
+        attn_out = self.attention(h, h, h, mask=causal_mask)
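One way to guard this fix against regression is a perturbation test: change the last input position and verify that no earlier output moves. The sketch below uses a stand-in single-head attention with q = k = v = x written in plain Python; it is not the minimodel attention module, and `attention` and `leaks_future` are hypothetical names. The same idea should carry over to the real block by passing its forward function in place of the stand-in.

```python
import math

def attention(x, causal):
    """Stand-in single-head dot-product attention with q = k = v = x."""
    T, d = len(x), len(x[0])
    out = []
    for i in range(T):
        last = i if causal else T - 1  # last key position visible to query i
        scores = [sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d)
                  for j in range(last + 1)]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(w[j] / z * x[j][c] for j in range(last + 1))
                    for c in range(d)])
    return out

def leaks_future(fn, x, eps=1.0):
    """Perturb the last position; report whether any earlier output changes."""
    y = fn(x)
    x2 = [row[:] for row in x]
    x2[-1] = [v + eps for v in x2[-1]]
    y2 = fn(x2)
    return any(abs(a - b) > 1e-9
               for ra, rb in zip(y[:-1], y2[:-1])
               for a, b in zip(ra, rb))

x = [[0.1, 0.2], [0.4, -0.3], [-0.5, 0.6]]
print(leaks_future(lambda t: attention(t, causal=True), x))   # False: masked
print(leaks_future(lambda t: attention(t, causal=False), x))  # True: mask missing
```

Unlike a loss threshold, this test does not depend on training at all, so it can run as a unit test on a freshly initialized block.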

Why this teaches the concept

The causal mask is what makes a decoder-only transformer like GPT autoregressive; without it, attention is bidirectional, as in encoder-only models like BERT. Forgetting the mask is one of the most common "this looks like it's working in training but breaks in production" bugs in transformer engineering. The break is intentionally subtle in train-mode diagnostics (the loss is merely too low) but flagrant in sample mode (the model produces garbage). This is exactly the kind of bug Phase 19's "training looks fine" dashboard is meant to catch, and it is why Phase 20's eval harness must check autoregressive generation, not just teacher-forced loss. This break ties Phase 17 to Phases 19–20.


Series complete: Phase 09 → Phase 17 break exercises.