Lab 03 — Causality by perturbation

Read theory/01-transformer-block.md (§"Causal masking — still required, even with RoPE"). Do not consult solutions/.

Objective

Verify that the causal mask in your Mini-GPT is correctly wired all the way from input to output: changing the input token at position \(i\) should change output logits only at positions \(j \ge i\). Position-\(i\) inputs must not leak backward into positions \(j < i\). This is the sanity check that distinguishes a real autoregressive language model from a bidirectional one (BERT-style). Get it wrong and the model can read the very tokens it is asked to predict, so Phase 18's training loss will look good for the wrong reason.

Background

In an autoregressive language model, the prediction at position \(i\) should depend only on positions \(0, 1, \ldots, i\). The causal mask enforces this by setting attention scores from position \(i\) to positions \(j > i\) to \(-\infty\) before softmax. Position 0 attends to position 0 only; position 7 attends to 0–7.
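The masking step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not your MHA code: the helper names `causal_mask` and `softmax` are hypothetical, and the random `scores` matrix stands in for whatever your attention layer computes from Q and K.

```python
import numpy as np

def causal_mask(T):
    """(T, T) additive mask: 0 where j <= i, -inf where j > i (hypothetical helper)."""
    return np.triu(np.full((T, T), -np.inf), k=1)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # row max is finite: the diagonal is never masked
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T = 8
scores = rng.standard_normal((T, T))        # stand-in for the real attention scores
weights = softmax(scores + causal_mask(T))  # mask is added BEFORE softmax

# Row i has nonzero weight only on columns 0..i, and every row still sums to 1.
assert np.all(np.triu(weights, k=1) == 0.0)
assert np.allclose(weights.sum(axis=-1), 1.0)
assert weights[0, 0] == 1.0                 # position 0 attends only to itself
```

Because `exp(-inf)` is exactly zero, the masked positions drop out of both the numerator and the normalising sum, which is why the prefix can be compared with a very tight tolerance later in the lab.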

A standard sanity check: pick two input sequences that differ only at position \(k\). Run them both. The output at every position \(j < k\) should be identical (causality holds). The output at every position \(j \ge k\) may differ (and almost certainly will, except by coincidence).

Tasks

Task 1 — perturbation test

In tests/test_phase17_causality.py:

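import numpy as np
# MiniGPT and config come from your own Phase 17 code; import them from wherever you defined them.

def test_causal_mask_holds_end_to_end():
    model = MiniGPT(config)
    tokens_a = np.array([3, 1, 4, 1, 5, 9, 2, 6])  # arbitrary 8 tokens, all < vocab_size
    tokens_b = tokens_a.copy()
    perturb_at = 5
    tokens_b[perturb_at] = (tokens_a[perturb_at] + 7) % config.vocab_size  # change
    # Guard: if vocab_size divides 7 the "perturbed" token is unchanged.
    assert tokens_b[perturb_at] != tokens_a[perturb_at], "perturbation was a no-op"

    logits_a = model(tokens_a)  # indexed by position along the first axis
    logits_b = model(tokens_b)

    # Causality: positions 0..(perturb_at-1) must be identical.
    # rtol=0 matters: np.allclose defaults to rtol=1e-5, which would swamp atol=1e-8.
    for j in range(perturb_at):
        assert np.allclose(logits_a[j], logits_b[j], rtol=0.0, atol=1e-8), \
            f"causality broken at position {j} (perturbed at {perturb_at})"

    # Positions perturb_at..T-1 should differ at least somewhere.
    differs = any(
        not np.allclose(logits_a[j], logits_b[j], rtol=0.0, atol=1e-8)
        for j in range(perturb_at, len(tokens_a))
    )
    assert differs, "perturbation had no downstream effect — model not connected"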

This is a one-shot test. Run it.

Task 2 — sweep across positions

Repeat the test for every \(k\) from 1 to \(T-1\). Each pass should respect causality. Collect into a table:

| Perturb position \(k\) | Positions 0..\(k-1\) identical? | Positions \(k\)..\(T-1\) differ? |
| --- | --- | --- |
| 1 | ✓ / ✗ | ✓ / ✗ |
| 2 | ... | ... |
| ... | ... | ... |

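All rows should be ✓ ✓. Any ✗ in the "identical" column is a fatal bug: information is leaking backward. A ✗ in the "differ" column means the perturbation never reached the output, which points to a disconnected model rather than a causality violation.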
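One way to organise the sweep is a helper that takes any forward function and returns one row per perturbed position. The sketch below is an assumption about structure, not a required design: `causality_sweep` and `toy_forward` are hypothetical names, and the toy model (a running sum of one-hot vectors) only stands in for your MiniGPT so the snippet runs on its own.

```python
import numpy as np

def causality_sweep(forward, tokens, vocab_size, atol=1e-8):
    """Perturb each position k in 1..T-1; return rows of (k, prefix_same, suffix_differs)."""
    tokens = np.asarray(tokens)
    base = forward(tokens)
    rows = []
    for k in range(1, len(tokens)):
        perturbed = tokens.copy()
        perturbed[k] = (tokens[k] + 1) % vocab_size  # always a different token if vocab_size > 1
        out = forward(perturbed)
        # rtol=0.0 so that atol is the only tolerance in play.
        prefix_same = bool(np.allclose(base[:k], out[:k], rtol=0.0, atol=atol))
        suffix_differs = not np.allclose(base[k:], out[k:], rtol=0.0, atol=atol)
        rows.append((k, prefix_same, suffix_differs))
    return rows

def toy_forward(tokens, vocab_size=16):
    """Toy stand-in for the model: row j depends on tokens 0..j only (causal by construction)."""
    onehot = np.eye(vocab_size)[tokens]   # (T, vocab_size)
    return np.cumsum(onehot, axis=0)

rows = causality_sweep(toy_forward, [3, 1, 4, 1, 5, 9, 2, 6], vocab_size=16)
for k, same, differs in rows:
    print(k, "yes" if same else "NO", "yes" if differs else "NO", sep="\t")
```

To use it on the real model, pass `model` (or a small wrapper around it) as `forward` and `config.vocab_size` as `vocab_size`; the returned rows map directly onto the table columns.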

Task 3 — what fails when the mask is wrong?

For learning's sake, temporarily disable the causal mask in your MHA implementation (don't commit this change). Re-run Task 1. You should see that perturbing position 5 changes output at positions 0–4 — which is exactly what BERT-style models do, and exactly the bug an autoregressive model must not have.

Document what changed. Re-enable the mask. This step builds the intuition: the mask is doing real work.
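If you want to see the effect in isolation before touching your own MHA code, the following toy reproduces it. It is a sketch under simplifying assumptions: a single head, identity projections instead of learned Q/K/V weights, and a hypothetical `toy_attention` function with a `causal` switch.

```python
import numpy as np

def toy_attention(x, causal=True):
    """Single-head self-attention with identity projections; x has shape (T, d)."""
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    if causal:
        scores = scores + np.triu(np.full((T, T), -np.inf), k=1)
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ x

rng = np.random.default_rng(0)
x_a = rng.standard_normal((8, 4))
x_b = x_a.copy()
x_b[5] += 1.0  # perturb position 5 only

def changed_positions(causal):
    """Indices of output rows that moved by more than 1e-8 after the perturbation."""
    out_a = toy_attention(x_a, causal)
    out_b = toy_attention(x_b, causal)
    return np.flatnonzero(np.abs(out_a - out_b).max(axis=-1) > 1e-8)

print("masked:  ", changed_positions(True))
print("unmasked:", changed_positions(False))
```

With the mask on, the changed positions should all be 5 or later; with it off, positions before 5 should appear in the list as well, which is the backward leak this task asks you to document.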

Task 4 — verify with a longer sequence

Run the test at sequence length \(T = 32\) (the locked context_len). Perturb each of \(k = 1, 8, 16, 31\). Confirm causality at all four positions.

This catches RoPE bugs that show up only at longer sequences (e.g., a RoPE table that doesn't extend past 8).

Task 5 — gradient-causality test (forward-looking)

This task is optional — it requires the Phase 8 autograd to be wired through the model. If skipping, document why.

If you can: compute \(\partial \text{logits}_j / \partial \text{input\_embed}_i\) for a few sample pairs \((i, j)\).

For \(i > j\): gradient should be zero (causality on the gradient, not just the forward).
For \(i \le j\): gradient should generally be nonzero.

The gradient test is stronger than the perturbation test, because it catches subtle leakage that perturbation might miss by coincidence.
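If the Phase 8 autograd is not wired through yet, a central finite-difference estimate can approximate the same check on a small function. Note that this swaps in numerical differentiation for autograd, so it is a rough substitute rather than the test the task describes. The names `toy_causal_attention` and `dependency_matrix` are hypothetical, and the toy attention (single head, identity projections) stands in for the real model.

```python
import numpy as np

def toy_causal_attention(x):
    """Single-head causal self-attention with identity projections; x is (T, d)."""
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d) + np.triu(np.full((T, T), -np.inf), k=1)
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return (w / w.sum(axis=-1, keepdims=True)) @ x

def dependency_matrix(f, x, eps=1e-4):
    """dep[j, i] = largest |d f(x)[j] / d x[i]| over components, by central differences."""
    T, d = x.shape
    dep = np.zeros((T, T))
    for i in range(T):
        for c in range(d):
            step = np.zeros_like(x)
            step[i, c] = eps
            grad = (f(x + step) - f(x - step)) / (2 * eps)  # shape (T, d_out)
            dep[:, i] = np.maximum(dep[:, i], np.abs(grad).max(axis=-1))
    return dep

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 4))
dep = dependency_matrix(toy_causal_attention, x)

# Input i > output j (strictly above the diagonal): gradient should vanish.
assert np.all(np.triu(dep, k=1) < 1e-8)
# Input i <= output j: gradient should generally be nonzero.
assert np.all(dep[np.tril_indices(6)] > 1e-8)
```

The cost is two forward passes per input component, so this only scales to tiny shapes; it is meant as a cross-check, not a replacement for the autograd version.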

Measurements to capture

Sweep table (Task 2): all rows ✓ ✓.
Counterfactual diff (Task 3): document one case where the unmasked model leaked information backward.
Long-sequence sweep (Task 4): all four perturb positions respect causality.
(Optional) Gradient-causality result, if you got Task 5 working.

Save in experiments/<date>-phase-17-causality/manifest.json plus a CSV of the sweep.
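A small writer for these two files might look like the sketch below. Everything about the layout is an assumption: the lab does not specify a manifest schema, a CSV file name, or a date format, so the field names, `sweep.csv`, and the ISO date used for `<date>` are all hypothetical choices.

```python
import csv
import json
from datetime import date
from pathlib import Path

def save_sweep(rows, root="experiments", seq_len=8, atol=1e-8):
    """Write sweep rows (k, prefix_identical, suffix_differs) plus a manifest; return the directory."""
    rows = list(rows)
    # Assumption: <date> is an ISO date such as 2024-01-31.
    out_dir = Path(root) / f"{date.today().isoformat()}-phase-17-causality"
    out_dir.mkdir(parents=True, exist_ok=True)

    with open(out_dir / "sweep.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["perturb_k", "prefix_identical", "suffix_differs"])
        writer.writerows(rows)

    manifest = {  # hypothetical schema, adapt to your experiment conventions
        "phase": 17,
        "seq_len": seq_len,
        "atol": atol,
        "all_causal": all(same for _, same, _ in rows),
        "sweep_csv": "sweep.csv",
    }
    (out_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return out_dir
```

Calling `save_sweep(rows)` with the rows from the Task 2 sweep would then produce both artifacts in one dated directory.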

Acceptance

test_phase17_causality.py exists and passes.
Sweep table populated for all \(k\) from 1 to \(T-1\).
Task 3 documented: unmasked model breaks causality; masked model does not.
\(T = 32\) test passes.
Lab notes contain one paragraph on why RoPE alone is not sufficient for causality.

Pitfalls to expect

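Atol too loose. Use atol=1e-8 for the "identical" check, not 1e-5, and pass rtol=0 as well: np.allclose applies a default rtol=1e-5 on top of atol, which is far looser than 1e-8 for logits of ordinary magnitude. With a correct mask the prefix should be bit-exact, because masked positions contribute exactly zero weight; if you need slack, you have a bug.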
Mask applied at wrong stage. Common bug: mask applied after softmax instead of before. After softmax, the future tokens already contributed to the partition function — your "causal" model leaks. Always pre-softmax: add -inf to scores, then softmax.
Mask shape. Shape (T, T) with mask[i, j] = -inf if j > i else 0. Broadcasts against scores (n_heads, T, T) cleanly.
RoPE applied after the mask. Order matters: project to Q/K, apply RoPE, compute scores, apply mask, softmax, weighted sum of V, output proj. If RoPE is applied after the mask (sometimes seen in older code), you may corrupt the masking.
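The pre-softmax versus post-softmax pitfall is easy to demonstrate on raw scores. In this sketch, `weights_pre` and `weights_post` are hypothetical names for the correct and the buggy ordering; only a single "future" score in row 0 is changed between the two inputs.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def weights_pre(scores):
    """Correct order: add -inf above the diagonal, then softmax."""
    T = scores.shape[-1]
    return softmax(scores + np.triu(np.full((T, T), -np.inf), k=1))

def weights_post(scores):
    """Buggy order: softmax over all positions, then zero out the future."""
    return np.tril(softmax(scores))

rng = np.random.default_rng(0)
s_a = rng.standard_normal((4, 4))
s_b = s_a.copy()
s_b[0, 3] += 2.0  # change only a future score seen from position 0

print(np.array_equal(weights_pre(s_a)[0], weights_pre(s_b)[0]))    # True: row 0 is unaffected
print(np.array_equal(weights_post(s_a)[0], weights_post(s_b)[0]))  # False: the normaliser leaked
```

In the buggy version the future score never appears as a weight, yet it still changes the softmax denominator, so the surviving weights move. The (T, T) mask built with `np.triu` also broadcasts against a leading `n_heads` axis, matching the shape note above.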

Next: Phase 18 — Training Loop, Mixed Precision Preview, Checkpointing (after /quiz 17 and /phase-report 17).