Lab 00 — Permutation Equivariance, in Numbers

Goal: demonstrate empirically that attention without positional information is permutation-equivariant. Show that adding sinusoidal PE breaks this property.

Estimated time: 30–45 minutes.

Prereq: theory/00-motivation.md read. Phase 15's src/minimodel/attention/ exists.


What you produce

A directory experiments/16-permutation-equivariance/ containing:

  • demo.py — script that runs attention with and without PE on a 3-token sequence and shows what happens under permutation.
  • demo_output.txt — captured printout.
  • manifest.json.
  • README.md.

TODOs

Block A — set up

  • Use MultiHeadAttention(d_model=8, n_heads=1, seed=0) from Phase 15.
  • Input: 3 tokens forming a verb-grammar fragment — he, work, I (token IDs from your Phase 14 tokenizer; embed via the Phase 13 embedding). Stack as \(X \in \mathbb{R}^{3 \times 8}\).
  • Linguistic motivation: he work I and I work he are both ungrammatical, yet they are different orderings of the same tokens. Without PE, attention treats every ordering of those tokens the same way. Only with positional information can the model come to prefer one ordering over another (for example, he works over works he, later in Phase 18).

Block B — without PE

  • Compute Y = mha.forward(X, mask=None).
  • Permute the input: X_perm = X[[2, 0, 1]] (a cyclic shift, not a pairwise swap: the new rows are old rows 2, 0, 1, in that order).
  • Compute Y_perm = mha.forward(X_perm, mask=None).
  • Assert: np.allclose(Y_perm, Y[[2, 0, 1]], atol=1e-6).

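This demonstrates equivariance on this example: the attention output on the permuted input equals the same permutation applied to the attention output on the original input. Reordering the input rows only reorders the output rows, so the layer's output carries no information about which ordering it received.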

  • Print Y and Y[[2, 0, 1]] and Y_perm side-by-side. Verify visually.
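
The Block B check can be sketched without the Phase 15 code. The snippet below is only an illustration under stated assumptions: toy_attention is a hypothetical single-head stand-in with random weights, not the course's MultiHeadAttention, and X is random rather than real token embeddings. In demo.py, use mha.forward as the steps above describe.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax, shifted by the row max for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def toy_attention(X, W_q, W_k, W_v, W_o):
    # Hypothetical stand-in: one head, no mask, no biases, no positional input.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V @ W_o

rng = np.random.default_rng(0)  # seeded, as the lab requires
T, d_model = 3, 8
W_q, W_k, W_v, W_o = (
    rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(4)
)
X = rng.normal(size=(T, d_model))  # placeholder for the three embedded tokens

perm = [2, 0, 1]
Y = toy_attention(X, W_q, W_k, W_v, W_o)
Y_perm = toy_attention(X[perm], W_q, W_k, W_v, W_o)

# Equivariance: attention(permuted X) equals permuted attention(X).
max_diff_no_pe = float(np.abs(Y_perm - Y[perm]).max())
assert np.allclose(Y_perm, Y[perm], atol=1e-6)

# Side-by-side view: the last two arrays should match row for row.
for name, arr in [("Y", Y), ("Y[perm]", Y[perm]), ("Y_perm", Y_perm)]:
    print(name)
    print(np.round(arr, 4))
```

The value max_diff_no_pe is the kind of number the manifest's without_PE_equivariance_max_diff field is meant to hold.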

Block C — with PE

  • Use sinusoidal PE: pe = sinusoidal_pe(3, 8) (from Phase 16's src/minimodel/positional/sinusoidal.py).
  • Compute Y_pe = mha.forward(X + pe).
  • Compute Y_perm_pe = mha.forward(X_perm + pe).
  • Assert: not np.allclose(Y_perm_pe, Y_pe[[2, 0, 1]], atol=1e-3).

This shows that, with PE, the model does distinguish the permutation: the output on the permuted input is no longer just a reordering of the original output.

  • Print the diff matrix Y_perm_pe - Y_pe[[2, 0, 1]] (it should have non-trivial entries — the PE has broken equivariance).
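
A self-contained sketch of the Block C check follows. It rests on assumptions: sinusoidal_pe_sketch uses the common sin/cos formulation with base 10000 and may differ in detail from the course's sinusoidal_pe, and toy_attention is a hypothetical single-head stand-in with random weights rather than the Phase 15 MultiHeadAttention.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax, shifted by the row max for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def toy_attention(X, W_q, W_k, W_v):
    # Hypothetical stand-in: one head, no mask, no biases.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def sinusoidal_pe_sketch(T, d_model):
    # Assumed formulation (d_model even):
    #   pe[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    #   pe[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    pos = np.arange(T)[:, None]
    two_i = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000.0 ** (two_i / d_model))
    pe = np.zeros((T, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

rng = np.random.default_rng(0)
T, d_model = 3, 8
W_q, W_k, W_v = (
    rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(3)
)
X = rng.normal(size=(T, d_model))  # placeholder for the three embedded tokens
pe = sinusoidal_pe_sketch(T, d_model)
perm = [2, 0, 1]

Y_pe = toy_attention(X + pe, W_q, W_k, W_v)
# The tokens move, but the PE rows stay attached to positions 0, 1, 2.
Y_perm_pe = toy_attention(X[perm] + pe, W_q, W_k, W_v)

diff = Y_perm_pe - Y_pe[perm]
max_diff_pe = float(np.abs(diff).max())
assert not np.allclose(Y_perm_pe, Y_pe[perm], atol=1e-3)
print(np.round(diff, 3))
```

The key line is X[perm] + pe: the positional rows are not permuted along with the tokens, which is what breaks the symmetry.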

Block D — interpret

In README.md (1–2 paragraphs), answer:

  1. Why does the without-PE test pass? State the permutation-equivariance theorem in your own words and reference the 3-token example.
  2. Why does the with-PE assertion (non-equivariance) hold? The PE rows differ across positions. After permutation, each token is added to the PE row of its new position rather than its original one, so X_perm + pe is not a row permutation of X + pe, and the two outputs are no longer related by the permutation.
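
As a starting point for both answers, here is one way to write the argument. It is a sketch for a single head with no mask and no bias terms. The symbols \(P\) (a \(3 \times 3\) permutation matrix with \(X_{\text{perm}} = PX\)), \(W_Q, W_K, W_V\) (projection matrices), \(d_k\) (key dimension), and \(E\) (the matrix of PE rows) are notation introduced here, not taken from the course code.

\[
\begin{aligned}
A(X) &= \operatorname{softmax}\!\left(\frac{S(X)}{\sqrt{d_k}}\right) X W_V,
\qquad S(X) = X W_Q W_K^\top X^\top, \\
S(PX) &= P\, S(X)\, P^\top, \\
\operatorname{softmax}\!\left(P\, S\, P^\top\right) &= P \operatorname{softmax}(S)\, P^\top
\quad \text{(softmax acts row by row)}, \\
A(PX) &= P \operatorname{softmax}\!\left(\frac{S(X)}{\sqrt{d_k}}\right) P^\top P\, X W_V = P\, A(X)
\quad \text{since } P^\top P = I.
\end{aligned}
\]

An output projection multiplies on the right and does not change the conclusion. With PE, the two quantities compared in Block C are

\[
A(PX + E) \quad \text{and} \quad P\, A(X + E) = A(PX + PE),
\]

which agree in general only when \(PE = E\). That fails for any non-identity \(P\) when the rows of \(E\) are distinct, as they are for sinusoidal PE over the three positions used here.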

Block E — manifest

{
  "experiment": "16-permutation-equivariance",
  "date": "YYYY-MM-DD",
  "seed": 0,
  "versions": { "python": "3.11.x", "numpy": "X.Y.Z" },
  "config": {
    "d_model": 8,
    "n_heads": 1,
    "T": 3,
    "pe_scheme_compared": "sinusoidal"
  },
  "results_summary": {
    "without_PE_equivariance_max_diff": null,
    "with_PE_equivariance_max_diff": null
  }
}

Replace the two null fields with the measured maximum absolute differences. The without-PE diff should be < 1e-6. The with-PE diff should be > 1e-3.
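
One possible way to fill in the manifest from demo.py is sketched below. The helper names max_abs_diff and build_manifest are hypothetical, and the two numbers passed in at the bottom are placeholders that exercise the function, not measured results.

```python
import datetime
import json
import platform

import numpy as np

def max_abs_diff(a, b):
    # Largest absolute elementwise difference, as a plain float for JSON.
    return float(np.abs(np.asarray(a) - np.asarray(b)).max())

def build_manifest(without_pe_diff, with_pe_diff):
    # Mirrors the manifest layout shown above; versions are read at runtime.
    return {
        "experiment": "16-permutation-equivariance",
        "date": datetime.date.today().isoformat(),
        "seed": 0,
        "versions": {
            "python": platform.python_version(),
            "numpy": np.__version__,
        },
        "config": {
            "d_model": 8,
            "n_heads": 1,
            "T": 3,
            "pe_scheme_compared": "sinusoidal",
        },
        "results_summary": {
            "without_PE_equivariance_max_diff": float(without_pe_diff),
            "with_PE_equivariance_max_diff": float(with_pe_diff),
        },
    }

# Placeholder inputs only. In demo.py, pass
# max_abs_diff(Y_perm, Y[[2, 0, 1]]) and max_abs_diff(Y_perm_pe, Y_pe[[2, 0, 1]]).
manifest = build_manifest(0.0, 0.5)
manifest_text = json.dumps(manifest, indent=2)
print(manifest_text)
```

Writing manifest_text to experiments/16-permutation-equivariance/manifest.json completes Block E.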

Constraints

  • No new code in src/. Use existing MultiHeadAttention and sinusoidal_pe. This lab is a demonstration, not an implementation lab.
  • Seeded. Reproducible.

Stop conditions

Done when:

  1. All four files committed.
  2. Both assertions pass (one for equivariance without PE, one for non-equivariance with PE).
  3. README.md explains the result.

Pitfalls

  • Permutation index confusion. X[[2, 0, 1]] means "take row 2, row 0, row 1 in that order". Confirm by printing X and X[[2, 0, 1]] to make sure you understand.
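  • Tolerance. Use atol=1e-6 for the without-PE check, where only rounding error should remain, and atol=1e-3 for the with-PE check. The PE values are O(1), so the with-PE difference after attention should sit well above 1e-3.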
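
Both pitfalls can be checked in a few lines. This is a small illustration on made-up arrays, not part of the required demo.py. Note that np.allclose also applies a relative tolerance (rtol, default 1e-5) on top of atol.

```python
import numpy as np

# Each row of X holds its own index, so permuted rows are easy to track.
X = np.arange(3, dtype=float)[:, None] * np.ones((1, 2))
perm = [2, 0, 1]
X_perm = X[perm]  # new row 0 is old row 2, new row 1 is old row 0, ...
print(X_perm[:, 0])

# The same operation written as a permutation matrix.
P = np.eye(3)[perm]
assert np.array_equal(P @ X, X_perm)

# argsort of the permutation gives its inverse and restores the original order.
inv = np.argsort(perm)
assert np.array_equal(X_perm[inv], X)

# Tolerance: a gap of 5e-4 passes atol=1e-3 but fails atol=1e-6.
a = np.array([1.0])
b = np.array([1.0 + 5e-4])
assert np.allclose(a, b, atol=1e-3)
assert not np.allclose(a, b, atol=1e-6)
```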

When to consult solutions/

After all four files are committed. Solution at solutions/00-permutation-equivariance-ref.md.


Next lab: 01-sinusoidal-pe.md.