Break 00: Shuffle the positional encodings across positions

We take the sinusoidal positional encodings and shuffle the rows. PE[t] no longer corresponds to position t but to a random position. Attention still works, but the model receives permuted positional information. Prediction: the loss stays at log(V) (the model learns nothing), because position is now noise. Comparing against "no PE" shows that some positional information, even if inconsistent, is structurally different from none.

Anchors: LYNX_CORTEX.md §4 / PHASE 16; theory §00 permutation-equivariance; theory §01 sinusoidal; .claude/commands/break.md.

The break

In src/minimodel/nn/positional.py:

class SinusoidalPositionalEncoding(Module):
    def __init__(self, max_len: int, d_model: int, seed: int = 0) -> None:
        super().__init__()
        pe = self._build_pe(max_len, d_model)  # shape (max_len, d_model)

        # BUG (intentional for this break): shuffle the rows of pe with a fixed, seeded permutation.
        rng = np.random.default_rng(seed)
        perm = rng.permutation(max_len)
        pe = pe[perm]  # row t now holds the encoding of position perm[t]
        self.pe = Tensor(pe, requires_grad=False)

The PE values are still valid sinusoidal vectors; they're just attached to the wrong positions. The pattern at t = 3 might be the pattern that should be at t = 17.
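
The effect of the shuffle can be reproduced outside the repository with plain NumPy. The sketch below is an illustration under stated assumptions, not the project's code: `build_sinusoidal_pe` is a hypothetical stand-in for `_build_pe` (whose body is not shown here) and assumes the standard interleaved sin/cos layout with base 10000 and an even `d_model`.

```python
import numpy as np


def build_sinusoidal_pe(max_len: int, d_model: int) -> np.ndarray:
    """Hypothetical stand-in for _build_pe: sin/cos table, base 10000."""
    positions = np.arange(max_len)[:, None]                     # (max_len, 1)
    freqs = 10000.0 ** (-np.arange(0, d_model, 2) / d_model)    # (d_model / 2,)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(positions * freqs)   # even columns: sine
    pe[:, 1::2] = np.cos(positions * freqs)   # odd columns: cosine
    return pe


max_len, d_model, seed = 32, 16, 0
pe = build_sinusoidal_pe(max_len, d_model)

perm = np.random.default_rng(seed).permutation(max_len)
shuffled = pe[perm]

# Row t of the shuffled table is the encoding built for position perm[t].
t = 3
same_row = np.array_equal(shuffled[t], pe[perm[t]])

# No vector is altered: undoing the permutation recovers the original table.
inverse = np.argsort(perm)
recovered = np.array_equal(shuffled[inverse], pe)

# The same seed always yields the same permutation.
perm_again = np.random.default_rng(seed).permutation(max_len)
deterministic = np.array_equal(perm, perm_again)

print(same_row, recovered, deterministic)
```

Because the permutation is seeded, every instance constructed with the same `seed` and `max_len` sees the same scrambled table.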

Predict, then run

The model relies on PE to break attention's permutation-equivariance (theory §00). Sinusoidal PE encodes position via a specific phase relationship — PE[t] and PE[t+1] are "near" each other in cosine similarity. Shuffling destroys this neighbor-relation.

The attention then has positional information that is:

Per-position consistent (the permutation is fixed at construction, so PE[3] is the same vector in every forward pass).
Globally meaningless (position 3 may carry the vector built for position 17, so its nearest neighbors in encoding space are arbitrary positions rather than 2 and 4).

This means:

Identical sentences get identical embeddings (so the model can technically memorize them).
But the model can't learn that "the next word depends on the position just before it" — because "just before" is no longer detectable from the PE.
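
The permutation-equivariance that PE is supposed to break can be checked numerically. The sketch below uses a hypothetical single-head, unmasked `self_attention` written in plain NumPy (not the project's attention module); with a causal mask the equivariance would no longer hold exactly, so treat this as an illustration of the unmasked case only.

```python
import numpy as np


def self_attention(x: np.ndarray, wq: np.ndarray, wk: np.ndarray, wv: np.ndarray) -> np.ndarray:
    """Single-head, unmasked scaled dot-product self-attention."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


rng = np.random.default_rng(0)
seq_len, d_model = 6, 8
x = rng.normal(size=(seq_len, d_model))            # token embeddings, no PE
wq, wk, wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

perm = np.roll(np.arange(seq_len), 1)              # cyclic shift, never the identity

# Without PE: permuting the input rows permutes the output rows identically.
out = self_attention(x, wq, wk, wv)
out_permuted_input = self_attention(x[perm], wq, wk, wv)
equivariant = np.allclose(out_permuted_input, out[perm])

# With any fixed per-position table added, the same tokens in a different
# order no longer produce a merely reordered output.
pe = rng.normal(size=(seq_len, d_model))
out_pe = self_attention(x + pe, wq, wk, wv)
out_pe_permuted = self_attention(x[perm] + pe, wq, wk, wv)
broken = not np.allclose(out_pe_permuted, out_pe[perm])

print(equivariant, broken)
```

Note that a random table breaks equivariance just as well as a sinusoidal one. What the shuffle removes is not the symmetry breaking itself but the neighbor structure of the encoding.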

Predictions

Train loss drops slowly as the model memorizes the train set.
Val loss stays near log(V). Train and val see the same shuffle, but the model has learned f(token_3, pe_17) rather than f(token_3, position_3), and the scrambled table gives it no neighbor structure to generalize from.
If you re-shuffle the PE between training and eval (a separate bug), it gets even worse: the same position then receives a different PE vector at eval time.
Compared to "no PE at all", the broken model is at least as bad, and sometimes worse.

Write predictions in learners/borja/phase-16/notes/breaks.md before running.

Observe

just exp 16-train --tag broken-shuffled-pe

Diagnostics:

Plot train and val loss. Train should slowly memorize; val should plateau.
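Compute attention matrices from the trained model on a held-out sentence and compare them to the unmodified model's. The broken model's attention rows should look close to random.
Compute the cosine similarity cos(PE[t], PE[t+1]). For sinusoidal PE this is close to 1 (neighbors are similar). For shuffled PE it drops sharply, though typically not all the way to 0, because the lowest-frequency dimensions barely change across the sequence.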
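
The cosine-similarity diagnostic can be sketched as follows. This is a standalone illustration, not the project's tooling: `build_sinusoidal_pe` and `mean_neighbor_cosine` are hypothetical helpers, and `max_len = 512`, `d_model = 64` are assumed values chosen only for the demo.

```python
import numpy as np


def build_sinusoidal_pe(max_len: int, d_model: int) -> np.ndarray:
    """Hypothetical stand-in for _build_pe: sin/cos table, base 10000."""
    positions = np.arange(max_len)[:, None]
    freqs = 10000.0 ** (-np.arange(0, d_model, 2) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(positions * freqs)
    pe[:, 1::2] = np.cos(positions * freqs)
    return pe


def mean_neighbor_cosine(table: np.ndarray, k: int = 1) -> float:
    """Mean cosine similarity between row t and row t + k, over all valid t."""
    a, b = table[:-k], table[k:]
    dots = (a * b).sum(axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float((dots / norms).mean())


max_len, d_model = 512, 64   # assumed sizes for the demo
pe = build_sinusoidal_pe(max_len, d_model)
shuffled = pe[np.random.default_rng(0).permutation(max_len)]

clean = mean_neighbor_cosine(pe)
broken = mean_neighbor_cosine(shuffled)
print(f"clean: {clean:.3f}  shuffled: {broken:.3f}")
```

Expect the shuffled value to land far below the clean one. How close it gets to 0 depends on `max_len` and `d_model`, since the lowest-frequency dimensions change little across the sequence.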

Symptom Borja will see

Train loss decreases (memorization).
Val loss stays at or close to log(V) = log(512) ≈ 6.24 (natural log).
Train/val gap is huge.
cos(PE[t], PE[t+1]) is far lower than in the unbroken case, where it is close to 1.
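
The log(V) baseline is the cross-entropy of a model that assigns equal probability to every token. A quick check of the arithmetic, assuming the vocabulary size of 512 quoted above and a loss measured in nats:

```python
import math

vocab_size = 512                        # V, as quoted in the symptom list
uniform_prob = 1.0 / vocab_size         # probability of any token under a uniform guess
uniform_loss = -math.log(uniform_prob)  # cross-entropy in nats, equal to log(V)

print(round(uniform_loss, 2))  # 6.24
```

A val loss hovering around this value means the model predicts held-out tokens no better than a uniform guess.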

Hidden cause (one sentence)

pe = pe[perm] shuffled the positional encoding rows, destroying the smooth phase relationship that lets sinusoidal PE encode relative position.

Hint cascade

What property does the sinusoidal PE encode that makes nearby positions have similar vectors? Print cos(PE[t], PE[t+k]) for k = 1, 2, 5, 10.
The model can memorize the train set but not generalize. What does this tell you about the PE's globally meaningful structure?
Look at SinusoidalPositionalEncoding.__init__. Is there any extra processing after _build_pe that you wouldn't expect?

Fix diff

# Remove the shuffle: delete the rng, perm, and pe = pe[perm] lines so that
# row t of pe encodes position t again.
self.pe = Tensor(pe, requires_grad=False)

Why this teaches the concept

Sinusoidal PE works because of the smooth phase relationship between adjacent positions — Vaswani et al. 2017 §3.5 motivates it precisely so that "for any fixed offset k, PE[t+k] can be represented as a linear function of PE[t]". Shuffling breaks the offset-as-linear-function property. The model can still learn position-specific facts, but it cannot learn the relational structure that makes "the word before was" a meaningful query. RoPE (theory §03) has the same property by construction — and that's why theory §04 predicts RoPE wins on extrapolation. This /break makes the property concrete.
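
The offset-as-linear-function property can be made concrete. For each frequency, the (sin, cos) pair at position t + k is a rotation of the pair at position t, so one block-diagonal matrix maps every PE[t] to PE[t + k]. The sketch below assumes the interleaved sin/cos layout with base 10000; `build_sinusoidal_pe` and `offset_matrix` are hypothetical helpers written for this illustration.

```python
import numpy as np


def build_sinusoidal_pe(max_len: int, d_model: int) -> np.ndarray:
    """Hypothetical stand-in for _build_pe: sin/cos table, base 10000."""
    positions = np.arange(max_len)[:, None]
    freqs = 10000.0 ** (-np.arange(0, d_model, 2) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(positions * freqs)
    pe[:, 1::2] = np.cos(positions * freqs)
    return pe


def offset_matrix(k: int, d_model: int) -> np.ndarray:
    """Block-diagonal M_k such that PE[t + k] = M_k @ PE[t] for every t."""
    freqs = 10000.0 ** (-np.arange(0, d_model, 2) / d_model)
    m = np.zeros((d_model, d_model))
    for i, w in enumerate(freqs):
        c, s = np.cos(w * k), np.sin(w * k)
        # Angle-addition identities for the (sin, cos) pair at frequency w.
        m[2 * i : 2 * i + 2, 2 * i : 2 * i + 2] = [[c, s], [-s, c]]
    return m


max_len, d_model, k = 64, 16, 5
pe = build_sinusoidal_pe(max_len, d_model)
shuffled = pe[np.random.default_rng(0).permutation(max_len)]
m_k = offset_matrix(k, d_model)

# Unshuffled: the single matrix M_k maps every row t to row t + k.
holds_clean = np.allclose(pe[:-k] @ m_k.T, pe[k:])
# Shuffled: the same matrix no longer relates row t to row t + k.
holds_shuffled = np.allclose(shuffled[:-k] @ m_k.T, shuffled[k:])

print(holds_clean, holds_shuffled)
```

Note that `m_k` depends only on the offset k, never on t. That position-independence is the relational structure the shuffle removes.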

Next: Phase 17's /break on a missing causal mask.