Break 00 — Shuffle the positional encodings across positions¶
We take the sinusoidal positional encodings and shuffle the rows. PE[t] no longer corresponds to position t but to a random position. Attention still works, but the model receives permuted positional information. Prediction: the loss stays at log(V) (the model learns nothing), because position is now noise. Comparing against "no PE" shows that some positional information, even if inconsistent, is structurally different from none.
Anchors: LYNX_CORTEX.md §4 / PHASE 16; theory §00 permutation-equivariance; theory §01 sinusoidal; .claude/commands/break.md.
The break¶
In src/minimodel/nn/positional.py:
class SinusoidalPositionalEncoding(Module):
    def __init__(self, max_len: int, d_model: int, seed: int = 0) -> None:
        super().__init__()
        pe = self._build_pe(max_len, d_model)  # shape (max_len, d_model)
        # BUG: shuffle the rows of pe.
        rng = np.random.default_rng(seed)
        perm = rng.permutation(max_len)
        pe = pe[perm]  # permute rows
        self.pe = Tensor(pe, requires_grad=False)
The PE values are still valid sinusoidal vectors; they're just attached to the wrong positions. The pattern at t = 3 might be the pattern that should be at t = 17.
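To see "valid vectors, wrong positions" concretely, here is a minimal NumPy sketch. `build_pe` is a hypothetical stand-in for `_build_pe`, whose body is not shown here; it assumes the standard construction (base 10000, sin on even columns, cos on odd columns, even `d_model`).

```python
import numpy as np

def build_pe(max_len: int, d_model: int) -> np.ndarray:
    """Assumed standard sinusoidal table: sin on even columns, cos on odd."""
    pos = np.arange(max_len)[:, None]                          # (max_len, 1)
    freq = 10000.0 ** (-np.arange(0, d_model, 2) / d_model)    # (d_model/2,)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(pos * freq)
    pe[:, 1::2] = np.cos(pos * freq)
    return pe

max_len, d_model = 32, 16
pe = build_pe(max_len, d_model)
perm = np.random.default_rng(0).permutation(max_len)
shuffled = pe[perm]                                            # same line as the bug

# Row t of the shuffled table is the clean encoding of position perm[t].
rows_match = all(np.array_equal(shuffled[t], pe[perm[t]]) for t in range(max_len))

# Nothing was lost: undoing the permutation recovers the clean table.
inverse = np.argsort(perm)
recovered = np.array_equal(shuffled[inverse], pe)

displaced = int(np.sum(perm != np.arange(max_len)))
print(rows_match, recovered, f"{displaced} of {max_len} rows are misplaced")
```

The table still contains exactly the same vectors; only the mapping from row index to position has changed.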
Predict, then run¶
The model relies on PE to break attention's permutation-equivariance (theory §00). Sinusoidal PE encodes position via a specific phase relationship — PE[t] and PE[t+1] are "near" each other in cosine similarity. Shuffling destroys this neighbor-relation.
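The permutation-equivariance that PE is meant to break can be checked directly. The sketch below assumes a single-head, unmasked attention layer written in plain NumPy; the function and variable names are illustrative and not taken from the minimodel code base.

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head, unmasked scaled dot-product self-attention."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8
x = rng.normal(size=(seq_len, d_model))                    # token embeddings, no PE
wq, wk, wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
perm = np.roll(np.arange(seq_len), 1)                      # a fixed non-identity reordering

# Without PE: reordering the inputs only reorders the outputs.
out = self_attention(x, wq, wk, wv)
out_perm = self_attention(x[perm], wq, wk, wv)
equivariant = np.allclose(out_perm, out[perm])

# With any per-position signal added, the same reordering changes the result.
pos = rng.normal(size=(seq_len, d_model))                  # stand-in position signal
out_pos = self_attention(x + pos, wq, wk, wv)
out_pos_perm = self_attention(x[perm] + pos, wq, wk, wv)
still_equivariant = np.allclose(out_pos_perm, out_pos[perm])

print(equivariant, still_equivariant)
```

Any position-dependent signal breaks the symmetry, including a shuffled one, which is why the break is about which structure the signal carries rather than whether attention can tell positions apart.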
The attention then has positional information that is:
- Per-position consistent (within a single forward pass, PE[3] is always the same vector).
- Globally meaningless (position 3's vector might be similar to position 17's, not position 2 or 4).
This means:
- Identical sentences get identical embeddings (so the model can technically memorize them).
- But the model can't learn that "the next word depends on the position just before it" — because "just before" is no longer detectable from the PE.
Predictions¶
- Train loss drops slowly as the model memorizes the train set.
- Val loss stays near log(V) because the shuffled PE doesn't generalize: train and val examples see the same shuffling, but the model has learned f(token_3, pe_17) instead of f(token_3, position_3).
- If you re-shuffle the PE between training and eval (a separate bug), it gets even worse: the same positions now receive different PE values.
- Compared to "no PE at all": the broken model is at least as bad. Sometimes worse.
Write predictions in learners/borja/phase-16/notes/breaks.md before running.
Observe¶
Diagnostics:
- Plot train and val loss. Train should slowly memorize; val should plateau.
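- Compute attention matrices for the trained model on a held-out sentence and compare them to the unmodified model's: the broken model's attention rows look random.
- Compute the cosine similarity cos(PE[t], PE[t+1]). For sinusoidal PE this is close to 1 (neighbors are similar). For shuffled PE it drops well below that and jumps erratically from one t to the next. It need not reach exactly 0, because in the standard construction the lowest-frequency dimensions barely change across positions.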
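The cosine-similarity diagnostic can be sketched as follows. `build_pe` is again a hypothetical stand-in for `_build_pe` (standard base-10000 construction assumed), and the sizes are arbitrary choices for illustration.

```python
import numpy as np

def build_pe(max_len: int, d_model: int) -> np.ndarray:
    """Assumed standard sinusoidal table: sin on even columns, cos on odd."""
    pos = np.arange(max_len)[:, None]
    freq = 10000.0 ** (-np.arange(0, d_model, 2) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(pos * freq)
    pe[:, 1::2] = np.cos(pos * freq)
    return pe

def mean_offset_cosine(pe: np.ndarray, k: int) -> float:
    """Mean cosine similarity between row t and row t+k over all valid t."""
    a, b = pe[:-k], pe[k:]
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float(np.mean(num / den))

max_len, d_model = 128, 64
clean = build_pe(max_len, d_model)
shuffled = clean[np.random.default_rng(0).permutation(max_len)]

for k in (1, 2, 5, 10):
    print(f"k={k:2d}  clean={mean_offset_cosine(clean, k):.3f}  "
          f"shuffled={mean_offset_cosine(shuffled, k):.3f}")

clean_k1 = mean_offset_cosine(clean, 1)
shuffled_k1 = mean_offset_cosine(shuffled, 1)
```

In the clean table the similarity depends only on the offset k; in the shuffled table it no longer tracks k at all, which is the signature to look for.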
Symptom Borja will see¶
- Train loss decreases (memorization).
- Val loss stays at log(V) = log(512) ≈ 6.24 or close.
- Train/val gap is huge.
- cos(PE[t], PE[t+1]) is far lower and erratic (it was close to 1 in the unbroken case).
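The log(V) baseline in the list above is the cross-entropy of a uniform guess over the vocabulary. A quick check, taking V = 512 from the text and using the natural logarithm:

```python
import math
import numpy as np

V = 512                                   # vocabulary size quoted in the text
uniform_loss = math.log(V)                # natural log, in nats

# Cross-entropy of a uniform prediction is the same for every target token.
probs = np.full(V, 1.0 / V)
target = 123                              # arbitrary token id, for illustration
cross_entropy = -math.log(probs[target])

print(round(uniform_loss, 2))             # 6.24
```

A val loss that sits at this value means the model is doing no better on held-out text than guessing uniformly.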
Hidden cause (one sentence)¶
pe = pe[perm] shuffled the positional encoding rows, destroying the smooth phase relationship that lets sinusoidal PE encode relative position.
Hint cascade¶
- What property does the sinusoidal PE encode that makes nearby positions have similar vectors? Print cos(PE[t], PE[t+k]) for k = 1, 2, 5, 10.
- The model can memorize the train set but not generalize. What does this tell you about the PE's globally meaningful structure?
- Look at SinusoidalPositionalEncoding.__init__. Is there any extra processing after _build_pe that you wouldn't expect?
Fix diff¶
Remove the shuffle so that row t of pe is again the encoding for position t:
         pe = self._build_pe(max_len, d_model)  # shape (max_len, d_model)
-        # BUG: shuffle the rows of pe.
-        rng = np.random.default_rng(seed)
-        perm = rng.permutation(max_len)
-        pe = pe[perm]  # permute rows
         self.pe = Tensor(pe, requires_grad=False)
Why this teaches the concept¶
Sinusoidal PE works because of the smooth phase relationship between adjacent positions — Vaswani et al. 2017 §3.5 motivates it precisely so that "for any fixed offset k, PE[t+k] can be represented as a linear function of PE[t]". Shuffling breaks the offset-as-linear-function property. The model can still learn position-specific facts, but it cannot learn the relational structure that makes "the word before was" a meaningful query. RoPE (theory §03) has the same property by construction — and that's why theory §04 predicts RoPE wins on extrapolation. This /break makes the property concrete.
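The offset-as-linear-function property can be verified numerically. The sketch below assumes the standard interleaved sin/cos layout; `build_pe` and `offset_matrix` are hypothetical helpers written for this illustration. For each frequency w, the pair (sin(wt), cos(wt)) is rotated by the angle wk, so one block-diagonal matrix maps PE[t] to PE[t+k] for every t.

```python
import numpy as np

def build_pe(max_len: int, d_model: int) -> np.ndarray:
    """Assumed standard sinusoidal table: sin on even columns, cos on odd."""
    pos = np.arange(max_len)[:, None]
    freq = 10000.0 ** (-np.arange(0, d_model, 2) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(pos * freq)
    pe[:, 1::2] = np.cos(pos * freq)
    return pe

def offset_matrix(k: int, d_model: int) -> np.ndarray:
    """Block-diagonal matrix M_k with PE[t+k] = M_k @ PE[t] for the clean table."""
    freq = 10000.0 ** (-np.arange(0, d_model, 2) / d_model)
    m = np.zeros((d_model, d_model))
    for i, w in enumerate(freq):
        c, s = np.cos(w * k), np.sin(w * k)
        # sin(w(t+k)) =  cos(wk) sin(wt) + sin(wk) cos(wt)
        # cos(w(t+k)) = -sin(wk) sin(wt) + cos(wk) cos(wt)
        m[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, s], [-s, c]]
    return m

max_len, d_model, k = 64, 16, 3
clean = build_pe(max_len, d_model)
shuffled = clean[np.random.default_rng(0).permutation(max_len)]
m_k = offset_matrix(k, d_model)

# Largest deviation between the true row t+k and the row predicted from row t.
clean_err = float(np.max(np.abs(clean[k:] - clean[:-k] @ m_k.T)))
shuffled_err = float(np.max(np.abs(shuffled[k:] - shuffled[:-k] @ m_k.T)))
print(f"clean error: {clean_err:.2e}   shuffled error: {shuffled_err:.2e}")
```

The clean error is at floating-point noise level, while the shuffled table has no single matrix that relates row t to row t+k.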
Next: Phase 17's /break on missing causal mask.