04 - Head-to-head: sinusoidal vs RoPE on §A13 (the experiment to run)
We design (without running) the experiment that compares sinusoidal vs RoPE on §A13. Null hypothesis: at sequence length ≤ 5 (which is all §A13 has), the two perform the same. Alternative hypothesis: RoPE generalizes better to slightly longer contexts (synthesized lengths 10–20). This page covers the setup, the metrics, the predictions, and the rationale for expecting an outcome in line with what is reported for LLaMA.
Anchors:
LYNX_CORTEX.md§4 / PHASE 16; theory §01 sinusoidal; theory §03 RoPE; lab §03 extrapolation compare.
Why this experiment exists
Su et al. 2021 ("RoFormer", introducing RoPE) report improvements over baseline position encodings, notably on long-text tasks, and Touvron et al. 2023 (LLaMA-1) adopt RoPE in place of absolute positional embeddings. The §A13 corpus has short sequences (mean length ~5 tokens), so we are not expecting a big win, but the experiment is still worth running to:
- Confirm short-context parity holds in our microscopic regime.
- Set up a baseline for any future scale-up (Phase 17 mini-GPT, Phase 30 portal).
- Practice the experimental hygiene we'll need for any architecture comparison.
Setup
Shared backbone
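mini-GPT from Phase 17, with the architecture held fixed; only the positional encoding changes between the two runs:

    from typing import Literal

    # Module is the project's own base class (its import path is not shown on this page).
    class MiniGPT(Module):
        def __init__(self, vocab_size: int = 512, d_model: int = 64,
                     n_layers: int = 2, n_heads: int = 4, max_len: int = 32,
                     pe_kind: Literal["sinusoidal", "rope"] = "sinusoidal"):
            ...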
Parameter count: ~50K. Phase 17 §03 computes it.
Two variants
| Variant | PE applied at | Formula |
|---|---|---|
| sinusoidal | input embeddings | x_t += PE(t) where PE(t, 2i) = sin(t/10000^(2i/d)) |
| RoPE | inside attention on Q,K | Q_t' = R_θ(t) · Q_t where R_θ(t) is per-pair rotation |
Both have zero learnable PE parameters — a deliberate choice for the comparison (learned PE would confound).
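A minimal sketch of the two variants in the table, assuming plain Python lists rather than the project's tensor type. The names `sinusoidal_pe`, `add_sinusoidal`, and `apply_rope` are illustrative, not the Phase 16 lab API. The cosine term on odd dimensions follows the standard sinusoidal definition, and rotating adjacent dimension pairs (2i, 2i+1) is one common convention; the lab implementation may pair dimensions differently. Neither function holds any learnable state.

```python
import math


def sinusoidal_pe(t: int, d_model: int, base: float = 10000.0) -> list[float]:
    """PE(t, 2i) = sin(t / base^(2i/d)), PE(t, 2i+1) = cos(t / base^(2i/d))."""
    pe = []
    for i in range(d_model // 2):
        angle = t / base ** (2 * i / d_model)
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe


def add_sinusoidal(x: list[float], t: int) -> list[float]:
    """Additive variant: x_t + PE(t), applied once to the input embedding."""
    return [xi + pi for xi, pi in zip(x, sinusoidal_pe(t, len(x)))]


def apply_rope(v: list[float], t: int, base: float = 10000.0) -> list[float]:
    """Multiplicative variant: rotate each (2i, 2i+1) pair of v by angle t * theta_i."""
    d = len(v)
    out = []
    for i in range(d // 2):
        theta = t * base ** (-2 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = v[2 * i], v[2 * i + 1]
        out.extend([a * c - b * s, a * s + b * c])  # 2-D rotation of (a, b)
    return out


x = [0.5, -1.0, 0.25, 2.0]      # toy 4-dimensional vector
x_sin = add_sinusoidal(x, t=3)  # position information is added
q_rot = apply_rope(x, t=3)      # position information is a rotation; the norm is unchanged
```

In a real model `apply_rope` would be applied per head to Q and K inside every attention layer (so `len(v)` is the head dimension), while `add_sinusoidal` is applied once to the input embeddings.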
Corpus splits
- Train: §A13 corpus, 240 examples, max length 5.
- Val (in-distribution): §A13 held-out, max length 5.
- Val (out-of-distribution length): synthetic concatenations of §A13 sentences, lengths 10 / 15 / 20. Hand-curated so the conjugations are still §A13-valid.
The OOD-length val set is the critical one. It tests extrapolation — RoPE's claimed strength.
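One way to generate candidates for the OOD-length set before hand-curation, sketched under the assumption that each sentence is available as a list of tokens. `concat_to_length` is a hypothetical helper, and the toy sentences are placeholders, not §A13 data. Checking that the conjugations remain §A13-valid stays a manual step.

```python
import random


def concat_to_length(sentences, target_len, n_samples, seed=0, max_attempts=10_000):
    """Sample concatenations of whole sentences with exactly target_len tokens.

    Rejection sampling: keep appending random sentences until the next one
    would overshoot, and keep the result only if it lands on target_len.
    """
    rng = random.Random(seed)  # seeded so the candidate set is reproducible
    samples = []
    for _ in range(max_attempts):
        if len(samples) == n_samples:
            break
        tokens = []
        while len(tokens) < target_len:
            sentence = rng.choice(sentences)
            if len(tokens) + len(sentence) > target_len:
                break  # would overshoot; this attempt is discarded below
            tokens.extend(sentence)
        if len(tokens) == target_len:
            samples.append(tokens)
    return samples


# Placeholder sentences (NOT real §A13 data), lengths 3 to 5.
toy_sentences = [
    ["w1", "w2", "w3"],
    ["w4", "w5", "w6", "w7"],
    ["w8", "w9", "w10", "w11", "w12"],
]

# Candidate pools for the three OOD lengths; each pool is then curated by hand.
ood_candidates = {T: concat_to_length(toy_sentences, T, n_samples=5) for T in (10, 15, 20)}
```

Concatenating whole sentences (rather than truncating or padding) keeps every token inside a sentence the model could have seen, so the only new factor is position.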
Predictions (write them down BEFORE running)
| Metric | sinusoidal | RoPE | Δ (RoPE - sinusoidal) |
|---|---|---|---|
| Train loss (final) | 0.05 | 0.05 | ~0 |
| Val loss (in-dist, T≤5) | 0.10 | 0.10 | ~0 |
| Val loss (OOD, T=10) | 0.35 | 0.20 | -0.15 |
| Val loss (OOD, T=15) | 0.80 | 0.30 | -0.50 |
| Val loss (OOD, T=20) | 1.50 | 0.45 | -1.05 |
| Conjugation accuracy at T=20 | 65% | 89% | +24pp |
The big gap at T=20 is the headline.
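Writing the predictions down as data makes them hard to adjust after the fact. The sketch below encodes the predicted numbers from this page and recomputes the Δ column; the metric keys and the `reconcile` helper are hypothetical names for illustration, and no observed values are included because the experiment has not been run.

```python
# Predicted values from this page: metric -> (sinusoidal, RoPE).
PREDICTIONS = {
    "train_loss": (0.05, 0.05),
    "val_loss_T5": (0.10, 0.10),
    "val_loss_T10": (0.35, 0.20),
    "val_loss_T15": (0.80, 0.30),
    "val_loss_T20": (1.50, 0.45),
    "conj_acc_T20": (0.65, 0.89),
}


def deltas(table):
    """Delta column: RoPE minus sinusoidal, rounded to two decimals."""
    return {name: round(rope - sin, 2) for name, (sin, rope) in table.items()}


def reconcile(predicted, observed):
    """Signed error (observed - predicted) per metric and variant.

    `observed` must use the same keys and (sinusoidal, RoPE) ordering.
    """
    return {
        name: (
            round(observed[name][0] - predicted[name][0], 4),
            round(observed[name][1] - predicted[name][1], 4),
        )
        for name in predicted
    }


predicted_deltas = deltas(PREDICTIONS)
```

After the run, `reconcile(PREDICTIONS, observed)` would give the per-metric miss for each variant, which is the material for the reflection step.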
Why we expect this
- Sinusoidal PE is added to the input embeddings. At T > max_train_len, the PE vectors are ones the model has never seen (each component is still bounded in [-1, 1], but the combination at those positions is new), so the model has no learned prior over those positions.
- RoPE is a multiplicative rotation in QK space. Its math is relative: the dot product between R_θ(t) q and R_θ(s) k depends only on t - s, not on t or s separately. At T = 20, every pair of nearby tokens (|t - s| ≤ 4) gets exactly the rotation it got in training, wherever the pair sits in the sequence. Only the long-range pairs are new (offsets up to 19, against a maximum of 4 at training length 5), and those are still clean rotations.
This is the formal argument in Su et al. 2021 §3.4: RoPE makes the attention score a function of relative position, whereas sinusoidal PE added at the input encodes absolute position.
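The relative-position property can be checked numerically. The sketch below is a simplification: it scores raw vectors directly, with no learned Q/K projections, and all function names are illustrative. It compares the same offset (t - s = 2) at positions inside the training window and at positions far beyond it, for RoPE and for additive sinusoidal PE.

```python
import math


def rope(v, t, base=10000.0):
    """Rotate each (2i, 2i+1) pair of v by angle t * base^(-2i/d)."""
    d = len(v)
    out = []
    for i in range(d // 2):
        theta = t * base ** (-2 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = v[2 * i], v[2 * i + 1]
        out += [a * c - b * s, a * s + b * c]
    return out


def sin_pe(t, d, base=10000.0):
    """Standard sinusoidal vector: sin on even dims, cos on odd dims."""
    pe = []
    for i in range(d // 2):
        angle = t / base ** (2 * i / d)
        pe += [math.sin(angle), math.cos(angle)]
    return pe


def rope_score(q, k, t, s):
    """Dot product of the rotated query (position t) and rotated key (position s)."""
    return sum(a * b for a, b in zip(rope(q, t), rope(k, s)))


def additive_score(q, k, t, s):
    """Dot product after adding sinusoidal PE to both vectors (no projections)."""
    qt = [a + p for a, p in zip(q, sin_pe(t, len(q)))]
    ks = [b + p for b, p in zip(k, sin_pe(s, len(k)))]
    return sum(a * b for a, b in zip(qt, ks))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset t - s = 2, once inside the training window and once far outside it.
rope_gap = abs(rope_score(q, k, 3, 1) - rope_score(q, k, 19, 17))
additive_gap = abs(additive_score(q, k, 3, 1) - additive_score(q, k, 19, 17))
```

`rope_gap` is zero up to floating-point error, while `additive_gap` is not: the additive score contains cross terms such as q · PE(s) that depend on the absolute position.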
What the curves look like
Hand-drawn ASCII sketch of the predicted curves (real plot in Phase 16 lab 03):

    val loss
      ^
    1.5+            sinusoidal o
       |                   o
    1.0+               o
       |           o
    0.5+       o
       |   o
       |o     RoPE: ━━━━━ (stays low: ~0.1 at T ≤ 5, rising to ~0.45 at T = 20)
       +────────────────────────> sequence length T
       0     5     10    15    20

The two curves are predicted to overlap at T ≤ 5 (the training distribution), then diverge.
Metrics
- Val cross-entropy loss at each T ∈ {5, 10, 15, 20}.
- Conjugation accuracy: does the model produce the correct tense? Phase 20 (eval harness) defines it formally.
- Attention pattern visualization: sample a few sentences at T = 20 and plot the attention matrix per layer. RoPE's matrix should look "diagonal-band-like" (relative-position structure). Sinusoidal's matrix at T = 20 should look erratic (the model has no idea what to do).
- Compute cost: RoPE adds a per-attention-layer cost; quantify it in FLOPs.
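Two of these metrics reduce to simple arithmetic, sketched below under stated assumptions. `mean_loss_by_length` assumes the eval loop yields, per example, the probabilities the model assigned to the correct tokens; `rope_flops` assumes the sin/cos tables are precomputed and counts 4 multiplies plus 2 adds per rotated pair, applied to both Q and K in every layer. All names are illustrative, not the Phase 20 harness API.

```python
import math
from collections import defaultdict


def mean_loss_by_length(examples):
    """Mean per-token cross-entropy (nats), grouped by sequence length T.

    examples: iterable of (T, probs), where probs holds the probability the
    model assigned to each correct token of one example.
    """
    nll_sum = defaultdict(float)
    token_count = defaultdict(int)
    for T, probs in examples:
        nll_sum[T] += -sum(math.log(p) for p in probs)
        token_count[T] += len(probs)
    return {T: nll_sum[T] / token_count[T] for T in sorted(nll_sum)}


def rope_flops(T, d_model, n_layers):
    """Extra FLOPs per sequence for RoPE: 6 per 2-D pair, on Q and on K, per layer."""
    pairs_per_token = d_model // 2  # summed over all heads
    return 2 * T * pairs_per_token * 6 * n_layers


def score_flops(T, d_model, n_layers):
    """FLOPs of the Q·K^T score computation alone, for scale (multiply + add)."""
    return 2 * T * T * d_model * n_layers


extra = rope_flops(T=20, d_model=64, n_layers=2)      # 15360
baseline = score_flops(T=20, d_model=64, n_layers=2)  # 102400
```

With the mini-GPT defaults at T = 20, the rotation adds 15,360 FLOPs per sequence against 102,400 for the attention scores alone, so the cost side of the comparison is small before even counting the projection and MLP layers.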
What this doesn't prove
- That RoPE is strictly better in all regimes. Phase 36 (frontier architectures) covers ALiBi (which biases attention scores by distance instead of using position embeddings) and shows RoPE wins on some tasks and loses on others.
- That the §A13 numbers transfer to large models. They don't directly: Press et al. 2022 ("Train Short, Test Long") report that both sinusoidal and rotary embeddings degrade when evaluated well beyond the training length.
For the specific question "should mini-GPT use RoPE or sinusoidal", the experiment will give an answer. The prediction is that RoPE is a small-cost, mid-benefit choice with no expected downside at §A13 scale.
When not to extrapolate from this
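- A task with no length variation at inference (e.g., always exactly 5 tokens) shouldn't care which PE is used; pick whichever is simpler.
- A task with sequences far longer than the 4× train length tested here (T = 20 against a train length of 5): do not assume either PE holds up. Length-extrapolation methods such as ALiBi or YaRN (Phase 36) target that regime.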
Citations
- Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., Liu, Y. 2021. "RoFormer: Enhanced Transformer with Rotary Position Embedding." arXiv:2104.09864. §3 derives RoPE's relative-position property.
- Press, O., Smith, N., Lewis, M. 2022. "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation" (ALiBi). arXiv:2108.12409.
- Touvron, H. et al. 2023. "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971. §2 adopts RoPE in place of absolute positional embeddings.
Reading + design exercise (no run required)
- Implement RoPE in src/minimodel/nn/rope.py (Phase 16 lab 02).
- Wire pe_kind="rope" into the mini-GPT constructor (Phase 17, but the blueprint is drafted in this phase).
- Write the experiment spec following this page's template (use experiments/16-pe-compare/spec.md).
- Predict the curves before running. Borja's notes go in learners/borja/phase-16/notes/predictions.md.
- Run after Phase 17 closes; reconcile predictions vs reality in reflections.md.
One-paragraph recap
The §A13 corpus is too short for the sinusoidal-vs-RoPE comparison to favor RoPE on the in-distribution val set. On the synthetic length-extrapolated val sets (T=10, 15, 20), we predict that RoPE's relative-position math gives it a clear win: about 24 percentage points of conjugation accuracy at T=20. Why: the sinusoidal PE vectors at positions beyond the training length are ones the model never saw during training, whereas RoPE's dot product depends only on the difference t - s, so nearby token pairs look the same at any absolute position. This is in line with what Su et al. 2021 report and with LLaMA's choice of RoPE, scaled down to §A13. Phase 16 lab 03 runs the experiment.
Prev: 03-rope.md
Next: Phase 17 (mini-GPT).