
04 - Head-to-head: sinusoidal vs RoPE on §A13 (the experiment to run)

We design (without running) the experiment that compares sinusoidal vs RoPE on §A13. The null hypothesis: at sequence length ≤ 5 (which is what §A13 has), the two perform the same. The alternative hypothesis: RoPE generalizes better to slightly longer contexts (synthesized lengths 10–20). This page covers the setup, the metrics, the predictions, and the rationale for expecting what is reported in LLaMA.

Anchors: LYNX_CORTEX.md §4 / PHASE 16; theory §01 sinusoidal; theory §03 RoPE; lab §03 extrapolation compare.

Why this experiment exists

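Su et al. 2021 ("RoFormer", introducing RoPE) report gains for RoPE over absolute position encodings on long-text benchmarks, and Touvron et al. 2023 (LLaMA-1) adopt RoPE in place of absolute positional embeddings. The working expectation we carry over is that RoPE outperforms sinusoidal at long context and matches it at short context. The §A13 corpus has short sequences (at most 5 tokens), so we are not expecting a big in-distribution win, but we are running the experiment to: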

Confirm short-context parity holds in our microscopic regime.
Set up a baseline for any future scale-up (Phase 17 mini-GPT, Phase 30 portal).
Practice the experimental hygiene we'll need for any architecture comparison.

Setup

Shared backbone

mini-GPT from Phase 17, with the architecture and hyperparameters held identical across the two variants; only the positional encoding changes:

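from typing import Literal

# Module is the project's own base class (assumed to live under minimodel.nn).
class MiniGPT(Module):
    def __init__(self, vocab_size: int = 512, d_model: int = 64,
                 n_layers: int = 2, n_heads: int = 4, max_len: int = 32,
                 pe_kind: Literal["sinusoidal", "rope"] = "sinusoidal"):
        # pe_kind is the only argument that differs between the two variants.
        ...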

Parameter count: ~50K. Phase 17 §03 computes it.

Two variants

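| Variant | PE applied at | Formula |
|---|---|---|
| sinusoidal | input embeddings | `x_t += PE(t)` where `PE(t, 2i) = sin(t/10000^(2i/d))` and `PE(t, 2i+1) = cos(t/10000^(2i/d))` |
| RoPE | inside attention on Q, K | `Q_t' = R_θ(t) · Q_t` and `K_t' = R_θ(t) · K_t`, where `R_θ(t)` is a per-pair rotation |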

Both have zero learnable PE parameters — a deliberate choice for the comparison (learned PE would confound).
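To make the two formulas concrete, here is a minimal pure-Python sketch of both encodings. It is not the Phase 16 lab code: the function names `sinusoidal_pe` and `rope_rotate` are made up for this page, and the RoPE version assumes the consecutive-pair convention (dimensions 2i and 2i+1 rotate together) with frequencies `10000^(-2i/d)`. Neither function holds any learnable state.

```python
import math

def sinusoidal_pe(t: int, d_model: int, base: float = 10000.0) -> list[float]:
    """Absolute PE vector for position t: sin on even dims, cos on odd dims."""
    pe = []
    for i in range(d_model // 2):
        angle = t / base ** (2 * i / d_model)
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

def rope_rotate(vec: list[float], t: int, base: float = 10000.0) -> list[float]:
    """Rotate each pair (vec[2i], vec[2i+1]) by the angle t * theta_i."""
    d = len(vec)
    out = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)  # per-pair frequency (assumed convention)
        c, s = math.cos(t * theta), math.sin(t * theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

# Toy 4-dimensional vector standing in for an embedding or a query.
embedding = [0.5, -0.25, 1.0, 0.0]
x_t = [e + p for e, p in zip(embedding, sinusoidal_pe(3, 4))]  # additive, at the input
q_t = rope_rotate(embedding, 3)                                # multiplicative, on Q (and K)
```

The design difference shows up in the last two lines: sinusoidal shifts the vector by a position-dependent offset, while RoPE only rotates it, so its norm is unchanged.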

Corpus splits

Train: §A13 corpus, 240 examples, max length 5.
Val (in-distribution): §A13 held-out, max length 5.
Val (out-of-distribution length): synthetic concatenations of §A13 sentences, lengths 10 / 15 / 20. Hand-curated so the conjugations are still §A13-valid.

The OOD-length val set is the critical one. It tests extrapolation — RoPE's claimed strength.
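A small helper can propose candidate concatenations at the three target lengths before the hand-curation pass. The sketch below is an assumption about how one might do it, not the project's tooling: `concat_to_length` is a hypothetical name, the toy sentences are placeholders rather than real §A13 data, and the output still needs a human check that the conjugations stay §A13-valid.

```python
import random

def concat_to_length(sentences, target_len, rng, max_tries=1000):
    """Concatenate randomly chosen tokenized sentences to exactly target_len tokens."""
    for _ in range(max_tries):
        out = []
        while len(out) < target_len:
            remaining = target_len - len(out)
            fitting = [s for s in sentences if 0 < len(s) <= remaining]
            if not fitting:
                break  # dead end (e.g. one token left, no one-token sentence): retry
            out.extend(rng.choice(fitting))
        if len(out) == target_len:
            return out
    raise ValueError(f"could not reach length {target_len}")

# Placeholder sentences (hypothetical tokens), each at most 5 tokens long.
toy_sentences = [["a1", "a2"], ["b1", "b2", "b3"], ["c1", "c2", "c3", "c4", "c5"]]
rng = random.Random(0)  # fixed seed so the candidate set is reproducible
ood_candidates = {T: concat_to_length(toy_sentences, T, rng) for T in (10, 15, 20)}
```

Seeding the generator matters here: the OOD set should be identical for both PE variants, otherwise the comparison picks up sampling noise.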

Predictions (write them down BEFORE running)

Metric                        | sinusoidal | RoPE      | Δ
------------------------------+------------+-----------+------
Train loss (final)            | 0.05       | 0.05      | ~0
Val loss (in-dist, T≤5)       | 0.10       | 0.10      | ~0
Val loss (OOD, T=10)          | 0.35       | 0.20      | -0.15
Val loss (OOD, T=15)          | 0.80       | 0.30      | -0.50
Val loss (OOD, T=20)          | 1.50       | 0.45      | -1.05
Conjugation accuracy at T=20  | 65%        | 89%       | +24pp

The big gap at T=20 is the headline.
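One way to keep the predictions honest is to store them as data before any run, so the later reconciliation is mechanical. The sketch below copies the numbers from the table; the metric keys and the `delta` and `reconcile` helpers are illustrative names, not part of the experiment spec. Δ is taken as RoPE minus sinusoidal, which matches the signs in the table.

```python
# Predicted (sinusoidal, RoPE) values, copied from the table above.
PREDICTIONS = {
    "train_loss": (0.05, 0.05),
    "val_loss_T5": (0.10, 0.10),
    "val_loss_T10": (0.35, 0.20),
    "val_loss_T15": (0.80, 0.30),
    "val_loss_T20": (1.50, 0.45),
    "conj_acc_T20_pct": (65.0, 89.0),
}

def delta(metric: str) -> float:
    """Predicted gap for one metric, as RoPE minus sinusoidal."""
    sinusoidal, rope = PREDICTIONS[metric]
    return round(rope - sinusoidal, 2)

def reconcile(measured: dict) -> dict:
    """Per-metric error (measured minus predicted) for each variant."""
    errors = {}
    for metric, (m_sin, m_rope) in measured.items():
        p_sin, p_rope = PREDICTIONS[metric]
        errors[metric] = (round(m_sin - p_sin, 2), round(m_rope - p_rope, 2))
    return errors

predicted_gaps = {metric: delta(metric) for metric in PREDICTIONS}
```

After the run, `reconcile` takes a dict of measured `(sinusoidal, RoPE)` pairs with the same keys and returns how far off each prediction was.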

Why we expect this

Sinusoidal PE is added to the input embeddings. At T > max_train_len, the PE vectors for those positions never appeared in training: each component still lies in [-1, 1], but the model has never seen these particular combinations and has no prior over those positions.
RoPE is a multiplicative rotation in QK space. Its math is relative: the dot product between R_θ(t) q and R_θ(s) k depends only on t - s, not on t or s separately. At T = 20 the largest offset grows from 4 (training, length 5) to 19, so the long-range offsets are new, but every short-range offset looks exactly as it did in training, and the larger ones are still clean rotations rather than unseen absolute positions.

This is the formal argument in Su et al. 2021 §3.4 — "RoPE encodes relative position", whereas sinusoidal encodes absolute position.
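The relative-position claim is easy to check numerically. The sketch below is a toy verification, assuming the consecutive-pair rotation with frequencies `10000^(-2i/d)`; `rope_rotate` and `dot` are helper names invented here, and `q` and `k` are arbitrary vectors. It compares the attention score for the same offset (t - s = 2) at two very different absolute positions.

```python
import math

def rope_rotate(vec, t, base=10000.0):
    """Rotate each pair (vec[2i], vec[2i+1]) by the angle t * theta_i."""
    d = len(vec)
    out = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)
        c, s = math.cos(t * theta), math.sin(t * theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]   # arbitrary query vector
k = [1.1, 0.4, -0.6, 0.9]   # arbitrary key vector

score_inside = dot(rope_rotate(q, 3), rope_rotate(k, 1))     # positions within training range
score_outside = dot(rope_rotate(q, 17), rope_rotate(k, 15))  # positions beyond it, same offset
score_other_offset = dot(rope_rotate(q, 3), rope_rotate(k, 2))

# Same offset gives the same score up to floating-point rounding.
assert math.isclose(score_inside, score_outside, rel_tol=1e-9, abs_tol=1e-12)
```

A different offset (`score_other_offset`) yields a different score, which is the sense in which the attention logit carries relative rather than absolute position.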

What the curves look like

Hand-drawn ASCII (real plot in Phase 16 lab 03):

val loss
  ^
1.5+                          o 1.50
   |
1.0+
   |                   o 0.80
0.5+                          x 0.45
   |            o 0.35 x 0.30
   |     * 0.10 x 0.20
   +─────────────────────────────> sequence length T
   0     5      10     15     20

o = sinusoidal, x = RoPE, * = both (predicted values from the table above)

The two curves overlap perfectly at T ≤ 5 (training distribution), then diverge.

Metrics

Val cross-entropy loss at each T ∈ {5, 10, 15, 20}.
Conjugation accuracy — does the model produce the correct tense? Phase 20 (eval harness) defines it formally.
Attention pattern visualization — sample a few sentences at T = 20, plot the attention matrix per layer. RoPE's matrix should look "diagonal-band-like" (relative-position structure). Sinusoidal's matrix at T = 20 should look erratic (the model has no idea what to do).
Compute cost — RoPE adds a per-attention-layer cost; quantify it in FLOPs.
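For the compute-cost metric, a back-of-the-envelope count is enough at this scale. The sketch below rests on stated assumptions rather than measurements: one multiply or one add counts as one FLOP, the sin/cos tables are precomputed, rotating one pair costs 4 multiplies plus 2 adds, and the attention projections are four bias-free d_model × d_model matmuls (Q, K, V, output). The function names are illustrative.

```python
def rope_flops(seq_len: int, d_model: int, n_layers: int) -> int:
    """FLOPs to rotate Q and K in every layer (sin/cos assumed precomputed)."""
    flops_per_pair = 6               # 4 multiplies + 2 adds per 2-D rotation
    pairs_per_token = d_model // 2   # summed over all heads
    return n_layers * seq_len * 2 * pairs_per_token * flops_per_pair  # 2 = Q and K

def sinusoidal_flops(seq_len: int, d_model: int) -> int:
    """FLOPs to add PE to the input embeddings once (one add per dimension)."""
    return seq_len * d_model

def attn_projection_flops(seq_len: int, d_model: int, n_layers: int) -> int:
    """FLOPs for the Q, K, V and output projections (2 * d^2 per matmul per token)."""
    return n_layers * seq_len * 4 * 2 * d_model * d_model

# mini-GPT defaults from the constructor above, evaluated at T = 20.
T, d_model, n_layers = 20, 64, 2
rope_cost = rope_flops(T, d_model, n_layers)
sin_cost = sinusoidal_flops(T, d_model)
overhead = rope_cost / attn_projection_flops(T, d_model, n_layers)
```

Under these assumptions the ratio simplifies to 0.75 / d_model, so RoPE's extra work is on the order of 1% of the attention projections at d_model = 64. A profiler run should replace this estimate once the model exists.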

What this doesn't prove

RoPE is strictly better in all regimes. Phase 36 (frontier architectures) covers ALiBi (a linear bias on attention scores rather than a positional embedding) and shows RoPE wins on some tasks and loses on others.
The §A13 numbers transfer to large models. They don't directly; length extrapolation behaves differently at scale (Press et al. 2022, "Train Short, Test Long").

But for the specific question "should mini-GPT use RoPE or sinusoidal", the experiment will answer it. Our expectation going in: RoPE is a small-cost, mid-benefit choice with no downside at §A13 scale.

When not to extrapolate from this

A task with no length variation at inference (e.g., always exactly 5 tokens) shouldn't care which PE — pick whichever is simpler.
A task with sequences well beyond 4× train length (our OOD set stops at exactly 4×, T = 20 vs 5): neither plain PE will save you; you need ALiBi or YaRN (Phase 36).

Citations

Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., Liu, Y. 2021. "RoFormer: Enhanced Transformer with Rotary Position Embedding." arXiv:2104.09864. §3 derives RoPE's relative-position property.
Press, O., Smith, N., Lewis, M. 2022. "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation" (ALiBi). arXiv:2108.12409.
Touvron, H. et al. 2023. "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971. §2 describes replacing absolute positional embeddings with RoPE.

Reading + design exercise (no run required)

Implement RoPE in src/minimodel/nn/rope.py (Phase 16 lab 02).
Wire pe_kind="rope" into the mini-GPT constructor (the constructor lands in Phase 17, but blueprint it now in this phase).
Write the experiment spec following this page's template (use experiments/16-pe-compare/spec.md).
Predict the curves before running. Borja's notes go in learners/borja/phase-16/notes/predictions.md.
Run after Phase 17 closes; reconcile predictions vs reality in reflections.md.

One-paragraph recap

The §A13 corpus is too short for the sinusoidal-vs-RoPE comparison to favor RoPE on the in-distribution val set. But on synthetic length-extrapolated val sets (T=10, 15, 20), we predict that RoPE's relative-position math gives it a clean win: about 24 percentage points of conjugation accuracy at T=20. Why: the sinusoidal PE vectors for positions beyond the training range were never seen during training, whereas RoPE's dot product depends only on the difference t - s, so the short-range offsets the model learned look the same at any absolute position. This is the pattern Su et al. 2021 report, and it is consistent with LLaMA's choice of RoPE, scaled down to §A13. Phase 16 lab 03 runs the experiment.

Prev: 03-rope.md Next: Phase 17 (mini-GPT).