English · Español
Break 00 — Train BPE on the English-only half of the §A13 corpus¶
🇪🇸 Entrenamos BPE solo con la mitad inglesa del corpus. Predecimos que la tokenización española colapsa: mean-tokens-per-word duplica, las palabras con tilde se rompen en bytes. Es exactamente el bug que arruina los modelos multilingües cuando alguien "olvidó traducir el dataset de entrenamiento del tokenizador".
Anchors:
LYNX_CORTEX.md§4 / PHASE 11; theory §04 bytes-vs-codepoints;.claude/commands/break.md.
The break¶
In scripts/train_bpe.py:
# BUG: trains BPE on the English half only.
training_text = load_corpus(only_language="en") # was: only_language=None
tokenizer = train_bpe(training_text, vocab_size=512)
Single-line change. The tokenizer still trains, still has a 512-token vocab, still encodes any byte sequence — but the merge schedule is biased toward English bigrams (th, _t, er, ...). Spanish-specific merges (_e, _l, ar, er, os, as, accented characters) never form.
Predict, then run¶
Predictions¶
mean_tokens_per_english_word≈ 1.2 (good — that's what we wanted).mean_tokens_per_spanish_word≈ 2.5 or higher (bad — English-only training didn't learn Spanish bigrams).- Accented characters (
ñ,á) stay as raw byte pairs because no merge fused them. Spanish sentences look liket-r-a-b-a-j-0xC3-0xB3in token form. - After Phase 17 trains on these tokens, Spanish perplexity is ~
(2.5/1.2)^2 ≈ 4×worse than English.
Write predictions in learners/borja/phase-11/notes/breaks.md before running.
Observe¶
just exp 11-bpe-train --tag broken-english-only
just exp 11-tokenize-eval --tokenizer english-only.json
Diagnostics:
- Print 5 sample Spanish sentences as tokens — should show many short tokens / raw byte sequences.
- Compute
mean_tokens_per_wordper language. The ratio should be far from 1.0. - Histogram of vocab-token frequencies on a held-out Spanish corpus. If many vocab tokens are unused, the vocab is biased to English.
Symptom Borja will see¶
mean_tokens_per_spanish_word / mean_tokens_per_english_word ≈ 2.0(target: 1.0).- Spanish sentences look fragmented in the debug print.
- A Phase 17 mini-GPT trained on these tokens shows ~4× worse Spanish loss.
Hidden cause (one sentence)¶
load_corpus(only_language="en") made the tokenizer see only English text, so its merge schedule never compressed Spanish character bigrams.
Hint cascade¶
- Print the last 50 merges the tokenizer learned. Do you see any with Spanish characters?
- What does
mean_tokens_per_wordmeasure, and what's the ratio between languages? Is one suspiciously high? - Look at the
load_corpuscall. Does it use both languages, or just one?
Fix diff¶
Why this teaches the concept¶
Multilingual model performance is bounded by the tokenizer's coverage. The "we used English Wikipedia for BPE training and the model is bad at French" story is a well-known mistake (see Conneau et al. 2020, "Unsupervised Cross-lingual Representation Learning at Scale", §4). Phase 11 surfaces it on a microscopic, controlled corpus where Borja can measure the fragmentation directly and connect it to Phase 15's quadratic attention cost. Phase 12 (corpus design) will reinforce that the tokenizer training set and the model training set should be the same.
Next: Phase 12's /break on shuffled targets.