Skip to content

English · Español

Break 00 — Train BPE on the English-only half of the §A13 corpus

🇪🇸 Entrenamos BPE solo con la mitad inglesa del corpus. Predecimos que la tokenización española colapsa: mean-tokens-per-word duplica, las palabras con tilde se rompen en bytes. Es exactamente el bug que arruina los modelos multilingües cuando alguien "olvidó traducir el dataset de entrenamiento del tokenizador".

Anchors: LYNX_CORTEX.md §4 / PHASE 11; theory §04 bytes-vs-codepoints; .claude/commands/break.md.


The break

In scripts/train_bpe.py:

# BUG: trains BPE on the English half only.
training_text = load_corpus(only_language="en")    # was: only_language=None
tokenizer = train_bpe(training_text, vocab_size=512)

Single-line change. The tokenizer still trains, still has a 512-token vocab, still encodes any byte sequence — but the merge schedule is biased toward English bigrams (th, _t, er, ...). Spanish-specific merges (_e, _l, ar, er, os, as, accented characters) never form.

Predict, then run

Predictions

  1. mean_tokens_per_english_word ≈ 1.2 (good — that's what we wanted).
  2. mean_tokens_per_spanish_word2.5 or higher (bad — English-only training didn't learn Spanish bigrams).
  3. Accented characters (ñ, á) stay as raw byte pairs because no merge fused them. Spanish sentences look like t-r-a-b-a-j-0xC3-0xB3 in token form.
  4. After Phase 17 trains on these tokens, Spanish perplexity is ~(2.5/1.2)^2 ≈ 4× worse than English.

Write predictions in learners/borja/phase-11/notes/breaks.md before running.

Observe

just exp 11-bpe-train --tag broken-english-only
just exp 11-tokenize-eval --tokenizer english-only.json

Diagnostics:

  1. Print 5 sample Spanish sentences as tokens — should show many short tokens / raw byte sequences.
  2. Compute mean_tokens_per_word per language. The ratio should be far from 1.0.
  3. Histogram of vocab-token frequencies on a held-out Spanish corpus. If many vocab tokens are unused, the vocab is biased to English.

Symptom Borja will see

  • mean_tokens_per_spanish_word / mean_tokens_per_english_word ≈ 2.0 (target: 1.0).
  • Spanish sentences look fragmented in the debug print.
  • A Phase 17 mini-GPT trained on these tokens shows ~4× worse Spanish loss.

Hidden cause (one sentence)

load_corpus(only_language="en") made the tokenizer see only English text, so its merge schedule never compressed Spanish character bigrams.

Hint cascade

  1. Print the last 50 merges the tokenizer learned. Do you see any with Spanish characters?
  2. What does mean_tokens_per_word measure, and what's the ratio between languages? Is one suspiciously high?
  3. Look at the load_corpus call. Does it use both languages, or just one?

Fix diff

training_text = load_corpus(only_language=None)    # both languages

Why this teaches the concept

Multilingual model performance is bounded by the tokenizer's coverage. The "we used English Wikipedia for BPE training and the model is bad at French" story is a well-known mistake (see Conneau et al. 2020, "Unsupervised Cross-lingual Representation Learning at Scale", §4). Phase 11 surfaces it on a microscopic, controlled corpus where Borja can measure the fragmentation directly and connect it to Phase 15's quadratic attention cost. Phase 12 (corpus design) will reinforce that the tokenizer training set and the model training set should be the same.


Next: Phase 12's /break on shuffled targets.