English · Español
04 — Handle-as-bytes vs handle-as-codepoints (and the §A13 bilingual corpus)¶
🇪🇸 Cuando BPE encuentra
niñooestá, ¿debe vern-i-ñ-o(codepoints) o0x6E 0x69 0xC3 0xB1 0x6F(bytes UTF-8)? La diferencia parece estética. No lo es: cambia el tamaño del vocabulario base, la robustez a errores de codificación, y — críticamente para §A13 — el número de merges necesarios para que la mitad española del corpus sea tan eficiente como la inglesa.Anchors:
LYNX_CORTEX.md§4 / PHASE 11;LYNX_CORTEX_ADDENDUM.md§A13 (bilingual corpus); this phase §03 byte-level Unicode pitfalls.
The two camps¶
| Camp | Base alphabet | Vocabulary at init | Example: niño |
|---|---|---|---|
| Codepoints (CP) | Unicode chars | ~150,000 | [n, i, ñ, o] (4 tokens) |
| Bytes (BPE-B) | 0–255 | 256 | [0x6E, 0x69, 0xC3, 0xB1, 0x6F] (5 tokens) |
GPT-2 (Radford et al. 2019) introduced byte-level BPE. SentencePiece (Kudo & Richardson 2018) defaults to codepoint BPE but supports byte fallback. LLaMA, Mistral, Qwen all use byte-level BPE today.
Pros / cons table¶
Bytes¶
✅ Universal coverage. Every UTF-8 string has a byte representation. No <UNK> token ever needed.
✅ Tiny base alphabet. 256 initial tokens vs Unicode's 150K — keeps the initial vocab cheap.
✅ Compatibility with binary blobs. If the §A13 corpus ever embeds metadata (JSON, lang tags), bytes handle it natively.
❌ Two merges per non-ASCII character. ñ = 0xC3 0xB1 is two bytes. To make ñ a single token, you spend one merge step per non-ASCII character. Spanish uses ñ á é í ó ú ü, so it's 7 merges just to recover the alphabet — overhead the English half never pays.
❌ No semantic intuition for inspecting the vocab. 0xC3B1 is opaque to humans. Codepoint vocabs show ñ directly.
Codepoints¶
✅ One token per character at init. Spanish gets parity with English from step 0. ✅ Human-readable. Easier to debug.
❌ <UNK> for unseen codepoints. Emoji, exotic punctuation, mojibake — all require a fallback.
❌ Bigger initial vocab on diverse corpora. For a bilingual English+Spanish corpus the base is ~80 codepoints — fine. For a multilingual corpus, it explodes.
What §A13 actually contains¶
The bilingual grammar corpus enumerates 20 verbs × 5 tenses × 3 persons × {English, Spanish} ≈ 600 forms, plus pronouns (I, you, he, she, it, yo, tú, él, ella) and auxiliaries (will, going to, voy a, va a, ...).
Distinct character set in the canonical corpus:
English : a..z A..Z + space + period → 53 codepoints
Spanish : a..z A..Z + space + period
+ á é í ó ú ñ Á É Í Ó Ú Ñ ü → 67 codepoints
Combined : ~67 codepoints
UTF-8 bytes used: ~70 (the accented chars contribute 0xC3 0xA1, 0xC3 0xA9, ...)
So the base alphabet is 70 bytes if you go bytes-level, 67 codepoints if you go codepoints-level. The §A13 corpus is small enough that this is functionally identical.
But the merge schedule differs¶
This is the key point most tutorials skip.
Scenario A — codepoint BPE on the bilingual corpus¶
After init, ñ is already one token. The first ~50 merges fuse high-frequency English bigrams (th, er, in, _t, _a, ...) and high-frequency Spanish bigrams (er, ar, _e, _l, ...). By merge 200, both languages have comparable token-per-word ratios.
Scenario B — byte-level BPE on the bilingual corpus¶
After init, ñ is two tokens (0xC3 0xB1). The very first merges will fuse the fixed-pair (0xC3, 0xB1) → ñ because it's the most frequent pair (every ñ in the corpus contributes). Same for (0xC3, 0xA1) → á, etc. That's 7 forced merges to "recover the Spanish alphabet" before any real linguistic structure is captured.
Quantitative comparison on §A13¶
Empirical measurement (lab 02 — but you can do the arithmetic by hand):
| Tokenizer | Avg tokens per English word | Avg tokens per Spanish word | Tokens to fit 1k examples |
|---|---|---|---|
| CP-BPE, vocab=256 | 1.4 | 1.6 | ~5,800 |
| Byte-BPE, vocab=256 | 1.4 | 2.1 | ~6,400 |
| Byte-BPE, vocab=512 | 1.2 | 1.5 | ~5,500 |
The Spanish-tokens-per-word for byte-BPE at vocab=256 is 50% higher than English — because the budget is consumed by accented-character merges. Double the vocab and parity returns.
What this means for the §A13 model¶
Phase 17's mini-GPT trains on sequences of these tokens. If half the corpus (Spanish) consistently produces longer sequences:
- The model spends more compute per Spanish example.
- The attention cost is quadratic in sequence length (Phase 15) — Spanish examples cost ~
(2.1/1.4)^2 = 2.25×more. - The loss across the two languages is unfairly weighted toward whoever has shorter sequences.
The right answer for §A13 is byte-level BPE with vocab=512, giving parity. Phase 11's lab 02 enforces this.
Edge case: subnormal UTF-8¶
A \xC3 byte that doesn't start a valid 2-byte sequence (e.g., \xC3\xC3) is malformed UTF-8. Byte-level BPE happily merges it anyway — it doesn't validate. This is a feature, not a bug: the tokenizer is robust to garbage input, which is what you want for a robust ML model.
Codepoint BPE has to validate UTF-8 before tokenizing. If it fails, you either drop the example or substitute \uFFFD. Either path adds preprocessing complexity.
What modern LLMs do¶
| Model | Tokenizer kind | Base size |
|---|---|---|
| GPT-2 | Byte-level BPE | 256 |
| GPT-3 | Byte-level BPE (same) | 256 |
| LLaMA-1 | SentencePiece + BPE (CP-ish, with byte fallback) | mixed |
| LLaMA-2 | Same as LLaMA-1 | mixed |
| Mistral 7B | Byte-level BPE | 256 |
| Qwen-1.5 | Byte-level BPE | 256 |
| Gemma | SentencePiece + Unigram | - |
Byte-level BPE is the dominant choice because it's truly universal. Phase 11 implements byte-level BPE for that reason.
What §A13 trains (concretely)¶
scripts/train_bpe.py (Phase 11 lab) inputs the full bilingual corpus (~600 sentences × ~5 words each). Configuration:
TOKENIZER_CONFIG = {
"kind": "byte_bpe",
"base_vocab_size": 256,
"target_vocab_size": 512,
"merges_path": "data/tokenizer/merges.txt",
"vocab_path": "data/tokenizer/vocab.json",
# ε for word-boundary regex; GPT-2 style
"pre_tokenize_regex": r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""",
}
Vocab 512 gives parity. The pre-tokenize regex follows GPT-2's pattern — keeps the merges from crossing word boundaries.
Pitfalls (and how to spot them)¶
- Vocab 256, bilingual corpus, "Spanish trains worse" — and you blame the model. It's the tokenizer. Always measure
mean_tokens_per_wordper language before blaming the model. - Codepoint BPE on UTF-8 input that didn't decode. Tokenizer crashes or emits
\uFFFD. Choose byte-level and skip the failure mode. - Pre-tokenize regex tuned for English, applied to Spanish. Spanish punctuation (
¿ ¡) and the inverted-question-mark conventions may produce strange tokens. Test on a held-out Spanish sample. - Mixing BPE training corpora ("train on English Wikipedia, deploy on Spanish §A13"). The vocab will be biased toward English bigrams. Phase 11 lab 02 enforces same-corpus training and deployment.
Citations¶
- Sennrich, R., Haddow, B., Birch, A. 2016. "Neural Machine Translation of Rare Words with Subword Units." arXiv:1508.07909 — original BPE for NMT.
- Radford, A. et al. 2019. "Language Models are Unsupervised Multitask Learners" (GPT-2). Section 2.2 introduces byte-level BPE.
- Kudo, T., Richardson, J. 2018. "SentencePiece: A simple and language independent subword tokenizer." arXiv:1808.06226.
One-paragraph recap¶
Byte-level BPE has a 256-token base alphabet and represents any UTF-8 string without <UNK>. Codepoint BPE has Unicode's ~150K base and is human-readable but needs <UNK> fallback. For the §A13 bilingual corpus, byte-level BPE at vocab=256 produces 50% more tokens per Spanish word than per English word because 7 merges go to recovering accented characters. Doubling vocab to 512 restores parity. Phase 11 uses byte-level BPE at vocab=512 by default. The §A13 model would otherwise quadratically over-spend on Spanish under attention.
Prev: 03-byte-level-and-unicode.md
Next: Phase 12 (corpus design).