04 — Handle-as-bytes vs handle-as-codepoints (and the §A13 bilingual corpus)¶

When BPE encounters niño or está, should it see n-i-ñ-o (codepoints) or 0x6E 0x69 0xC3 0xB1 0x6F (UTF-8 bytes)? The difference looks cosmetic. It is not: it changes the size of the base vocabulary, the robustness to encoding errors, and, critically for §A13, the number of merges needed before the Spanish half of the corpus is tokenized as efficiently as the English half.

Anchors: LYNX_CORTEX.md §4 / PHASE 11; LYNX_CORTEX_ADDENDUM.md §A13 (bilingual corpus); this phase §03 byte-level Unicode pitfalls.

The two camps¶

Camp	Base alphabet	Vocabulary at init	Example: `niño`
Codepoints (CP)	Unicode chars	up to ~150,000 (in practice, the characters seen in training)	`[n, i, ñ, o]` (4 tokens)
Bytes (BPE-B)	0–255	256	`[0x6E, 0x69, 0xC3, 0xB1, 0x6F]` (5 tokens)
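
The two rows can be reproduced with Python built-ins. This is a minimal sketch; the variable names are illustrative and not part of the Phase 11 lab code.

```python
# "ni\u00f1o" spells niño with the precomposed ñ (U+00F1).
word = "ni\u00f1o"

# Codepoint view: one symbol per Unicode character.
codepoints = list(word)

# Byte view: one symbol per UTF-8 byte; ñ expands to 0xC3 0xB1.
utf8_bytes = list(word.encode("utf-8"))

print(len(codepoints), codepoints)
print(len(utf8_bytes), [f"0x{b:02X}" for b in utf8_bytes])
```

The byte view is one symbol longer because ñ is the only character in the word outside ASCII.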

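GPT-2 (Radford et al. 2019) introduced byte-level BPE. SentencePiece (Kudo & Richardson 2018) works on Unicode codepoints, defaults to the Unigram model with BPE as an option, and supports byte fallback for characters missing from the vocabulary. GPT-2, GPT-3, and Qwen use byte-level BPE; LLaMA-1/2 and Mistral 7B use SentencePiece BPE with byte fallback.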

Pros / cons table¶

Bytes¶

✅ Universal coverage. Every UTF-8 string has a byte representation, so no <UNK> token is ever needed.

✅ Tiny base alphabet. 256 initial tokens versus Unicode's ~150K keeps the initial vocab cheap.

✅ Compatibility with binary blobs. If the §A13 corpus ever embeds metadata (JSON, lang tags), bytes handle it natively.

❌ One merge per non-ASCII character. ñ = 0xC3 0xB1 is two bytes, so making ñ a single token costs one merge step, and the same holds for every other non-ASCII character. Spanish uses ñ á é í ó ú ü, so recovering the lowercase alphabet alone takes 7 merges (13 once the uppercase Á É Í Ó Ú Ñ are counted), overhead the English half never pays.

❌ No semantic intuition for inspecting the vocab. 0xC3B1 is opaque to humans. Codepoint vocabs show ñ directly.

Codepoints¶

✅ One token per character at init. Spanish gets parity with English from step 0.

✅ Human-readable. Easier to debug.

❌ <UNK> for unseen codepoints. Emoji, exotic punctuation, mojibake: all require a fallback.

❌ Bigger initial vocab on diverse corpora. For a bilingual English+Spanish corpus the base stays under 100 codepoints (67 for the canonical §A13 character set), which is fine. For a multilingual corpus, it explodes.

What §A13 actually contains¶

The bilingual grammar corpus enumerates 20 verbs × 5 tenses × 3 persons × {English, Spanish} = 600 forms, plus pronouns (I, you, he, she, it, yo, tú, él, ella) and auxiliaries (will, going to, voy a, va a, ...).

Distinct character set in the canonical corpus:

English   : a..z A..Z + space + period      → 54 codepoints
Spanish   : a..z A..Z + space + period
            + á é í ó ú ñ Á É Í Ó Ú Ñ ü     → 67 codepoints
Combined  : 67 codepoints
UTF-8 bytes used: 68 (54 ASCII bytes + the lead byte 0xC3 + 13 continuation bytes: 0xC3 0xA1, 0xC3 0xA9, ...)

So only 68 of the 256 possible byte values actually occur if you go byte-level, against 67 codepoints if you go codepoint-level. The §A13 corpus is small enough that the two base alphabets are functionally identical.
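
The counts can be checked mechanically. The sketch below rebuilds the character set from the listing above (letters, space, period, and the 13 accented forms) and counts distinct codepoints and distinct UTF-8 bytes; the variable names are illustrative.

```python
import string
import unicodedata

# ASCII part of the canonical character set: a..z A..Z + space + period.
english = set(string.ascii_letters + " .")

# Accented letters, normalized to NFC so each is a single codepoint.
accented = set(unicodedata.normalize("NFC", "áéíóúñÁÉÍÓÚÑü"))
spanish = english | accented

# Distinct byte values that appear when the whole set is UTF-8 encoded.
bytes_used = {b for ch in spanish for b in ch.encode("utf-8")}

print(len(english), len(spanish), len(bytes_used))
```

Every accented letter in this set shares the lead byte 0xC3, which is why 13 extra characters add only 14 extra byte values.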

But the merge schedule differs¶

This is the key point most tutorials skip.

Scenario A — codepoint BPE on the bilingual corpus¶

After init, ñ is already one token. The first ~50 merges fuse high-frequency English bigrams (th, er, in, _t, _a, ...) and high-frequency Spanish bigrams (er, ar, _e, _l, ...). By merge 200, both languages have comparable token-per-word ratios.

Scenario B — byte-level BPE on the bilingual corpus¶

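After init, ñ is two tokens (0xC3 0xB1). Because 0xC3 is always followed by one of a handful of continuation bytes, pairs such as (0xC3, 0xB1) → ñ and (0xC3, 0xA1) → á rank high in the pair counts (every ñ or á in the corpus contributes) and are merged early, though exactly when depends on how often each accented character occurs. Until that merge happens, no other merge can span the character. That's 7 forced merges to "recover the Spanish alphabet", spent on restoring characters rather than on capturing linguistic structure.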
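
The difference between the two scenarios shows up in a toy BPE trainer. This is a minimal sketch, not the lab's scripts/train_bpe.py: `train_bpe` is a hypothetical helper, and the four-word sample is an assumption chosen so that ñ dominates the pair counts. On a real corpus the merge order depends on the actual frequencies.

```python
import unicodedata
from collections import Counter

def train_bpe(words, num_merges, as_bytes):
    """Greedy BPE over words split into UTF-8 bytes or codepoints.

    Returns the learned merges and the total token count afterwards.
    """
    seqs = Counter()
    for w in words:
        w = unicodedata.normalize("NFC", w)
        units = w.encode("utf-8") if as_bytes else w
        # Each symbol is a tuple so merged symbols are simple concatenations.
        seqs[tuple((u,) for u in units)] += 1
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq, n in seqs.items():
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += n
        if not pairs:
            break
        a, b = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merged = Counter()
        for seq, n in seqs.items():
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            merged[tuple(out)] += n
        seqs = merged
        merges.append((a, b))
    return merges, sum(len(s) * n for s, n in seqs.items())

words = ["niño", "año", "señor", "mañana"]  # hypothetical sample
byte_merges, byte_total = train_bpe(words, 2, as_bytes=True)
cp_merges, cp_total = train_bpe(words, 1, as_bytes=False)
print(byte_merges, byte_total)
print(cp_merges, cp_total)
```

On this sample the byte-level run spends its first merge rebuilding ñ and only learns the ño pair on its second merge, which the codepoint run learns on its first. Both end at the same token count, with the byte-level run one merge behind.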

Quantitative comparison on §A13¶

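Empirical measurement from lab 02 (the arithmetic can also be done by hand). The vocab=256 rows presumably take as base alphabet only the symbols that occur in the corpus (fewer than 70); with a full 256-entry byte base, a vocab of 256 would leave no room for merges.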

Tokenizer	Avg tokens per English word	Avg tokens per Spanish word	Tokens to fit 1k examples
CP-BPE, vocab=256	1.4	1.6	~5,800
Byte-BPE, vocab=256	1.4	2.1	~6,400
Byte-BPE, vocab=512	1.2	1.5	~5,500

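At vocab=256, byte-BPE spends 50% more tokens per Spanish word than per English word (2.1 vs 1.4), in part because merges are consumed recovering accented characters. Doubling the vocab to 512 cuts the gap to 25% (1.5 vs 1.2), close to the 14% gap of CP-BPE (1.6 vs 1.4), and gives the lowest total token count of the three.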
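
The per-language metric in the table can be measured with a few lines. This is a sketch under assumptions: `mean_tokens_per_word` and the two stand-in tokenizers are hypothetical (no merges are applied, and spaces count as tokens), and the sample sentences are illustrative rather than taken from §A13.

```python
import unicodedata

def mean_tokens_per_word(sentences, tokenize):
    """Average tokens per whitespace-delimited word over a list of sentences."""
    n_words = sum(len(s.split()) for s in sentences)
    n_tokens = sum(len(tokenize(s)) for s in sentences)
    return n_tokens / n_words

# Stand-in tokenizers with zero merges; swap in a trained tokenizer's encode().
def byte_tokens(text):
    return list(text.encode("utf-8"))

def codepoint_tokens(text):
    return list(text)

english = ["I will sing", "she sings"]
spanish = [unicodedata.normalize("NFC", s) for s in ["yo cantaré", "él cantó"]]

en_bytes = mean_tokens_per_word(english, byte_tokens)
en_cp = mean_tokens_per_word(english, codepoint_tokens)
es_bytes = mean_tokens_per_word(spanish, byte_tokens)
es_cp = mean_tokens_per_word(spanish, codepoint_tokens)
print(en_bytes, en_cp, es_bytes, es_cp)
```

Even before any merges, the English numbers are identical under both views while the Spanish byte count is higher, which is the gap the merge budget then has to close.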

What this means for the §A13 model¶

Phase 17's mini-GPT trains on sequences of these tokens. If half the corpus (Spanish) consistently produces longer sequences:

The model spends more compute per Spanish example.
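Attention cost is quadratic in sequence length (Phase 15), so at vocab=256 a Spanish example costs about (2.1/1.4)^2 = 2.25× as much attention compute as an English example with the same number of words.
A token-averaged loss weights the two languages unevenly, because the language with longer sequences contributes more token positions per example.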
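
The cost arithmetic can be repeated for every row of the comparison table. The sketch below assumes equal word counts per example and considers only the quadratic attention term; `cost_ratios` is an illustrative helper, and the figures are the table's tokens-per-word values.

```python
def cost_ratios(en_tokens_per_word, es_tokens_per_word):
    """Return (sequence-length ratio, attention-cost ratio), Spanish over English."""
    length_ratio = es_tokens_per_word / en_tokens_per_word
    return length_ratio, length_ratio ** 2

# (English, Spanish) tokens per word, copied from the comparison table.
table = {
    "CP-BPE, vocab=256": (1.4, 1.6),
    "Byte-BPE, vocab=256": (1.4, 2.1),
    "Byte-BPE, vocab=512": (1.2, 1.5),
}

for name, (en, es) in table.items():
    length_ratio, attn_ratio = cost_ratios(en, es)
    print(f"{name}: length x{length_ratio:.2f}, attention x{attn_ratio:.2f}")
```

Squaring the ratio is what turns a 50% length gap into a 2.25× attention gap, and a 25% length gap into roughly 1.56×.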

The right answer for §A13 is byte-level BPE with vocab=512, which brings the two languages close to parity (1.2 vs 1.5 tokens per word). Phase 11's lab 02 enforces this.

Edge case: malformed UTF-8¶

A \xC3 byte that is not followed by a continuation byte (e.g., \xC3\xC3) is malformed UTF-8. Byte-level BPE tokenizes it anyway, because it never validates its input. This is a feature, not a bug: garbage input still maps to a token sequence instead of raising an error or forcing you to drop the example.

Codepoint BPE has to validate UTF-8 before tokenizing. If it fails, you either drop the example or substitute \uFFFD. Either path adds preprocessing complexity.
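
The two behaviours are easy to observe with Python's built-in codec. This is a minimal sketch with illustrative variable names; it shows the decoding step a codepoint tokenizer needs, not any particular tokenizer library.

```python
raw = b"\xc3\xc3"  # 0xC3 is a lead byte, but no continuation byte follows

# Byte-level view: the input is just two symbols, valid or not.
byte_tokens = list(raw)

# Codepoint view: the bytes must decode first, and strict decoding fails.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# The usual workaround substitutes U+FFFD, which discards the original bytes.
replaced = raw.decode("utf-8", errors="replace")

print(byte_tokens, strict_ok, [hex(ord(c)) for c in replaced])
```

Note that the substitution is lossy: once the input has become U+FFFD, the original byte values cannot be recovered from the tokens.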

What modern LLMs do¶

Model	Tokenizer kind	Base size
GPT-2	Byte-level BPE	256
GPT-3	Byte-level BPE (same)	256
LLaMA-1	SentencePiece + BPE (CP-ish, with byte fallback)	mixed
LLaMA-2	Same as LLaMA-1	mixed
Mistral 7B	SentencePiece + BPE (with byte fallback, as in LLaMA)	mixed
Qwen-1.5	Byte-level BPE	256
Gemma	SentencePiece (with byte fallback)	mixed

Every tokenizer in the table can fall back to raw bytes, either as its base alphabet or through byte fallback, because that is what makes coverage truly universal. Phase 11 implements byte-level BPE for that reason.

What §A13 trains (concretely)¶

scripts/train_bpe.py (Phase 11 lab) takes the full bilingual corpus as input (~600 sentences × ~5 words each). Configuration:

TOKENIZER_CONFIG = {
    "kind": "byte_bpe",
    "base_vocab_size": 256,
    "target_vocab_size": 512,
    "merges_path": "data/tokenizer/merges.txt",
    "vocab_path": "data/tokenizer/vocab.json",
    # GPT-2-style pre-tokenization pattern; \p{L} and \p{N} need the third-party `regex` module (stdlib `re` rejects them)
    "pre_tokenize_regex": r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""",
}

A target vocab of 512 brings English and Spanish close to parity. The pre-tokenize regex follows GPT-2's pattern, which keeps merges from crossing word boundaries.
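
To see what the pre-tokenizer does to Spanish text without installing anything, the sketch below uses an approximation of the GPT-2 pattern written for the stdlib `re` module. The substitutions are assumptions of this example, not the lab's pattern: `[^\W\d_]` stands in for `\p{L}` and `\d` for `\p{N}`, which is close but not identical for unusual scripts and numerals.

```python
import re
import unicodedata

# Approximation of the GPT-2 pattern for stdlib `re` (no \p{...} support).
APPROX_GPT2 = re.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?[^\W\d_]+| ?\d+| ?(?:[^\s\w]|_)+|\s+(?!\S)|\s+"""
)

# NFC keeps accented letters as single codepoints, so words are not split
# at combining marks.
sentence = unicodedata.normalize("NFC", "¿Dónde está el niño?")

pieces = APPROX_GPT2.findall(sentence)
byte_lengths = [len(p.encode("utf-8")) for p in pieces]

print(pieces)
print(byte_lengths)
```

The inverted question mark comes out as its own piece, and each word keeps its leading space, so BPE merges are learned inside these pieces and never across them.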

Pitfalls (and how to spot them)¶

Vocab 256, bilingual corpus, "Spanish trains worse" — and you blame the model. It's the tokenizer. Always measure mean_tokens_per_word per language before blaming the model.
Codepoint BPE on UTF-8 input that didn't decode. Tokenizer crashes or emits \uFFFD. Choose byte-level and skip the failure mode.
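Pre-tokenize regex tuned for English, applied to Spanish. The contraction branches ('s, 't, 're, ...) are English-specific, and inverted punctuation (¿ ¡) is split off as its own piece of two bytes each (0xC2 0xBF, 0xC2 0xA1), which may produce strange tokens. Test on a held-out Spanish sample.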
Mixing BPE training corpora ("train on English Wikipedia, deploy on Spanish §A13"). The vocab will be biased toward English bigrams. Phase 11 lab 02 enforces same-corpus training and deployment.

Citations¶

Sennrich, R., Haddow, B., Birch, A. 2016. "Neural Machine Translation of Rare Words with Subword Units." arXiv:1508.07909 — original BPE for NMT.
Radford, A. et al. 2019. "Language Models are Unsupervised Multitask Learners" (GPT-2). Section 2.2 introduces byte-level BPE.
Kudo, T., Richardson, J. 2018. "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing." arXiv:1808.06226.

One-paragraph recap¶

Byte-level BPE has a 256-token base alphabet and represents any UTF-8 string without <UNK>. Codepoint BPE draws its base from Unicode's ~150K characters (in practice, the ones seen in training) and is human-readable but needs an <UNK> or byte fallback. For the §A13 bilingual corpus, byte-level BPE at vocab=256 produces 50% more tokens per Spanish word than per English word, in part because 7 merges go to recovering accented characters. Doubling the vocab to 512 narrows the gap to 25%. Phase 11 uses byte-level BPE at vocab=512 by default; otherwise the §A13 model would over-spend quadratically on Spanish under attention.

Prev: 03-byte-level-and-unicode.md Next: Phase 12 (corpus design).