00: Why tokenization is the first compromise
A tokenizer converts text into a sequence of integer IDs, the alphabet on which the model operates. Every choice (characters, words, subwords, bytes) has a cost: coverage, sequence length, error handling. The choice leaks into everything else in the project.
What a tokenizer is
A tokenizer is a deterministic, invertible (or nearly invertible) mapping between strings and finite sequences of integer IDs. The IDs are what the model actually sees. The vocabulary size \(V\) is the number of distinct token IDs.
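Pseudo-formally, a tokenizer is a pair of functions \(\mathrm{encode}: \text{strings} \to \{0, \dots, V-1\}^{*}\) and \(\mathrm{decode}: \{0, \dots, V-1\}^{*} \to \text{strings}\),
with the property decode(encode(s)) ≈ s (exact for byte-level BPE; sometimes lossy for character-normalized tokenizers).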
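The round-trip property can be illustrated with the simplest possible tokenizer: one token per UTF-8 byte. This is only a sketch under that assumption (no merges, so \(V = 256\)); the names `encode` and `decode` mirror the prose and are not taken from the project's code.

```python
def encode(text: str) -> list[int]:
    """Map a string to a list of integer IDs (here: its UTF-8 byte values)."""
    return list(text.encode("utf-8"))


def decode(ids: list[int]) -> str:
    """Map a list of byte-valued IDs back to a string."""
    return bytes(ids).decode("utf-8")


sample = "he works"
ids = encode(sample)
print(ids)                    # every ID is in range(256)
print(decode(ids) == sample)  # True: the round trip is exact for valid text
```

A trained BPE tokenizer would replace runs of these byte IDs with merged IDs, but the same round-trip check applies.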
Two non-obvious consequences:
- The vocab size \(V\) controls model size. The embedding matrix is `(V, d)` and the output projection is `(d, V)`. For \(d = 768\) and \(V = 50{,}000\), that is 38.4M parameters for the embedding alone. Doubling \(V\) adds another 38.4M. The choice of \(V\) is a model-architecture decision, made here, before any model exists.
- The token length distribution controls sequence-length costs. A character-level tokenizer turns `works` into 5 tokens; a word-level tokenizer makes it 1; a subword tokenizer makes it 1 or 2 depending on training frequency. The transformer's attention is \(O(L^2)\) in sequence length, so a 5× longer tokenization costs 25× the attention compute. The choice is a runtime decision.
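The two figures above are plain arithmetic and can be checked in a few lines. The helper names below are hypothetical and exist only for this illustration.

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Number of parameters in a (V, d) embedding matrix."""
    return vocab_size * d_model


def attention_cost_ratio(length_ratio: float) -> float:
    """Relative cost of O(L^2) attention when the sequence gets length_ratio times longer."""
    return length_ratio ** 2


d, V = 768, 50_000
print(embedding_params(V, d))                                # 38400000
print(embedding_params(2 * V, d) - embedding_params(V, d))   # 38400000
print(attention_cost_ratio(5))                               # 25
```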
The tokenizer is the project's first compromise. It cannot be re-decided later without retraining everything.
The three options (and the one we will build)
There are three families of tokenizers in modern use:
Family A: character-level
"I work" → ['I', ' ', 'w', 'o', 'r', 'k'] → [73, 32, 119, 111, 114, 107].
- Pro: trivial. Never out-of-vocab.
- Con: sequence length is 5–10× longer than other methods. Quadratic attention cost grows 25–100×. Compute-prohibitive.
- Where used: very small character-level LMs and ByT5 (Google, 2021), which works on raw UTF-8 bytes; not the mainstream LLM lineage.
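A character-level tokenizer fits in two one-line functions. The sketch below assumes the IDs are Unicode code points, which matches the numbers in the `"I work"` example; the function names are illustrative.

```python
def char_tokenize(text: str) -> list[str]:
    """Split a string into single-character tokens."""
    return list(text)


def char_ids(text: str) -> list[int]:
    """Use each character's Unicode code point as its ID."""
    return [ord(ch) for ch in text]


print(char_tokenize("I work"))  # ['I', ' ', 'w', 'o', 'r', 'k']
print(char_ids("I work"))       # [73, 32, 119, 111, 114, 107]
```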
Family B: word-level
"I worked" → ['I', ' ', 'worked'] → [12, 0, 487].
- Pro: short sequences. Tokens carry semantic meaning.
- Con: out-of-vocabulary problem is fatal. Misspell `worked` as `wroked` and the input becomes `['I', ' ', '<unk>']`. The model cannot even see the misspelled input. For grammar correction (the Phase 32 agent task), this is a non-starter: the agent must see misspellings to correct them.
- Where used: classical NLP pre-2017. Mostly retired.
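The out-of-vocabulary failure is easy to reproduce with a toy word-level tokenizer. In this sketch the vocabulary is hypothetical: the IDs for `I`, the space, and `worked` follow the example above, and the ID for `<unk>` is an arbitrary assumption.

```python
import re

# Hypothetical toy vocabulary; the '<unk>' ID is an arbitrary choice.
VOCAB = {" ": 0, "<unk>": 1, "I": 12, "worked": 487}


def word_tokenize(text: str) -> list[str]:
    """Split into runs of non-space characters and single whitespace characters."""
    return re.findall(r"\S+|\s", text)


def word_encode(text: str) -> list[int]:
    """Look each token up in VOCAB, falling back to the '<unk>' ID."""
    return [VOCAB.get(tok, VOCAB["<unk>"]) for tok in word_tokenize(text)]


print(word_encode("I worked"))  # [12, 0, 487]
print(word_encode("I wroked"))  # [12, 0, 1]
```

After encoding, `wroked` is indistinguishable from any other unknown word, so nothing downstream can recover the misspelling.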
Family C: subword / BPE
"he works" → ['he', ' work', 's'] → [58, 109, 23].
- Pro: balances both. Sequences stay short (close to word-level for frequent forms). Never out-of-vocab if implemented at the byte level. The `-s` suffix and the `work` stem are separate tokens, so the model can learn that the `-s` token, in this position, marks 3rd-person singular. That's exactly the structure our Phase 32 grammar-tutor agent needs.
- Con: more code, more design choices.
- Where used: every modern LLM. GPT-2/3/4, Llama, Claude, Mistral, etc.
We implement byte-level BPE — the variant where the base alphabet is the 256 possible byte values (not Unicode characters). This makes the tokenizer:
- OOV-free. Any byte string is tokenizable.
- Unicode-robust. NFC/NFD variants, BOM markers, and odd encodings all tokenize without special handling.
- Bilingual without ceremony. Spanish `mañana` (7 bytes in UTF-8, NFC form) and English `tomorrow` (8 bytes) tokenize side-by-side in one shared vocab. No language tag needed.
- Misspelling-safe. Adversarial or accidental garbage bytes still tokenize.
The cost: the base alphabet is bytes, not characters. So `mañana` starts as 7 byte-singletons (the `ñ` is 2 bytes: `0xC3 0xB1`). After BPE training on a Spanish-containing corpus, the bytes of `ñ` get merged into one symbol, and then maybe `añ`, `aña`, or `mañana` itself if frequent enough.
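The byte counts can be checked directly. This sketch writes `ñ` as the escape `\u00f1` so the string is unambiguously in precomposed (NFC) form; the helper name is illustrative.

```python
def utf8_bytes(word: str) -> list[int]:
    """Return the UTF-8 byte values that form the base symbols for byte-level BPE."""
    return list(word.encode("utf-8"))


for word in ("ma\u00f1ana", "tomorrow"):
    raw = utf8_bytes(word)
    # characters, bytes, and the bytes in hex
    print(word, len(word), len(raw), [hex(b) for b in raw])
```

For `mañana` this reports 6 characters and 7 bytes, with `0xc3 0xb1` in the third and fourth positions; `tomorrow` is 8 characters and 8 bytes.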
For our English-verb-grammar corpus with paired Spanish translations (per §A13), byte-level is the only correct choice: it's the only family that handles the bilingual signal natively without any extra mechanism.
Why BPE specifically (and not other subword schemes)
The subword family has several variants:
- BPE (Sennrich et al., 2016). Greedy merge by frequency. Simple, deterministic, fast.
- WordPiece (Schuster & Nakajima, 2012). Like BPE but merges to maximize likelihood under a unigram model. Used by BERT.
- Unigram (Kudo, 2018). Train a unigram language model, prune low-probability tokens. Used by SentencePiece / T5.
All three reach similar quality on benchmarks. We pick BPE because:
- The algorithm is the simplest of the three. Borja can implement it from scratch in ~150 lines of Python.
- It is what GPT-2 uses (and we borrow GPT-2's byte-level variant). Connects directly to modern open-source LLM code in Phase 24.
- It is deterministic. No EM training, no ambiguity. Reproducible: the same corpus and tie-breaking rule always yield the same merges.
WordPiece and Unigram are survey-only in theory 02; we don't implement them.
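To make "greedy merge by frequency" concrete, here is a minimal sketch of a single merge step over byte IDs. It is an illustration only: the function names and the tie-breaking rule (the numerically smallest pair wins) are assumptions, not the project's implementation, and the full training loop belongs to theory 02.

```python
from collections import Counter


def most_frequent_pair(ids: list[int]) -> tuple[int, int]:
    """Return the most frequent adjacent pair; ties go to the smallest pair."""
    counts = Counter(zip(ids, ids[1:]))
    return min(counts, key=lambda pair: (-counts[pair], pair))


def merge_pair(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace each non-overlapping occurrence of pair with new_id, left to right."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


ids = list("works worked work".encode("utf-8"))
pair = most_frequent_pair(ids)      # three pairs tie at count 3; (111, 114) = "or" wins
merged = merge_pair(ids, pair, 256) # 256 is the first ID after the byte alphabet
print(pair, len(ids), len(merged))  # (111, 114) 17 14
```

Because the tie-break is explicit, running this twice on the same input always selects the same pair, which is the determinism the list above relies on.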
What the choice leaks into
- Phase 12 corpus: the corpus is the data BPE trains on. Its design (20 verbs × 5 tenses × 3 persons + Spanish pairs) determines which merges win.
- Phase 13 embeddings: the embedding matrix is `(V, d)` where `V` is set here.
- Phase 15 attention: the input sequence length depends on tokenization granularity. A 6-word sentence becomes ~6–10 tokens; attention cost is then ~36–100 inner products.
- Phase 18 training: the cross-entropy loss is over \(V\) classes; small \(V\) is cheap per step.
- Phase 22 inference: generation samples one token at a time; longer tokens (e.g., `works` as one token) mean fewer samples per character of output.
- Phase 32 grammar-tutor agent: the agent will receive a learner's misspelled or mis-conjugated sentence; byte-level BPE guarantees it always tokenizes.
Every phase from 11 onward is downstream of this one. Choose well.
A quick semantic check (the post-training sanity gate)
After training BPE on our English-verb corpus with Spanish pairs, the top merges should be interpretable as morphologically or distributionally meaningful substrings. Expected wins (the lab DoD checks for these):
- `work`, `play`, `walk`, `listen`: frequent verb stems with their leading space.
- `-s` (after stems): 3rd-person singular present marker.
- `-ed`: regular past simple marker.
- `will`: simple future marker.
- `to`: `going to` future marker and infinitive marker.
- `he`, `she`, `I`, `you`, `it`: pronouns.
- For the Spanish side: `trab`, `come`, `ar`, `er`, `ió`, `ará`, which are Spanish conjugation stems and endings.
- `\n`, `.`, `,`: separators.
If we train BPE and the top merges are nonsense, something is wrong with the corpus, the training, or both. The "top merges are interpretable" sanity check is a phase DoD item, not a vibe call.
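One way such a gate could be automated is sketched below. Everything here is hypothetical: the function name, the representation of learned merges as a list of `bytes` in training order, and the sample data are assumptions for illustration, and the lab's actual DoD check may look different.

```python
def missing_expected_merges(
    learned: list[bytes], expected: list[str], top_n: int
) -> list[str]:
    """Return the expected substrings absent from the first top_n learned merges."""
    top = set(learned[:top_n])
    return [s for s in expected if s.encode("utf-8") not in top]


# Hypothetical merge list, in the order training produced it.
learned = [b"or", b"wor", b"work", b"ed", b"he", b"ar"]
print(missing_expected_merges(learned, ["work", "ed", "will"], top_n=5))  # ['will']
```

An empty result would mean every expected substring showed up among the top merges; a non-empty one names exactly what to investigate.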
Why we do not just use tiktoken
tiktoken is OpenAI's open-source BPE library — fast, correct, popular. Why not pip install tiktoken and skip Phase 11?
Three reasons:
- CLAUDE.md §0 hard-rule 4 (build before abstracting). The whole curriculum is "build, then read other people's code". Skipping the build skips the learning.
- Customization. `tiktoken` ships with GPT-2/4 vocabs. Our corpus is not English prose; it is a deliberately small grammatical matrix with Spanish pairs. Training our own vocab produces merges that match our data. The morpheme-level pattern (`-s`, `-ed`) is unlikely to be a top merge in GPT's vocab; it will be in ours.
- Transparency. When something is wrong with how the model handles a specific input, you need to step into the tokenizer. Black-box library calls hide bugs.
In Phase 24 we read tiktoken's source as part of the "framework comparison" sub-phase. Our hand-built version is the reference we compare against.
Tokenization is a robustness boundary
The Phase 32 grammar-tutor agent receives a learner's sentence — possibly mis-spelled (wroked), possibly mis-conjugated (he goed), possibly bilingual (Yo work mañana). Every token it processes was produced by this tokenizer. Three robustness considerations:
- No tokenizer crashes. A malformed byte sequence must tokenize, not throw. Byte-level BPE guarantees this.
- Unicode confusables. `0` (digit zero) vs `O` (capital letter O) vs Greek `Ο` (omicron). They look alike, and the last two are indistinguishable in most fonts, but all three tokenize differently. The grammar agent must reason about visual identity vs token identity. We don't solve this; we expose it.
- NFC/NFD ambiguity. `mañana` can be stored two ways: NFC (6 codepoints, one of them the precomposed `ñ`) or NFD (7 codepoints, with `ñ` split into `n` + combining tilde). Both render identically; they tokenize differently at the byte level. Theory 03 addresses this explicitly.
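All three considerations can be observed with the standard library alone. The sketch below uses `unicodedata.normalize` to produce both forms of `mañana`, prints the code points and UTF-8 bytes of the three confusable characters, and shows that invalid UTF-8 still maps to byte IDs. The variable names are illustrative.

```python
import unicodedata

word = "ma\u00f1ana"  # written with the precomposed ñ (U+00F1)
nfc = unicodedata.normalize("NFC", word)
nfd = unicodedata.normalize("NFD", word)

print(len(nfc), len(nfc.encode("utf-8")))  # 6 code points, 7 bytes
print(len(nfd), len(nfd.encode("utf-8")))  # 7 code points, 8 bytes
print(nfc == nfd)                          # False, although both render the same

# Confusables: digit zero, Latin capital O, Greek capital omicron.
for ch in ("0", "O", "\u039f"):
    print(repr(ch), hex(ord(ch)), list(ch.encode("utf-8")))

# Bytes that are not valid UTF-8 still map to IDs at the byte level.
print(list(b"\xff\xfe"))  # [255, 254]
```

The NFD form is one byte longer than the NFC form, so the two spellings enter BPE as different base sequences.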
These come up again in Phase 32. For now, byte-level BPE is the right choice because it leaves no tokenizer-level surprise behavior — all the surprises are in the model itself.
One-paragraph recap
A tokenizer maps strings to integer ID sequences and back. The choice constrains every later phase: vocab size sets the embedding matrix size, token granularity sets the sequence length and therefore attention compute. Three families: character (too long), word (out-of-vocab fatal), subword (Goldilocks). Among subwords, byte-level BPE is what GPT-2 and modern LLMs use because it is simple, deterministic, OOV-free, Unicode-robust, and handles our bilingual English-Spanish verb corpus without any extra mechanism. We build it from scratch; Phase 24 compares to tiktoken.
What this theory page does NOT cover
- The BPE algorithm itself. That's theory 02.
- Detailed byte / Unicode mechanics. Theory 03.
- Subword regularization (random sub-token dropout during training to make models robust to alternative tokenizations). A Phase 18+ concern.
- Vocabulary growth strategies (start small and grow, or fix at training time). We fix at training time; growing-vocab schemes are research-only.
Next: theory/01-character-word-subword.md.