
00 — Why tokenization is the first compromise¶

A tokenizer converts text into a sequence of integer IDs, the alphabet the model operates on. Every choice (characters, words, subwords, bytes) has a cost: coverage, sequence length, error handling. The choice filters through to everything else in the project.

What a tokenizer is¶

A tokenizer is a deterministic mapping, ideally invertible (a bijection or near-bijection), between strings and finite sequences of integer IDs. The IDs are what the model actually sees. The vocabulary size \(V\) is the number of distinct token IDs.

Pseudo-formally:

encode : str → list[int]
decode : list[int] → str

with the property decode(encode(s)) ≈ s (exact for byte-level BPE; sometimes lossy for character-normalized tokenizers).
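The round-trip property can be seen with the simplest possible byte-level tokenizer, one with no merges at all. This is a minimal sketch, not the tokenizer built in this phase: it assumes the IDs are just the raw UTF-8 byte values (0 to 255).

```python
def encode(text: str) -> list[int]:
    """Map a string to integer IDs: here, simply its UTF-8 byte values."""
    return list(text.encode("utf-8"))


def decode(ids: list[int]) -> str:
    """Invert encode(): turn the byte values back into a string."""
    return bytes(ids).decode("utf-8")


sample = "I work mañana"
ids = encode(sample)
print(ids[:6])                # [73, 32, 119, 111, 114, 107]
print(decode(ids) == sample)  # True: the round trip is exact
```

Byte-level BPE keeps this exactness because merges only regroup bytes; they never discard or rewrite them.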

Two non-obvious consequences:

The vocab size \(V\) controls model size. The embedding matrix is (V, d) and the output projection is (d, V). For \(d = 768\) and \(V = 50{,}000\), the embedding alone has \(768 \times 50{,}000 = 38.4\)M parameters, and the output projection costs the same again unless its weights are tied to the embedding. Doubling \(V\) adds another 38.4M per matrix. The choice of \(V\) is a model-architecture decision, made here, before any model exists.
The token length distribution controls sequence-length costs. A character-level tokenizer turns the word works into 5 tokens; a word-level tokenizer makes it 1; a subword tokenizer makes it 1 or 2 depending on training frequency. Self-attention is \(O(L^2)\) in sequence length \(L\), so a 5× longer tokenization costs roughly 25× the attention compute. The choice is a runtime decision.

The tokenizer is the project's first compromise. It cannot be re-decided later without retraining everything.
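Both consequences are plain arithmetic, which a few lines can make concrete. The function names below are illustrative, not part of the project code, and the attention figure counts only the quadratic term.

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Number of weights in a (V, d) embedding matrix."""
    return vocab_size * d_model


def attention_cost_ratio(length_factor: int) -> int:
    """Relative O(L^2) attention cost when sequences get length_factor times longer."""
    return length_factor ** 2


print(embedding_params(50_000, 768))   # 38400000
print(embedding_params(100_000, 768))  # 76800000, doubling V doubles the matrix
print(attention_cost_ratio(5))         # 25
```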

The three options (and the one we will build)¶

There are three families of tokenizers in modern use:

Family A — character-level¶

"I work" → ['I', ' ', 'w', 'o', 'r', 'k'] → [73, 32, 119, 111, 114, 107].

Pro: trivial. Never out-of-vocab.
Con: sequence length is 5–10× longer than other methods. Quadratic attention cost grows 25–100×. Compute-prohibitive.
Where used: very small character-level LMs, and ByT5 (Google, 2021), which works on raw UTF-8 bytes rather than characters. Not the mainstream LLM lineage.

Family B — word-level¶

"I worked" → ['I', ' ', 'worked'] → [12, 0, 487].

Pro: short sequences. Tokens carry semantic meaning.
Con: the out-of-vocabulary problem is fatal. Misspell worked as wroked and the input becomes ['I', ' ', '<unk>']: the model cannot even see the misspelling. For grammar correction (the Phase 32 agent task), this is a non-starter, because the agent must see misspellings to correct them.
Where used: classical NLP pre-2017. Mostly retired.
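The contrast between Families A and B can be sketched in a few lines. The toy vocabulary below is an assumption made for illustration: its IDs mirror the examples above, and the ID chosen for `<unk>` is arbitrary.

```python
import re

UNK = "<unk>"
# Toy word-level vocabulary (hypothetical); a real one would hold thousands of entries.
WORD_VOCAB = {" ": 0, UNK: 1, "I": 12, "worked": 487}


def char_encode(text: str) -> list[int]:
    """Family A: one ID per character (its Unicode code point)."""
    return [ord(ch) for ch in text]


def word_encode(text: str) -> list[int]:
    """Family B: one ID per word or whitespace character, <unk> for anything unseen."""
    pieces = re.findall(r"\s|\S+", text)
    return [WORD_VOCAB.get(piece, WORD_VOCAB[UNK]) for piece in pieces]


print(char_encode("I work"))    # [73, 32, 119, 111, 114, 107]
print(word_encode("I worked"))  # [12, 0, 487]
print(word_encode("I wroked"))  # [12, 0, 1]: the misspelling is erased
```

The character-level encoder never loses information but emits one ID per character; the word-level encoder is short but collapses every unseen word into the same ID.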

Family C — subword / BPE¶

"he works" → ['he', ' work', 's'] → [58, 109, 23].

Pro: balances both. Sequences stay short (close to word-level for frequent forms). Never out-of-vocab if implemented at the byte level. The -s suffix and the work stem are separate tokens, so the model can learn that the -s token, in this position, marks 3rd-person singular. That's exactly the structure our Phase 32 grammar-tutor agent needs.
Con: more code, more design choices.
Where used: every modern LLM. GPT-2/3/4, Llama, Claude, Mistral, etc.

We implement byte-level BPE — the variant where the base alphabet is the 256 possible byte values (not Unicode characters). This makes the tokenizer:

OOV-free. Any byte string is tokenizable.
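Unicode-robust. It never fails on NFC/NFD differences, BOM markers, or odd encodings: whatever bytes arrive get tokenized (although differently normalized forms of the same text produce different tokens).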
Bilingual without ceremony. Spanish mañana (7 bytes in UTF-8, because ñ takes 2) and English tomorrow (8 bytes) tokenize side-by-side in one shared vocab. No language tag needed.
Misspelling-safe. Adversarial or accidental garbage bytes still tokenize.

The cost: the base alphabet is bytes, not characters. So mañana starts as 7 byte-singletons (the ñ is 2 bytes: 0xC3 0xB1). After BPE training on a Spanish-containing corpus, the two bytes of ñ get merged into one symbol, and then perhaps añ, aña, or mañana itself if frequent enough.
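The byte-level view of a word is easy to inspect directly. A short check, assuming the precomposed (NFC) spelling of ñ, written here as the escape `\u00f1` so the result does not depend on how this file is saved:

```python
word = "ma\u00f1ana"  # "mañana" with the precomposed ñ (U+00F1)
raw = word.encode("utf-8")

print(len(word), len(raw))                # 6 7: six characters, seven bytes
print(raw.hex(" "))                       # 6d 61 c3 b1 61 6e 61
print(len("tomorrow".encode("utf-8")))    # 8: pure ASCII, one byte per character
```

The pair `c3 b1` is the ñ; a byte-level BPE sees it as two base symbols until a merge joins them.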

For our English-verb-grammar corpus with paired Spanish translations (per §A13), byte-level is the only correct choice: it's the only family that handles the bilingual signal natively without any extra mechanism.

Why BPE specifically (and not other subword schemes)¶

The subword family has several variants:

BPE (Sennrich et al., 2016). Greedy merge by frequency. Simple, deterministic, fast.
WordPiece (Schuster & Nakajima, 2012). Like BPE but merges to maximize likelihood under a unigram model. Used by BERT.
Unigram (Kudo, 2018). Train a unigram language model, prune low-probability tokens. Used by SentencePiece / T5.

All three reach similar quality on benchmarks. We pick BPE because:

The algorithm is the simplest of the three. Borja can implement it from scratch in ~150 lines of Python.
It is what GPT-2 uses (and we borrow GPT-2's byte-level variant). Connects directly to modern open-source LLM code in Phase 24.
It is deterministic. No EM training, no sampling, no ambiguity: the same corpus and the same tie-breaking rule always yield the same merges.

WordPiece and Unigram are survey-only in theory 02; we don't implement them.
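To make "greedy merge by frequency" concrete, here is a sketch of a single merge step over a sequence of byte IDs. It is not the full algorithm (theory 02 covers that); the function names and the tie-breaking rule (first pair encountered wins) are assumptions made for this illustration.

```python
from collections import Counter


def most_frequent_pair(ids: list[int]) -> tuple[int, int]:
    """Return the most common adjacent pair; ties go to the pair seen first."""
    counts = Counter(zip(ids, ids[1:]))
    return max(counts, key=counts.get)


def merge(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every non-overlapping occurrence of pair with new_id, left to right."""
    out: list[int] = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


ids = list("he works, she works".encode("utf-8"))
pair = most_frequent_pair(ids)
merged = merge(ids, pair, 256)  # 256 is the first ID beyond the 256 base bytes
print(bytes(pair), len(ids), "->", len(merged))
```

Training repeats this step, assigning a fresh ID each time, until the vocabulary reaches the target size \(V\).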

What the choice leaks into¶

Phase 12 corpus: the corpus is the data BPE trains on. Its design (20 verbs × 5 tenses × 3 persons + Spanish pairs) determines which merges win.
Phase 13 embeddings: the embedding matrix is (V, d) where V is set here.
Phase 15 attention: the input sequence length depends on tokenization granularity. A 6-word sentence becomes ~6–10 tokens; attention cost is then ~36–100 inner products.
Phase 18 training: the cross-entropy loss is over \(V\) classes; small \(V\) is cheap per step.
Phase 22 inference: generation samples one token at a time; longer tokens (e.g., works as one token) mean fewer samples per character of output.
Phase 32 grammar-tutor agent: the agent will receive a learner's misspelled or mis-conjugated sentence; byte-level BPE guarantees it always tokenizes.

Every phase from 11 onward is downstream of this one. Choose well.

A quick semantic check (the post-training sanity gate)¶

After training BPE on our English-verb corpus with Spanish pairs, the top merges should be interpretable as morphologically or distributionally meaningful substrings. Expected wins (the lab DoD checks for these):

work, play, walk, listen: frequent verb stems with their leading space.
-s (after stems): 3rd-person singular present marker.
-ed: regular past simple marker.
will: simple future marker.
to: marker of the going to future and of the infinitive.
he, she, I, you, it: pronouns.
For the Spanish side: trab, come, ar, er, ió, ará (Spanish conjugation stems and endings).
\n (newline), . (period), , (comma): separators.

If we train BPE and the top merges are nonsense, something is wrong with the corpus, the training, or both. The "top merges are interpretable" sanity check is a phase DoD item, not a vibe call.
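One way to turn this gate into a mechanical check is sketched below. Everything here is hypothetical: the helper name, the representation of learned tokens as a list of `bytes` in merge order, and the `top_k` cutoff are assumptions, not the lab's actual DoD script.

```python
def missing_expected_merges(
    learned: list[bytes], expected: list[str], top_k: int
) -> list[str]:
    """Return the expected substrings that are absent from the first top_k learned tokens."""
    top = set(learned[:top_k])
    return [e for e in expected if e.encode("utf-8") not in top]


# Made-up example data: tokens in the order the merges were learned.
learned_tokens = [b" work", b"ed", b" will", b"s", b" play"]
expected_tokens = [" work", "ed", " listen"]

print(missing_expected_merges(learned_tokens, expected_tokens, top_k=4))
# [' listen']: this stem never became a top merge, so the gate would fail
```

An empty result means every expected substring appears among the top merges.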

Why we do not just use `tiktoken`¶

tiktoken is OpenAI's open-source BPE library — fast, correct, popular. Why not pip install tiktoken and skip Phase 11?

Three reasons:

CLAUDE.md §0 hard-rule 4 (build before abstracting). The whole curriculum is "build, then read other people's code". Skipping the build skips the learning.
Customization. tiktoken ships with GPT-2/4 vocabs. Our corpus is not English prose; it is a deliberately small grammatical matrix with Spanish pairs. Training our own vocab produces merges that match our data. The morpheme-level pattern (-s, -ed) is unlikely to be a top merge in GPT's vocab; it will be in ours.
Transparency. When something is wrong with how the model handles a specific input, you need to step into the tokenizer. Black-box library calls hide bugs.

In Phase 24 we read tiktoken's source as part of the "framework comparison" sub-phase. Our hand-built version is the reference we compare against.

Tokenization is a robustness boundary¶

The Phase 32 grammar-tutor agent receives a learner's sentence — possibly mis-spelled (wroked), possibly mis-conjugated (he goed), possibly bilingual (Yo work mañana). Every token it processes was produced by this tokenizer. Three robustness considerations:

No tokenizer crashes. A malformed byte sequence must tokenize, not throw. Byte-level BPE guarantees this.
Unicode confusables. 0 (digit zero) vs O (capital letter O) vs Greek Ο (omicron). They look identical, render identically, but tokenize differently. The grammar agent must reason about visual identity vs token identity. We don't solve this; we expose it.
NFC/NFD ambiguity. mañana can be stored two ways: NFC (6 codepoints, one of them the precomposed ñ, 7 bytes in UTF-8) or NFD (7 codepoints, with ñ split into n + combining tilde, 8 bytes in UTF-8). Both render identically; they tokenize differently at the byte level. Theory 03 addresses this explicitly.

These come up again in Phase 32. For now, byte-level BPE is the right choice because it leaves no tokenizer-level surprise behavior — all the surprises are in the model itself.
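The confusable and NFC/NFD cases can be reproduced with the standard library's `unicodedata` module. This sketch only exposes the differences; it does not attempt to resolve them.

```python
import unicodedata

# Same visible word, two normalization forms.
nfc = unicodedata.normalize("NFC", "ma\u00f1ana")
nfd = unicodedata.normalize("NFD", "ma\u00f1ana")
print(len(nfc), len(nfd))                                  # 6 7 codepoints
print(len(nfc.encode("utf-8")), len(nfd.encode("utf-8")))  # 7 8 bytes
print(nfc == nfd)                                          # False

# Three look-alike characters with three different byte sequences.
for ch in ["0", "O", "\u039f"]:
    print(unicodedata.name(ch), list(ch.encode("utf-8")))
# DIGIT ZERO [48]
# LATIN CAPITAL LETTER O [79]
# GREEK CAPITAL LETTER OMICRON [206, 159]
```

Since a byte-level tokenizer starts from these byte sequences, each variant enters the model as a different token sequence.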

One-paragraph recap¶

A tokenizer maps strings to integer ID sequences and back. The choice constrains every later phase: vocab size sets the embedding matrix size, token granularity sets the sequence length and therefore attention compute. Three families: character (too long), word (out-of-vocab fatal), subword (Goldilocks). Among subwords, byte-level BPE is what GPT-2 and modern LLMs use because it is simple, deterministic, OOV-free, Unicode-robust, and handles our bilingual English-Spanish verb corpus without any extra mechanism. We build it from scratch; Phase 24 compares to tiktoken.

What this theory page does NOT cover¶

The BPE algorithm itself. That's theory 02.
Detailed byte / Unicode mechanics. Theory 03.
Subword regularization (random sub-token dropout during training to make models robust to alternative tokenizations). A Phase 18+ concern.
Vocabulary growth strategies (start small and grow, or fix at training time). We fix at training time; growing-vocab schemes are research-only.

Next: theory/01-character-word-subword.md.