00 — Why tokenization is the first compromise

A tokenizer converts text into a sequence of integer IDs, the alphabet the model operates on. Every choice (characters, words, subwords, bytes) has a cost: coverage, sequence length, error handling. The choice leaks into everything else in the project.


What a tokenizer is

A tokenizer is a deterministic, invertible (or nearly invertible) mapping from strings to finite sequences of integer IDs. The IDs are what the model actually sees. The vocabulary size \(V\) is the number of distinct token IDs.

Pseudo-formally:

encode : str → list[int]
decode : list[int] → str

with the property decode(encode(s)) ≈ s (exact for byte-level BPE; sometimes lossy for character-normalized tokenizers).
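As a minimal sketch of the round-trip property, assume the simplest possible byte-level tokenizer: every UTF-8 byte is its own ID and no merges have been learned yet. The function bodies are illustrative assumptions, not the tokenizer built in the lab.

```python
def encode(s: str) -> list[int]:
    """Map a string to integer IDs; here each UTF-8 byte is its own ID."""
    return list(s.encode("utf-8"))


def decode(ids: list[int]) -> str:
    """Invert encode: reassemble the bytes and decode them as UTF-8."""
    return bytes(ids).decode("utf-8")


# Escape sequence for the precomposed n-with-tilde, so the literal is unambiguous.
sample = "he works ma\u00f1ana"
ids = encode(sample)

# Exact round trip: nothing is normalized or dropped on the way.
round_trip_ok = decode(ids) == sample
```

Because nothing is normalized or discarded, the round trip here is exact rather than approximate.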

Two non-obvious consequences:

  1. The vocab size \(V\) controls model size. The embedding matrix is (V, d) and the output projection is (d, V). For \(d = 768\) and \(V = 50{,}000\), the embedding alone has \(768 \times 50{,}000 \approx 38\)M parameters, and the output projection has as many again unless its weights are tied to the embedding. Doubling \(V\) adds another 38M per matrix. The choice of \(V\) is a model-architecture decision, made here, before any model exists.

  2. The token length distribution controls sequence-length costs. A character-level tokenizer splits "works" into 5 tokens; a word-level tokenizer makes it 1; a subword tokenizer makes it 1 or 2 depending on training frequency. Transformer attention is \(O(L^2)\) in sequence length \(L\), so a 5× longer tokenization costs 25× the attention compute. The choice is a runtime decision.
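The arithmetic behind both consequences can be checked in a few lines. This is a back-of-the-envelope sketch; the helper names are hypothetical, and the attention estimate counts only the quadratic term.

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Number of parameters in a (V, d) embedding matrix."""
    return vocab_size * d_model


def attention_cost_ratio(length_ratio: float) -> float:
    """Relative attention cost when the sequence gets length_ratio times longer."""
    return length_ratio ** 2


d_model = 768
vocab_size = 50_000

base = embedding_params(vocab_size, d_model)           # 38_400_000
doubled = embedding_params(2 * vocab_size, d_model)    # 76_800_000
extra_from_doubling = doubled - base                   # another 38_400_000

# Character-level "works" is 5 tokens instead of 1, so the quadratic term grows 25x.
char_vs_word = attention_cost_ratio(5)
```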

The tokenizer is the project's first compromise. It cannot be re-decided later without retraining everything.

The three options (and the one we will build)

There are three families of tokenizers in modern use:

Family A — character-level

"I work"['I', ' ', 'w', 'o', 'r', 'k'][73, 32, 119, 111, 114, 107].

  • Pro: trivial. Never out-of-vocab.
  • Con: sequence length is 5–10× longer than other methods. Quadratic attention cost grows 25–100×. Compute-prohibitive.
  • Where used: very small character-level LMs and ByT5 (Google, 2021; strictly byte-level, with the same long-sequence cost), but not the mainstream LLM lineage.
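A character-level tokenizer is small enough to write out in full. The sketch below assumes the token ID is simply the Unicode code point, which reproduces the IDs in the example above; a real implementation might remap characters to a dense ID range instead.

```python
def char_encode(text: str) -> list[int]:
    """One token per character; the ID is the character's code point."""
    return [ord(ch) for ch in text]


def char_decode(ids: list[int]) -> str:
    """Inverse mapping: code points back to characters."""
    return "".join(chr(i) for i in ids)


ids = char_encode("I work")

# The price of simplicity: one token per character, spaces included.
tokens_per_sentence = len(char_encode("he works"))
```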

Family B — word-level

"I worked"['I', ' ', 'worked'][12, 0, 487].

  • Pro: short sequences. Tokens carry semantic meaning.
  • Con: the out-of-vocabulary problem is fatal. Misspell the verb as "wroked" and the output becomes ['I', ' ', '<unk>']. The model cannot even see the misspelled input. For grammar correction (the Phase 32 agent task), this is a non-starter: the agent must see misspellings to correct them.
  • Where used: classical NLP pre-2017. Mostly retired.
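The out-of-vocabulary failure is easy to reproduce. The sketch below uses a hypothetical four-entry vocabulary whose IDs mirror the example above (the ID chosen for `<unk>` is an assumption), and a simple regex split that keeps single spaces as tokens.

```python
import re

UNK = "<unk>"

# Hypothetical toy vocabulary; the <unk> ID is an arbitrary choice for illustration.
vocab = {" ": 0, UNK: 1, "I": 12, "worked": 487}


def word_tokenize(text: str) -> list[str]:
    """Split into runs of non-space characters and individual whitespace characters."""
    return re.findall(r"\S+|\s", text)


def word_encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Look each word up; anything unseen collapses to the <unk> ID."""
    return [vocab.get(tok, vocab[UNK]) for tok in word_tokenize(text)]


correct = word_encode("I worked", vocab)
misspelled = word_encode("I wroked", vocab)
other_typo = word_encode("I workd", vocab)
```

Two different misspellings produce the same ID sequence, so whatever distinguished them is gone before the model sees anything.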

Family C — subword / BPE

"he works"['he', ' work', 's'][58, 109, 23].

  • Pro: balances both. Sequences stay short (close to word-level for frequent forms). Never out-of-vocab if implemented at the byte level. The -s suffix and the work stem are separate tokens — the model can learn that the -s token, in this position, marks 3rd-person singular. That's exactly the structure our Phase 32 grammar-tutor agent needs.
  • Con: more code, more design choices.
  • Where used: every modern LLM. GPT-2/3/4, Llama, Claude, Mistral, etc.
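To see why subword pieces keep a misspelling visible, here is a small sketch using greedy longest-match segmentation over a hypothetical toy vocabulary. This is deliberately not BPE's merge-order encoding; it is a simpler stand-in that produces the same split for the example above.

```python
def greedy_segment(text: str, vocab: set[str]) -> list[str]:
    """Split text left to right, always taking the longest piece found in vocab.

    Falls back to a single character when nothing longer matches, so no input
    is ever out-of-vocabulary.
    """
    pieces = []
    i = 0
    while i < len(text):
        end = i + 1  # single-character fallback
        for j in range(len(text), i + 1, -1):
            if text[i:j] in vocab:
                end = j
                break
        pieces.append(text[i:end])
        i = end
    return pieces


# Hypothetical learned pieces, chosen to match the example above.
toy_vocab = {"he", " work", "s", "ed"}

regular = greedy_segment("he works", toy_vocab)
misspelled = greedy_segment("he wroked", toy_vocab)
```

The misspelled verb falls apart into single characters, but every character survives and the "ed" ending is still recognized as its own piece.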

We implement byte-level BPE — the variant where the base alphabet is the 256 possible byte values (not Unicode characters). This makes the tokenizer:

  1. OOV-free. Any byte string is tokenizable.
  2. Unicode-robust. NFC/NFD normalization differences, BOM markers, and odd encodings never make it fail; they only change which bytes it sees.
  3. Bilingual without ceremony. Spanish mañana (7 bytes in UTF-8, because ñ takes 2) and English tomorrow (8 bytes) tokenize side-by-side in one shared vocab. No language tag needed.
  4. Misspelling-safe. Adversarial or accidental garbage bytes still tokenize.

The cost: the base alphabet is bytes, not characters. So mañana starts as 7 byte-singletons (the ñ is 2 bytes: 0xC3 0xB1). After BPE training on a Spanish-containing corpus, the two bytes of ñ get merged into one symbol, and then possibly longer pieces such as aña, or mañana itself, if frequent enough.
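The byte-level starting point can be inspected directly with Python's built-in UTF-8 encoder. The variable names are illustrative; the only assumption is that the base token ID equals the byte value.

```python
# Escape sequence used so the literal is the precomposed form however this file is saved.
word_es = "ma\u00f1ana"
word_en = "tomorrow"

bytes_es = list(word_es.encode("utf-8"))  # 7 base tokens; 195 and 177 together are the n-with-tilde
bytes_en = list(word_en.encode("utf-8"))  # 8 base tokens, one per ASCII letter

# Bytes that are not valid UTF-8 still map to valid base IDs in 0..255.
garbage = bytes([0xFF, 0xFE, 0x00])
garbage_ids = list(garbage)
```

Both words live in the same 256-symbol base alphabet, which is why no language tag or per-language vocabulary is needed.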

For our English-verb-grammar corpus with paired Spanish translations (per §A13), byte-level is the only correct choice: it's the only family that handles the bilingual signal natively without any extra mechanism.

Why BPE specifically (and not other subword schemes)

The subword family has several variants:

  • BPE (Sennrich et al., 2016). Greedy merge by frequency. Simple, deterministic, fast.
  • WordPiece (Schuster & Nakajima, 2012). Like BPE but merges to maximize likelihood under a unigram model. Used by BERT.
  • Unigram (Kudo, 2018). Train a unigram language model, prune low-probability tokens. Implemented in SentencePiece; used by T5.

All three reach similar quality on benchmarks. We pick BPE because:

  1. The algorithm is the simplest of the three. Borja can implement it from scratch in ~150 lines of Python.
  2. It is what GPT-2 uses (and we borrow GPT-2's byte-level variant). Connects directly to modern open-source LLM code in Phase 24.
  3. It is deterministic. No EM training, no ambiguity. Given the same corpus and a fixed tie-breaking rule, training reproduces the same merges; no random seed is involved.

WordPiece and Unigram are survey-only in theory 02; we don't implement them.
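For a sense of how little machinery "greedy merge by frequency" needs, here is a sketch of a single merge step over a byte sequence. The full training loop is the subject of theory 02; the function names and the first-seen tie-breaking rule are assumptions made for this illustration.

```python
from collections import Counter


def most_frequent_pair(seq: list[int]) -> tuple[int, int]:
    """Return the most common adjacent pair; ties go to the pair seen first."""
    pairs = Counter(zip(seq, seq[1:]))
    return max(pairs, key=pairs.get)


def merge_pair(seq: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every non-overlapping occurrence of pair with new_id."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


seq = list("he works, she works".encode("utf-8"))
pair = most_frequent_pair(seq)        # several pairs occur twice; (h, e) is seen first
merged = merge_pair(seq, pair, 256)   # 256 is the first ID after the 256 byte values
```

Training repeats this step until the vocabulary reaches the target size, recording each merge in order.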

What the choice leaks into

  • Phase 12 corpus: the corpus is the data BPE trains on. Its design (20 verbs × 5 tenses × 3 persons + Spanish pairs) determines which merges win.
  • Phase 13 embeddings: the embedding matrix is (V, d) where V is set here.
  • Phase 15 attention: the input sequence length depends on tokenization granularity. A 6-word sentence becomes ~6–10 tokens; attention cost is then ~36–100 inner products.
  • Phase 18 training: the cross-entropy loss is over \(V\) classes; small \(V\) is cheap per step.
  • Phase 22 inference: generation samples one token at a time; longer tokens (e.g., works as one token) mean fewer samples per character of output.
  • Phase 32 grammar-tutor agent: the agent will receive a learner's misspelled or mis-conjugated sentence; byte-level BPE guarantees it always tokenizes.

Every phase from 11 onward is downstream of this one. Choose well.

A quick semantic check (the post-training sanity gate)

After training BPE on our English-verb corpus with Spanish pairs, the top merges should be interpretable as morphologically or distributionally meaningful substrings. Expected wins (the lab DoD checks for these):

  • work, play, walk, listen: frequent verb stems, each with its leading space (as in ' work').
  • -s (after stems): 3rd-person singular present marker.
  • -ed: regular past simple marker.
  • will: simple future marker.
  • to: going to future marker and infinitive marker.
  • he, she, I, you, it: pronouns.
  • For the Spanish side: trab, come, ar, er, ará (Spanish conjugation stems and endings).
  • '\n', '.', ',': separators.

If we train BPE and the top merges are nonsense, something is wrong with the corpus, the training, or both. The "top merges are interpretable" sanity check is a phase DoD item, not a vibe call.
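One way to make this gate executable, assuming the trained tokenizer can list its learned tokens as byte strings. The function name and both token lists below are hypothetical; the actual DoD check lives in the lab.

```python
def missing_expected_tokens(learned: list[bytes], expected: list[bytes]) -> list[bytes]:
    """Return the expected tokens that training did not produce, in order."""
    learned_set = set(learned)
    return [tok for tok in expected if tok not in learned_set]


# Hypothetical output of a training run, for illustration only.
learned_tokens = [b" work", b"ed", b" will", b"he", b" to"]

# A subset of the expected wins listed above.
expected_tokens = [b" work", b" play", b"ed", b" will"]

missing = missing_expected_tokens(learned_tokens, expected_tokens)
gate_passed = not missing
```

A non-empty `missing` list points at which expected morphemes the corpus or the training loop failed to surface.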

Why we do not just use tiktoken

tiktoken is OpenAI's open-source BPE library — fast, correct, popular. Why not pip install tiktoken and skip Phase 11?

Three reasons:

  1. CLAUDE.md §0 hard-rule 4 (build before abstracting). The whole curriculum is "build, then read other people's code". Skipping the build skips the learning.
  2. Customization. tiktoken ships with GPT-2/4 vocabs. Our corpus is not English prose; it is a deliberately small grammatical matrix with Spanish pairs. Training our own vocab produces merges that match our data. The morpheme-level pattern (-s, -ed) is unlikely to be a top merge in GPT's vocab; it will be in ours.
  3. Transparency. When something is wrong with how the model handles a specific input, you need to step into the tokenizer. Black-box library calls hide bugs.

In Phase 24 we read tiktoken's source as part of the "framework comparison" sub-phase. Our hand-built version is the reference we compare against.

Tokenization is a robustness boundary

The Phase 32 grammar-tutor agent receives a learner's sentence — possibly mis-spelled (wroked), possibly mis-conjugated (he goed), possibly bilingual (Yo work mañana). Every token it processes was produced by this tokenizer. Three robustness considerations:

  1. No tokenizer crashes. A malformed byte sequence must tokenize, not throw. Byte-level BPE guarantees this.
  2. Unicode confusables. 0 (digit zero) vs O (capital letter O) vs Greek Ο (omicron). They look alike (O and Ο are indistinguishable in most fonts) but are different code points and tokenize differently. The grammar agent must reason about visual identity vs token identity. We don't solve this; we expose it.
  3. NFC/NFD ambiguity. mañana can be stored two ways: NFC (6 codepoints, one of them the precomposed ñ) or NFD (7 codepoints, with ñ split into n + combining tilde). Both render identically; they tokenize differently at the byte level (7 bytes vs 8 in UTF-8). Theory 03 addresses this explicitly.
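The last two considerations can be observed with Python's standard `unicodedata` module. This sketch only inspects code points and bytes; it does not involve any trained tokenizer, and the variable names are illustrative.

```python
import unicodedata

# The same word in both normalization forms.
nfc = unicodedata.normalize("NFC", "ma\u00f1ana")
nfd = unicodedata.normalize("NFD", nfc)

# Same text after re-normalizing, but different byte sequences underneath.
same_text = unicodedata.normalize("NFC", nfd) == nfc
nfc_bytes = list(nfc.encode("utf-8"))
nfd_bytes = list(nfd.encode("utf-8"))

# Three look-alike characters with three different byte encodings.
confusables = {
    "digit zero": "0",
    "Latin capital O": "O",
    "Greek capital omicron": "\u039f",
}
confusable_bytes = {name: list(ch.encode("utf-8")) for name, ch in confusables.items()}
```

A byte-level tokenizer starts from different base tokens in each case, so any normalization has to happen before encoding if the two forms should be treated as the same input.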

These come up again in Phase 32. For now, byte-level BPE is the right choice because it leaves no tokenizer-level surprise behavior — all the surprises are in the model itself.

One-paragraph recap

A tokenizer maps strings to integer ID sequences and back. The choice constrains every later phase: vocab size sets the embedding matrix size, token granularity sets the sequence length and therefore attention compute. Three families: character (too long), word (out-of-vocab fatal), subword (Goldilocks). Among subwords, byte-level BPE is what GPT-2 and modern LLMs use because it is simple, deterministic, OOV-free, Unicode-robust, and handles our bilingual English-Spanish verb corpus without any extra mechanism. We build it from scratch; Phase 24 compares to tiktoken.

What this theory page does NOT cover

  • The BPE algorithm itself. That's theory 02.
  • Detailed byte / Unicode mechanics. Theory 03.
  • Subword regularization (random sub-token dropout during training to make models robust to alternative tokenizations). A Phase 18+ concern.
  • Vocabulary growth strategies (start small and grow, or fix at training time). We fix at training time; growing-vocab schemes are research-only.

Next: theory/01-character-word-subword.md.