01 — Character vs Word vs Subword: the OOV question¶

Three families of tokenizers: characters (small units, long sequences), words (semantic but with a fatal OOV problem), and subwords (the balance that won). Here we formalize the trade-off and explain why the subword family is the obvious choice in 2026, especially for a bilingual corpus of verb conjugations.

The trade-off, formalized¶

For a fixed corpus of total character count \(C\) and a vocabulary of size \(V\):

Family	Tokens per character (approx)	OOV behavior	Vocab size
Character	1	None (every char in vocab)	\(\approx 100\) (ASCII), \(\approx 10^5\) (Unicode)
Word	\(1/L_w\) where \(L_w\) is avg word length (~5 in English)	`<unk>` for any unseen word	\(10^5\) – \(10^6\) for English
Subword (BPE)	\(1/L_t\) where \(L_t\) is avg token length (~3-4 chars)	None if byte-level	\(10^3\) – \(10^5\)

The trade is vocab size vs sequence length vs OOV robustness. We want short sequences (cheap attention), a small vocab (compact embedding), and zero OOV (robust input). Against word-level, subword wins on vocab size and OOV robustness and gives up only a little sequence length (tokens of ~3-4 characters instead of ~5). Against character-level, subword wins on sequence length; its only loss is a larger vocab, and that cost is fixed and small.

The out-of-vocabulary (OOV) failure¶

Word-level tokenizers split text on whitespace + punctuation. The vocab is the union of all words seen during training (or all words within some frequency threshold). Any input word not in the vocab becomes a special <unk> token.

Why OOV is fatal for our use case¶

Phase 32's grammar-tutor agent receives a learner's sentence — which is by construction likely to contain errors. Learners will:

Mis-conjugate verbs: he goed, she eated, I am go to work.
Mis-spell: wroked (instead of worked), studyed (instead of studied).
Mix languages: Yo work mañana.

Every misspelled or over-regularized form above (goed, eated, wroked, studyed) becomes <unk> under word-level tokenization, because none of them was seen in training. The model literally cannot distinguish goed, eated, and studyed: they are all the same <unk>. That is catastrophic: the agent is supposed to correct these errors, but with word-level tokenization it cannot even see them.
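The collapse into `<unk>` can be reproduced in a few lines. This is a minimal sketch, assuming a lowercase whitespace-and-punctuation splitter and a made-up five-sentence training set; the helper names (`word_tokenize`, `build_vocab`, `encode`) are illustrative and not part of any phase's codebase.

```python
import re

UNK = "<unk>"

def word_tokenize(text):
    """Split on whitespace and punctuation (a deliberately simple rule)."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(sentences):
    """Vocabulary = every distinct word seen in training, plus <unk>."""
    vocab = {UNK: 0}
    for sentence in sentences:
        for word in word_tokenize(sentence):
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(text, vocab):
    """Map each word to its id; unseen words all map to the <unk> id."""
    return [vocab.get(word, vocab[UNK]) for word in word_tokenize(text)]

# Hypothetical training data: only correct forms are ever seen.
train = ["he went to work", "she ate lunch", "I studied", "he goes", "she eats"]
vocab = build_vocab(train)

ids_goed = encode("he goed", vocab)
ids_eated = encode("he eated", vocab)
print(ids_goed, ids_eated)  # both sentences produce the same id sequence
```

The two learner sentences become indistinguishable after encoding, which is the failure described above.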

Subword tokenizers do not have this problem. With a trained vocabulary that contains the stem and the suffix, goed splits into something like [' go', 'ed']: the model still sees the go stem (which it has learned in the context of go / goes / went / gone / going), notices the incorrectly appended -ed suffix, and can flag the conjugation error. eated splits into [' eat', 'ed'] in the same way.
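To make the stem-plus-suffix split concrete, here is a rough sketch that segments the erroneous forms with a hand-written vocabulary. Two assumptions to note: the vocabulary is hypothetical (a trained BPE vocab may differ), and the segmentation rule is greedy longest-match, which is simpler than replaying BPE merges but gives the same pieces for these inputs.

```python
def segment(text, vocab):
    """Greedy longest-match segmentation; single characters are the fallback."""
    pieces, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking until something matches.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                pieces.append(text[i:j])
                i = j
                break
    return pieces

# Hypothetical subword vocabulary: word-initial stems carry a leading space.
vocab = {" go", " eat", " study", " work", "ed", "ing", "s"}

goed = segment(" goed", vocab)
eated = segment(" eated", vocab)
studyed = segment(" studyed", vocab)
print(goed, eated, studyed)
```

Unlike the word-level case, the three errors stay distinct (different stems) while sharing the visible `ed` suffix.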

Character-level avoids OOV but pays in sequence length¶

A character-level tokenizer trivially has every character in its vocab. No OOV. But:

A 60-character sentence becomes a 60-token sequence.
Attention is \(O(L^2)\) → \(3.6 \times 10^3\) ops per sentence.
Same sentence as subword tokens (assume avg 4 chars/token) → 15 tokens → \(225\) ops. 16× cheaper.
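The arithmetic in the list above can be checked directly. This sketch counts only the \(L^2\) pairwise attention scores and ignores constant factors, heads, and the hidden dimension; the 4 characters per token figure is the assumption stated in the text.

```python
def attention_pairs(num_tokens):
    """Number of query-key pairs in full self-attention: L squared."""
    return num_tokens ** 2

sentence_chars = 60
chars_per_token = 4  # assumed average subword length from the text

char_len = sentence_chars                        # one token per character
subword_len = sentence_chars // chars_per_token  # 15 tokens

char_cost = attention_pairs(char_len)        # 3600
subword_cost = attention_pairs(subword_len)  # 225
ratio = char_cost / subword_cost             # 16.0
print(char_cost, subword_cost, ratio)
```

Because the cost is quadratic, a 4× shorter sequence gives a 16× saving, not 4×.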

For Phase 18 training on the Phase 12 corpus (~600 forms / sentences, avg 30–50 chars each), the subword choice keeps an epoch in the minutes-range on CPU.

What "subword" actually means¶

A subword is a unit that sits between a single character and a whole word, usually a frequently occurring prefix, suffix, or stem; very frequent short words also end up as single tokens. Examples for English verbs:

-ing — present participle / continuous form (working, playing, going).
-ed — regular past simple (worked, played).
-s — 3rd-person singular present (works, goes, eats).
will — simple future marker.
going to — periphrastic future (often split as going + to).
he, she, I, you, it — pronouns.

For our English+Spanish verb corpus, expected subwords:

English stems: work, play, walk, listen, watch, study, be, have, go, come, eat, write.
English suffixes: -s, -ed, -ing.
Spanish stems: trab, jug, cam, habl, escuch, mir, estudi.
Spanish endings: -ar, -er, -ir, -o, -as, -a, -amos, -an, -ó, -ará.
Punctuation and whitespace: the space character, the period (.), the comma (,), and the newline (\n).

BPE discovers these subwords automatically by counting and merging frequent pairs. We hard-code nothing.

Subword tokenizers in 2026¶

The three commonly used subword tokenizers:

BPE (Byte-Pair Encoding)¶

Originally a compression algorithm (Gage, 1994); Sennrich et al. (2016) adapted it for NLP. Greedy: at each step, find the most frequent adjacent pair and merge them. Repeat until vocab is full.

Determinism: with a fixed corpus and a fixed tie-breaking rule, BPE is fully deterministic.
Speed: training is \(O(V \times N)\) in the naive version; \(O(V \log N + N)\) with priority queues.
Models using it: GPT-2/3/4 (byte-level BPE), Llama 1 and 2 (SentencePiece's BPE mode with byte fallback), Llama 3 (byte-level BPE), RoBERTa.

We build this.
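The greedy loop described above fits in a short sketch. This is the naive version that recounts every pair after each merge, working on characters rather than bytes; the function names, the four-word corpus, and the tie-breaking rule (highest count, then lexicographically smallest pair) are assumptions made for illustration, not the lab implementation.

```python
from collections import Counter

def merge_pair(seq, pair):
    """Replace every adjacent occurrence of `pair` in `seq` with one symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train_bpe(texts, num_merges):
    """Naive BPE training: recount all adjacent pairs before each merge."""
    seqs = [list(t) for t in texts]
    merges = []
    for _ in range(num_merges):
        counts = Counter()
        for seq in seqs:
            counts.update(zip(seq, seq[1:]))
        if not counts:
            break
        # Fixed tie-break so the result is reproducible.
        best = min(counts, key=lambda p: (-counts[p], p))
        merges.append(best)
        seqs = [merge_pair(seq, best) for seq in seqs]
    return merges, seqs

merges, seqs = train_bpe(["work", "works", "worked", "working"], 3)
print(merges)
print(seqs)
```

After three merges the shared stem is a single token and the suffix characters remain separate. A different tie-breaking rule would reach the same stem through a different merge order.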

WordPiece¶

Like BPE, but the merge criterion is likelihood maximization under a unigram language model, not raw frequency. The pair with the highest likelihood ratio merges next.

Determinism: same as BPE.
Speed: similar.
Models using it: BERT, DistilBERT.
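The difference between the two merge criteria shows up even on three words. The sketch below uses the commonly described WordPiece-style score, count(pair) / (count(left) × count(right)); treating that formula as the criterion is an assumption here, and the `##` continuation markers used by BERT's tokenizer are left out for brevity.

```python
from collections import Counter

def pair_scores(words):
    """Return raw pair counts and WordPiece-style scores over characters."""
    unit_counts, pair_counts = Counter(), Counter()
    for word in words:
        symbols = list(word)
        unit_counts.update(symbols)
        pair_counts.update(zip(symbols, symbols[1:]))
    scores = {
        pair: count / (unit_counts[pair[0]] * unit_counts[pair[1]])
        for pair, count in pair_counts.items()
    }
    return pair_counts, scores

pair_counts, scores = pair_scores(["work", "works", "worked"])

most_frequent = pair_counts.most_common(1)[0]     # what BPE would merge
highest_score = max(scores, key=scores.get)       # what the score prefers
print(most_frequent, highest_score, scores[highest_score])
```

Raw frequency favors the stem pairs (each seen three times), while the score favors `e` + `d`, because those two symbols never occur apart in this tiny sample.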

Survey-only.

Unigram (SentencePiece)¶

A probabilistic approach: maintain a vocabulary of subword candidates, train a unigram LM over them via EM, and prune low-probability tokens.

Determinism: stochastic; multiple random restarts during training.
Speed: slower than BPE.
Models using it: T5, mT5, XLM-R, ALBERT.
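For intuition about how a unigram LM chooses a segmentation, the following sketch runs Viterbi search over a hand-written piece table. It covers only the inference step, not EM training or pruning, and the probabilities are invented for illustration.

```python
import math

def viterbi_segment(text, log_probs):
    """Most probable segmentation of `text` under a unigram LM over pieces."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-prob of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best[start] + log_probs[piece] > best[end]:
                best[end] = best[start] + log_probs[piece]
                back[end] = start
    # Walk the back-pointers from the end of the string to recover the pieces.
    pieces, i = [], n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1], best[n]

# Hypothetical piece probabilities (they sum to 1.0); every single character
# of the input is present so that some segmentation always exists.
probs = {"work": 0.2, "ed": 0.2, "w": 0.1, "o": 0.1,
         "r": 0.1, "k": 0.1, "e": 0.1, "d": 0.1}
log_probs = {piece: math.log(p) for piece, p in probs.items()}

pieces, score = viterbi_segment("worked", log_probs)
print(pieces, math.exp(score))
```

The stem-plus-suffix split wins because two likely pieces multiply to a higher probability than six single characters.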

Survey-only.

Comparison summary¶

Property	BPE	WordPiece	Unigram
Algorithm complexity	Simplest	Medium	Highest
Training determinism	Yes (with tie-breaking)	Yes	No (EM restarts)
Empirical quality	Comparable	Comparable	Comparable
Library support	`tiktoken`, `transformers`	`transformers` (BERT)	`sentencepiece`
Our pick	✅	—	—

All three reach broadly similar quality on standard benchmarks, although Bostrom & Durrett (2020) report a modest edge for Unigram over BPE in language-model pretraining. We pick the simplest.

Pre-tokenization: a subtle prerequisite¶

Most production BPE implementations do pre-tokenization before BPE. They first split the input on whitespace and punctuation using a regex, then run BPE within each pre-token.

GPT-2's pat (the famous regex):

's|'t|'re|'ve|'m|'ll|'d| ?[\p{L}]+| ?[\p{N}]+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+

This is doing two things:

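Word boundary respect. BPE won't merge across the boundary between the pre-tokens 'foo' and ' bar', even if the pair (o, space) is frequent.
Whitespace handling. The optional leading space (the ' ?' prefix in the pattern) attaches to the following word, so word-initial subwords are marked differently from word-internal ones.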
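Both effects are easy to see by running a pattern of this shape over a sentence. Python's standard `re` module does not support `\p{L}` or `\p{N}`, so this sketch substitutes `[^\W\d_]` for letters and `\d` for numbers; that substitution is an approximation for illustration, not the exact GPT-2 pattern.

```python
import re

# Approximation of the GPT-2 pattern using stdlib character classes only.
GPT2_LIKE = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d"   # common English contractions
    r"| ?[^\W\d_]+"              # optional space + run of letters
    r"| ?\d+"                    # optional space + run of digits
    r"| ?(?:[^\s\w]|_)+"         # optional space + run of other symbols
    r"|\s+(?!\S)|\s+"            # remaining whitespace
)

pre_tokens = GPT2_LIKE.findall("he worked, didn't he?")
print(pre_tokens)
```

Each word keeps its leading space, punctuation is split off, and BPE merges would then be confined to one pre-token at a time.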

For our verb-grammar corpus, we skip pre-tokenization regex in v1 because:

The corpus is small and clean. Sentences are short, well-formed, and use ASCII whitespace identically across English and Spanish.
BPE on raw bytes will discover the right boundaries (the space byte 0x20 will dominate as a frequent pair-component).
Pre-tokenization adds a Unicode-property regex dependency that would obscure the core algorithm.

We do keep one convention: leading space matters. ' work' (with leading space) is conceptually different from 'work' (word-internal). BPE on raw bytes treats them as different byte sequences, which mirrors the GPT-2 behaviour without the regex sentinel. Note this in the BLUEPRINT; revisit if the trained vocab does weird things.
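The leading-space convention is visible at the byte level. A small sketch, using only built-in string encoding:

```python
with_space = " work".encode("utf-8")  # word-initial form
internal = "work".encode("utf-8")     # word-internal form

print(list(with_space))  # starts with 32, the space byte 0x20
print(list(internal))

# "rework" contains the internal byte sequence but not the word-initial one.
rework = "rework".encode("utf-8")
contains_internal = internal in rework
contains_initial = with_space in rework
```

Since the two forms differ by the 0x20 byte, byte-level BPE can learn separate tokens for them without any extra marker.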

Why Phase 11 lab does the toy by-hand first¶

Lab 00 is a paper-and-pencil BPE trace on a 5-sentence toy corpus drawn from our verb-grammar topic:

I work
you work
he works
I worked
he worked

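Borja merges 5 pairs by hand, writes the pair-count table at each step, and tracks the resulting vocab. The expected first merges include w + o → wo, then wo + r → wor, and eventually work; these pairs are tied at 5 occurrences each (as is space + w), so the exact order depends on the tie-breaking rule. The -ed of worked, whose pairs occur twice, is a candidate for the merges that follow. The -s of works occurs only once, so it will likely stay a single-character token in a trace this short.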
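The step-0 pair-count table can be checked against the hand trace with a few lines. This sketch treats each sentence as a character sequence with spaces kept, matching the no-pre-tokenization choice above; it is a checking aid, not a replacement for the paper exercise.

```python
from collections import Counter

corpus = ["I work", "you work", "he works", "I worked", "he worked"]

pair_counts = Counter()
for sentence in corpus:
    symbols = list(sentence)
    pair_counts.update(zip(symbols, symbols[1:]))

# Print the table from most to least frequent pair.
for (left, right), count in pair_counts.most_common():
    print(f"{left!r} + {right!r}: {count}")
```

Four pairs share the top count, which is why the tie-breaking rule has to be fixed before the trace starts.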

Why hand-trace first? BPE training looks simple on paper but has fiddly implementation details:

Counts update incrementally after each merge (don't recount everything in production).
Tie-breaking matters for reproducibility.
Special tokens require explicit guards.

Doing it by hand once means the code in lab 01 isn't a black box.

Drill problems¶

Solutions in solutions/01-character-word-subword-ref.md (phase open).

Estimate the vocab size of an English word-level tokenizer trained on all of Wikipedia. (Hint: Heaps' law; order-of-magnitude ballpark \(10^5\) – \(10^6\).)
For a typical English text, what's the ratio (BPE tokens) / (characters)? (Hint: GPT-3 averages ~4 characters per token.) Why does that ratio differ for our verb corpus (short sentences, high morpheme repetition)?
A character-level tokenizer over our corpus would have vocab ≈ 80 (ASCII printable + Spanish accents). A byte-level BPE with vocab 1024 would have ~13× more vocabulary entries (embedding rows). Why pay 13× the embedding cost for a tokenizer that produces shorter sequences? (Hint: embedding cost vs attention cost, with attention being \(O(L^2)\).)
Given the corpus contains 20 English verbs and 20 Spanish verbs, how many distinct stem tokens should we expect BPE to discover at vocab size 512? (Plausible answer: ~30–40 stems, plus suffix tokens; the rest is whitespace and punctuation. Verify in lab 02.)

What this theory page does NOT cover¶

The BPE algorithm step-by-step. Theory 02.
Byte-level mechanics + UTF-8 of Spanish. Theory 03.
Re-training BPE when the corpus grows. Phase 12 closes by re-training on the full corpus; mechanics deferred there.
Tokenizer evaluation metrics (compression ratio, average token length). Mentioned in passing in lab 02; Phase 20 (evaluation harness) does not include tokenizer-specific metrics.

One-paragraph recap¶

Three families: character (no OOV, too long), word (semantic but OOV-fatal), subword (the equilibrium). Among subwords, BPE is the simplest deterministic option and what GPT-2 / Llama use. Our corpus is bilingual verb conjugations; byte-level BPE is OOV-free, Unicode-robust, and discovers morphology-rich merges (-s, -ed, will, stems) automatically. We skip pre-tokenization regex in v1 because the corpus is small and clean.

Next: theory/02-bpe-algorithm.md.