English · Español
Phase 11 — Tokenization Theory + BPE Implementation¶
Requires: 10 — Initialization, Normalization, Residuals Teaches:
tokenization·bpe·byte-level·subwords·vocabulary·zipfJump to any chapter from the phase reference index.
Chapter map¶
Pre-written per A12. Theory and lab statements are stable drafts; solutions are written just-in-time at phase open.
🇪🇸 Tokenizar es decidir el alfabeto del modelo. Aquí construimos un tokenizer BPE byte-level desde cero, entrenado sobre frases inglesas de conjugación verbal (con sus traducciones al español, §A13). Las fusiones más frecuentes deberían incluir morfemas como
-s,-ed,will,to,he,I.
Goal¶
Borja writes a pure-Python byte-level BPE tokenizer from scratch, trains it on a bootstrap English-verb corpus (later re-trained on the full Phase 12 corpus), and demonstrates exact round-trip invertibility on ASCII, Spanish accents, and emoji. The phase's headline artefact is the top-30 merges of the trained vocab — human-readable proof that the corpus shaped the tokenizer. Expect to see suffixes (-s, -ed, ing), function words (will, to), and person markers (I, he, you) emerge as top merges.
Read order¶
theory/00-motivation.md— why tokenization is the first irreversible compromise of any LM, and why the verb-grammar corpus drives the tokenizer's design.theory/01-character-word-subword.md— the three tokenizer families; why subword (BPE) wins for our bilingual verb corpus; what OOV would cost us.theory/02-bpe-algorithm.md— the BPE training and encoding algorithms, with a fully worked toy example over English verb sentences.theory/03-byte-level-and-unicode.md— bytes vs characters; UTF-8 of Spanish accents (á,ñ,¿); NFC vs NFD; emoji; the GPT-2 leading-space convention.lab/00-toy-bpe-by-hand.md— five merges on a 5-sentence English-verb toy corpus, on paper, before code.lab/01-implement-bpe.md— writesrc/minitoken/bpe.py. Train + encode + decode + property-based round-trip tests.lab/02-bpe-on-verb-corpus.md— train on the bootstrap verb corpus (and later the Phase 12 corpus) at three vocab sizes; produce Zipf plot + top-30 merges with morphology annotations.
solutions/ is empty during pre-write — populated at phase open.
Definition of Done¶
See PHASE_11_PLAN.md §6. Briefly:
src/minitoken/{bpe.py, vocab.py}mypy --strictclean, withtrain/encode/decode/save/load.- Property tests pass on 1k random UTF-8 inputs (
hypothesis) covering ASCII, Spanish range, and emoji. - Trained vocab on the verb corpus exposes morphology in its top merges:
-s,-ed,will,to,ing, plus at least 3 whole-verb single tokens. - Zipf plot of token frequencies committed.
- You can explain BPE training step-by-step with a 5-merge hand trace on the toy English verb corpus.
What this phase intentionally does NOT cover¶
- WordPiece (BERT-style) — survey only in theory 02.
- Unigram tokenizer (SentencePiece-style EM training) — survey only.
tiktokenortransformers.AutoTokenizer— explicitly forbidden by CLAUDE.md §0 hard rule 4 (notransformerslib before Phase 24). We build first, then read other people's code in Phase 24.- Pre-tokenization regex (the GPT-2
patthat splits on whitespace and punctuation before BPE). We operate on raw bytes. Discussed in theory but not implemented. - Tokenizer-level grammar awareness. BPE is statistical, not grammatical. It merges
-sbecause it is frequent, not because it is a morpheme. The model learns grammar from Phase 13 onward (embeddings + attention). - The full Phase 12 corpus. Phase 11 uses a small bootstrap corpus (≈ 30 hand-typed verb sentences in
data/raw/bootstrap-en.txt). The full 600-form corpus arrives in Phase 12 and the BPE is re-trained on it as the closing DoD item of Phase 12.
Phase 11's scope is the byte-level BPE tokenizer — the bridge between raw text and integer IDs. Nothing more.
Further reading¶
Optional — enrichment, not required to pass the phase.
- 📄 Neural Machine Translation of Rare Words with Subword Units — Sennrich, Haddow, Birch · 2015. the original BPE paper you reimplement.
- ✍️ Let's build the GPT Tokenizer — Karpathy · 2024. byte-level BPE built live, end to end.