Phase 11 — Tokenization Theory + BPE Implementation

Requires: Phase 10 (Initialization, Normalization, Residuals)
Teaches: tokenization · bpe · byte-level · subwords · vocabulary · zipf

Jump to any chapter from the phase reference index.

Chapter map

Pre-written per A12. Theory and lab statements are stable drafts; solutions are written just-in-time at phase open.

Tokenizing means deciding the model's alphabet. Here we build a byte-level BPE tokenizer from scratch, trained on English verb-conjugation sentences (with their Spanish translations, §A13). The most frequent merges should include morphemes such as -s, -ed, will, to, he, I.


Goal

Borja writes a pure-Python byte-level BPE tokenizer from scratch, trains it on a bootstrap English-verb corpus (later re-trained on the full Phase 12 corpus), and demonstrates exact round-trip invertibility on ASCII, Spanish accents, and emoji. The phase's headline artefact is the list of top-30 merges from the trained vocab: human-readable proof that the corpus shaped the tokenizer. Expect suffixes (-s, -ed, -ing), function words (will, to), and person markers (I, he, you) to emerge among the top merges.
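The round-trip requirement can already be checked at the byte level, before any merge exists. The sketch below is illustrative only: the helper names to_byte_ids and from_byte_ids are hypothetical and are not part of src/minitoken. It shows that every string maps to IDs in 0..255 and back, and that Spanish accents and emoji simply occupy more than one byte.

```python
import unicodedata


def to_byte_ids(text: str) -> list[int]:
    """Hypothetical helper: one base token ID (0..255) per UTF-8 byte."""
    return list(text.encode("utf-8"))


def from_byte_ids(ids: list[int]) -> str:
    """Hypothetical helper: inverse of to_byte_ids."""
    return bytes(ids).decode("utf-8")


samples = ["he walks", "él caminó", "¿mañana?", "I will go 🚀"]
for text in samples:
    ids = to_byte_ids(text)
    assert all(0 <= i <= 255 for i in ids)
    assert from_byte_ids(ids) == text  # exact round trip

# "ñ" in NFC form is a single code point encoded as two bytes.
assert to_byte_ids("ñ") == [0xC3, 0xB1]

# The same letter in NFD form is "n" plus a combining tilde: three bytes.
nfd = unicodedata.normalize("NFD", "ñ")
assert len(to_byte_ids(nfd)) == 3

# An emoji outside the Basic Multilingual Plane takes four bytes.
assert len(to_byte_ids("🚀")) == 4
```

Because NFC and NFD produce different byte sequences for the same visible text, a tokenizer that skips normalization treats them as different inputs. The round trip still holds in both cases.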

Read order

  1. theory/00-motivation.md — why tokenization is the first irreversible compromise of any LM, and why the verb-grammar corpus drives the tokenizer's design.
  2. theory/01-character-word-subword.md — the three tokenizer families; why subword (BPE) wins for our bilingual verb corpus; what OOV would cost us.
  3. theory/02-bpe-algorithm.md — the BPE training and encoding algorithms, with a fully worked toy example over English verb sentences.
  4. theory/03-byte-level-and-unicode.md — bytes vs characters; UTF-8 of Spanish accents (á, ñ, ¿); NFC vs NFD; emoji; the GPT-2 leading-space convention.
  5. lab/00-toy-bpe-by-hand.md — five merges on a 5-sentence English-verb toy corpus, on paper, before code.
  6. lab/01-implement-bpe.md — write src/minitoken/bpe.py. Train + encode + decode + property-based round-trip tests.
  7. lab/02-bpe-on-verb-corpus.md — train on the bootstrap verb corpus (and later the Phase 12 corpus) at three vocab sizes; produce Zipf plot + top-30 merges with morphology annotations.
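As a preview of what the theory and lab chapters build toward, here is a minimal sketch of byte-level BPE training, encoding, and decoding. It is not the lab solution and does not reflect the final layout of src/minitoken/bpe.py. The function names (merge_pair, train_bpe, encode, decode), the tie-breaking rule (first-seen pair wins), and the early stop when no pair repeats are all assumptions made for illustration.

```python
from collections import Counter

Pair = tuple[int, int]


def merge_pair(ids: list[int], pair: Pair, new_id: int) -> list[int]:
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out: list[int] = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text: str, num_merges: int) -> list[Pair]:
    """Learn up to `num_merges` merges over the raw UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))
    merges: list[Pair] = []
    for step in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        # Assumption: ties go to the pair seen first in the sequence.
        best = max(pairs, key=pairs.get)
        if pairs[best] < 2:
            break  # assumption: stop once no adjacent pair repeats
        ids = merge_pair(ids, best, 256 + step)
        merges.append(best)
    return merges


def encode(text: str, merges: list[Pair]) -> list[int]:
    """Replay the learned merges in training order on new text."""
    ids = list(text.encode("utf-8"))
    for rank, pair in enumerate(merges):
        ids = merge_pair(ids, pair, 256 + rank)
    return ids


def decode(ids: list[int], merges: list[Pair]) -> str:
    """Expand every ID back to its byte string, then decode as UTF-8."""
    table = {i: bytes([i]) for i in range(256)}
    for rank, (a, b) in enumerate(merges):
        table[256 + rank] = table[a] + table[b]
    return b"".join(table[i] for i in ids).decode("utf-8")


corpus = "he walks. he walked. she walks. she walked. I will walk."
merges = train_bpe(corpus, num_merges=10)
ids = encode("she walked", merges)
assert decode(ids, merges) == "she walked"
```

Replaying merges one at a time costs one pass over the sequence per merge, which is slow but easy to verify by hand. A faster encoder is a separate design decision.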

solutions/ is empty during pre-write — populated at phase open.

Definition of Done

See PHASE_11_PLAN.md §6. Briefly:

  • src/minitoken/{bpe.py, vocab.py} mypy --strict clean, with train / encode / decode / save / load.
  • Property tests pass on 1k random UTF-8 inputs (hypothesis) covering ASCII, Spanish range, and emoji.
  • Trained vocab on the verb corpus exposes morphology in its top merges: -s, -ed, -ing, will, to, plus at least 3 whole-verb single tokens.
  • Zipf plot of token frequencies committed.
  • You can explain BPE training step-by-step with a 5-merge hand trace on the toy English verb corpus.
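For the Zipf item above, the data behind the plot is just a rank-frequency table over the encoded corpus. The sketch below assumes hypothetical helper names (rank_frequency, zipf_slope) and uses only the standard library. Plotting is left out, and the least-squares slope in log-log space is offered as one possible sanity check, not as a requirement of the phase.

```python
import math
from collections import Counter


def rank_frequency(token_ids: list[int]) -> list[tuple[int, int]]:
    """Return (rank, frequency) pairs, with rank 1 the most frequent token."""
    freqs = sorted(Counter(token_ids).values(), reverse=True)
    return list(enumerate(freqs, start=1))


def zipf_slope(table: list[tuple[int, int]]) -> float:
    """Least-squares slope of log(frequency) against log(rank)."""
    if len(table) < 2:
        raise ValueError("need at least two distinct tokens to fit a slope")
    xs = [math.log(rank) for rank, _ in table]
    ys = [math.log(freq) for _, freq in table]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic token stream whose frequencies follow f = 1200 / rank exactly.
token_ids = [0] * 1200 + [1] * 600 + [2] * 400 + [3] * 300
table = rank_frequency(token_ids)
slope = zipf_slope(table)
```

A slope near -1 on a log-log plot is the classic Zipf shape. A small bootstrap corpus may deviate from it noticeably.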

What this phase intentionally does NOT cover

  • WordPiece (BERT-style) — survey only in theory 02.
  • Unigram tokenizer (SentencePiece-style EM training) — survey only.
  • tiktoken or transformers.AutoTokenizer — explicitly forbidden by CLAUDE.md §0 hard rule 4 (no transformers lib before Phase 24). We build first, then read other people's code in Phase 24.
  • Pre-tokenization regex (the GPT-2 pat that splits on whitespace and punctuation before BPE). We operate on raw bytes. Discussed in theory but not implemented.
  • Tokenizer-level grammar awareness. BPE is statistical, not grammatical. It merges -s because it is frequent, not because it is a morpheme. The model learns grammar from Phase 13 onward (embeddings + attention).
  • The full Phase 12 corpus. Phase 11 uses a small bootstrap corpus (≈ 30 hand-typed verb sentences in data/raw/bootstrap-en.txt). The full 600-form corpus arrives in Phase 12 and the BPE is re-trained on it as the closing DoD item of Phase 12.

Phase 11's scope is the byte-level BPE tokenizer — the bridge between raw text and integer IDs. Nothing more.

Further reading

Optional — enrichment, not required to pass the phase.