English · Español
Lab 01 — Implement BPE in pure Python¶
Goal: write
src/minitoken/bpe.pyfrom scratch. Train + encode + decode + save/load + round-trip property tests.Estimated time: 4–6 hours, possibly across multiple sessions.
Prereq: lab 00 committed;
src/minitoken/BLUEPRINT.mdreviewed and refined per A12.
What you produce¶
src/minitoken/__init__.py— re-exportsBPETokenizer,Vocab.src/minitoken/bpe.py— the trainer + encoder + decoder.src/minitoken/vocab.py—Vocabdataclass + serialization helpers.tests/test_bpe.py— unit + property tests.tests/test_bpe.md(or a top-of-file docstring) — the test list written before the tests, per CLAUDE.md §1.
All four files must pass mypy --strict and ruff linting.
Test list (write this FIRST, per CLAUDE.md §1)¶
Before any implementation, write the list of tests you intend to write — as comments or in a markdown file. The list should include at minimum:
-
test_train_minimal_corpus— train on the lab-00 toy corpus, verify the first 5 merges match the hand trace. -
test_encode_decode_roundtrip_ascii— for 20 hand-picked ASCII strings,decode(encode(s)) == s. -
test_encode_decode_roundtrip_unicode— multibyte UTF-8 covering Spanish accents ("mañana","¿cómo estás?") and emoji ("🦊"). -
test_encode_decode_property(hypothesis) — 1000 random utf-8 strings (ASCII + Spanish range + emoji), round-trip. -
test_determinism— train twice with same seed → identical vocab + merges (byte-equal). -
test_tiebreak— fixture corpus with a known pair-count tie; verify lex-ascending tie-break is applied. -
test_save_load_roundtrip—BPETokenizer.load(t.save(path)).vocab == t.vocabbyte-identical. -
test_empty_string—encode("") == [],decode([]) == "". -
test_single_byte—encode("a")returns a length-1 list. -
test_special_tokens_atomic— special tokens are never merged across. -
test_vocab_size_target_reached— train withvocab_size=300produces exactly 300 entries (256 bytes + specials + merges). -
test_out_of_bounds_id—decode([99999])raises a clear error. -
test_mypy_strict— implicit (the CI step catches this).
TODOs¶
Block A — src/minitoken/vocab.py¶
- Define
Vocabas a frozen dataclass withid_to_bytes: tuple[bytes, ...],bytes_to_id: dict[bytes, int],id_to_display: tuple[str, ...]. - Add
Vocab.save(path)→ writesvocab.json(id ↔ display ↔ hex(bytes) for human inspection). - Add
Vocab.load(path)→ reverse. - Unit tests: save → load → equality.
Block B — src/minitoken/bpe.py: training¶
-
class BPETokenizerwith__init__(self). -
_reserve_specials_and_bytes(special_tokens)— populate IDs 0..|S|-1 + 0..255 of base bytes. -
_corpus_to_pretokens(corpus)— convert each string to a tuple ofbytes-singletons; deduplicate to a Counter. -
_count_pairs(pretokens)— returndict[(bytes, bytes), int]. -
_pick_winner(pair_counts)— implementmin(..., key=lambda p: (-count, p))(max count, then lexicographically smaller pair). Document the tie-break. -
_apply_merge(pretokens, winner)— return new pretokens dict with the pair replaced. -
train(corpus, vocab_size, special_tokens, verbose)— the main loop. - Log a progress bar if
verbose=True.
Block C — src/minitoken/bpe.py: encode/decode¶
-
encode(text: str) -> list[int]: - UTF-8 encode the text.
- Split into byte-singletons.
- For each merge in order, replace
(a, b)withab. - Map symbols to IDs.
-
decode(ids: list[int]) -> str: - Map IDs to bytes; concatenate.
- UTF-8 decode (errors='replace').
Block D — save/load¶
-
save(path: Path): vocab.json(viaVocab.save).merges.txt(one merge per line,bytes(a).hex() bytes(b).hex()).config.json({vocab_size, special_tokens, version}).-
load(path: Path): - Read all three files.
- Reconstruct
BPETokenizer.
Block E — tests (tests/test_bpe.py)¶
- Implement every test from the list above.
- Use
hypothesisfor the property test. - All tests pass.
-
pytest --cov src/minitoken≥ 90%.
Block F — sanity check on toy corpus¶
- At the bottom of
bpe.pyor inexperiments/11-toy-train/, runBPETokenizer().train(["I work"]*2 + ["I worked", "he works", "he worked"], vocab_size=261)(256 bytes + 5 merges). - Print the merges; verify they match your lab 00 hand-trace exactly.
Constraints¶
- Pure Python + NumPy only. No
tiktoken, notransformers, nosentencepiece— per CLAUDE.md §0 hard rule 4. mypy --strictclean.ruffclean.banditclean — noeval, nosubprocessshells withoutshell=False, etc.- Determinism enforced via the
tests/conftest.pyseed fixture. - No regex pre-tokenization in v1. Raw bytes only.
- No multibyte symbols in
merges.txtlines. Serialize as hex.
Stop conditions¶
Done when:
- All four files committed.
- All tests pass;
pytest -qis green. mypy --strict src/minitoken/clean.- The toy-corpus trace from lab 00 reproduces exactly via your trainer.
pytest -k "property" --hypothesis-seed=42passes on 1000 random utf-8 inputs.
Pitfalls¶
- Hash-iteration order changing your tie-break. Python 3.7+ has insertion-ordered dicts, but
max(dict, key=...)iterates in insertion order if multiple keys tie. Make sure the tie-break is on the value(count, pair)tuple, not relying on iteration order. - The merge loop is too slow. Naive recount-each-iteration is fine for our 300 KiB corpus but agonizing for testing if you bump vocab_size to 16k. Optimize: incremental pair-count updates (subtract counts of the merged pair's neighbors, add counts of the new pair's neighbors). Optional; OK to leave for the BLUEPRINT's "open question".
- Round-trip fails on
\xffbyte. Likely an encoding issue at save/load — make sure you serialize bytes as hex, not as a Pythonrepr. hypothesisfinds a counterexample. The most common: a Unicode surrogate pair (a U+D800–U+DFFF code point in isolation). Valid UTF-16 has these in pairs; valid UTF-8 doesn't allow them. Decision: filter them from the property-test strategy, or usehypothesis.strategies.text(alphabet=...)to restrict to valid UTF-8.merges.txtline format wrong. Common: spaces in the byte hex break the parser. Use a fixed separator like\t.
Hint of last resort¶
If 4 hours in and the round-trip property test fails: print the first 5 failures hypothesis finds. Read the bytes. The bug is almost always in encoding (the merge application step), not decoding.
When to consult solutions/¶
After all four files committed and tests green. Solution: solutions/01-implement-bpe-ref.md (phase open) contains a reference implementation with line-by-line annotations.
Next lab: lab/02-bpe-on-verb-corpus.md.