English · Español

Lab 01 — Implement BPE in pure Python¶

Goal: write src/minitoken/bpe.py from scratch. Train + encode + decode + save/load + round-trip property tests.

Estimated time: 4–6 hours, possibly across multiple sessions.

Prereq: lab 00 committed; src/minitoken/BLUEPRINT.md reviewed and refined per A12.

What you produce¶

src/minitoken/__init__.py — re-exports BPETokenizer, Vocab.
src/minitoken/bpe.py — the trainer + encoder + decoder.
src/minitoken/vocab.py — Vocab dataclass + serialization helpers.
tests/test_bpe.py — unit + property tests.
tests/test_bpe.md (or a top-of-file docstring) — the test list written before the tests, per CLAUDE.md §1.

All four files must pass mypy --strict and ruff linting.

Test list (write this FIRST, per CLAUDE.md §1)¶

Before any implementation, write the list of tests you intend to write — as comments or in a markdown file. The list should include at minimum:

TODOs¶

Block A — `src/minitoken/vocab.py`¶

Define Vocab as a frozen dataclass with id_to_bytes: tuple[bytes, ...], bytes_to_id: dict[bytes, int], id_to_display: tuple[str, ...].
Add Vocab.save(path) → writes vocab.json (id ↔ display ↔ hex(bytes) for human inspection).
Add Vocab.load(path) → reverse.
Unit tests: save → load → equality.

Block B — `src/minitoken/bpe.py`: training¶

class BPETokenizer with __init__(self).
_reserve_specials_and_bytes(special_tokens) — populate IDs 0..|S|-1 + 0..255 of base bytes.
_corpus_to_pretokens(corpus) — convert each string to a tuple of bytes-singletons; deduplicate to a Counter.
_count_pairs(pretokens) — return dict[(bytes, bytes), int].
_pick_winner(pair_counts) — implement min(..., key=lambda p: (-count, p)) (max count, then lexicographically smaller pair). Document the tie-break.
_apply_merge(pretokens, winner) — return new pretokens dict with the pair replaced.
train(corpus, vocab_size, special_tokens, verbose) — the main loop.
Log a progress bar if verbose=True.

Block C — `src/minitoken/bpe.py`: encode/decode¶

encode(text: str) -> list[int]:
UTF-8 encode the text.
Split into byte-singletons.
For each merge in order, replace (a, b) with ab.
Map symbols to IDs.
decode(ids: list[int]) -> str:
Map IDs to bytes; concatenate.
UTF-8 decode (errors='replace').

Block D — save/load¶

save(path: Path):
vocab.json (via Vocab.save).
merges.txt (one merge per line, bytes(a).hex() bytes(b).hex()).
config.json ({vocab_size, special_tokens, version}).
load(path: Path):
Read all three files.
Reconstruct BPETokenizer.

Block E — tests (`tests/test_bpe.py`)¶

Implement every test from the list above.
Use hypothesis for the property test.
All tests pass.
pytest --cov src/minitoken ≥ 90%.

Block F — sanity check on toy corpus¶

At the bottom of bpe.py or in experiments/11-toy-train/, run BPETokenizer().train(["I work"]*2 + ["I worked", "he works", "he worked"], vocab_size=261) (256 bytes + 5 merges).
Print the merges; verify they match your lab 00 hand-trace exactly.

Constraints¶

Pure Python + NumPy only. No tiktoken, no transformers, no sentencepiece — per CLAUDE.md §0 hard rule 4.
mypy --strict clean.
ruff clean.
bandit clean — no eval, no subprocess shells without shell=False, etc.
Determinism enforced via the tests/conftest.py seed fixture.
No regex pre-tokenization in v1. Raw bytes only.
No multibyte symbols in merges.txt lines. Serialize as hex.

Stop conditions¶

Done when:

All four files committed.
All tests pass; pytest -q is green.
mypy --strict src/minitoken/ clean.
The toy-corpus trace from lab 00 reproduces exactly via your trainer.
pytest -k "property" --hypothesis-seed=42 passes on 1000 random utf-8 inputs.

Pitfalls¶

Hash-iteration order changing your tie-break. Python 3.7+ has insertion-ordered dicts, but max(dict, key=...) iterates in insertion order if multiple keys tie. Make sure the tie-break is on the value (count, pair) tuple, not relying on iteration order.
The merge loop is too slow. Naive recount-each-iteration is fine for our 300 KiB corpus but agonizing for testing if you bump vocab_size to 16k. Optimize: incremental pair-count updates (subtract counts of the merged pair's neighbors, add counts of the new pair's neighbors). Optional; OK to leave for the BLUEPRINT's "open question".
Round-trip fails on \xff byte. Likely an encoding issue at save/load — make sure you serialize bytes as hex, not as a Python repr.
hypothesis finds a counterexample. The most common: a Unicode surrogate pair (a U+D800–U+DFFF code point in isolation). Valid UTF-16 has these in pairs; valid UTF-8 doesn't allow them. Decision: filter them from the property-test strategy, or use hypothesis.strategies.text(alphabet=...) to restrict to valid UTF-8.
merges.txt line format wrong. Common: spaces in the byte hex break the parser. Use a fixed separator like \t.

Hint of last resort¶

If 4 hours in and the round-trip property test fails: print the first 5 failures hypothesis finds. Read the bytes. The bug is almost always in encoding (the merge application step), not decoding.

When to consult `solutions/`¶

After all four files committed and tests green. Solution: solutions/01-implement-bpe-ref.md (phase open) contains a reference implementation with line-by-line annotations.

Next lab: lab/02-bpe-on-verb-corpus.md.