Skip to content

English · Español

Lab 01 — Implement BPE in pure Python

Goal: write src/minitoken/bpe.py from scratch. Train + encode + decode + save/load + round-trip property tests.

Estimated time: 4–6 hours, possibly across multiple sessions.

Prereq: lab 00 committed; src/minitoken/BLUEPRINT.md reviewed and refined per A12.


What you produce

  • src/minitoken/__init__.py — re-exports BPETokenizer, Vocab.
  • src/minitoken/bpe.py — the trainer + encoder + decoder.
  • src/minitoken/vocab.pyVocab dataclass + serialization helpers.
  • tests/test_bpe.py — unit + property tests.
  • tests/test_bpe.md (or a top-of-file docstring) — the test list written before the tests, per CLAUDE.md §1.

All four files must pass mypy --strict and ruff linting.

Test list (write this FIRST, per CLAUDE.md §1)

Before any implementation, write the list of tests you intend to write — as comments or in a markdown file. The list should include at minimum:

  • test_train_minimal_corpus — train on the lab-00 toy corpus, verify the first 5 merges match the hand trace.
  • test_encode_decode_roundtrip_ascii — for 20 hand-picked ASCII strings, decode(encode(s)) == s.
  • test_encode_decode_roundtrip_unicode — multibyte UTF-8 covering Spanish accents ("mañana", "¿cómo estás?") and emoji ("🦊").
  • test_encode_decode_property (hypothesis) — 1000 random utf-8 strings (ASCII + Spanish range + emoji), round-trip.
  • test_determinism — train twice with same seed → identical vocab + merges (byte-equal).
  • test_tiebreak — fixture corpus with a known pair-count tie; verify lex-ascending tie-break is applied.
  • test_save_load_roundtripBPETokenizer.load(t.save(path)).vocab == t.vocab byte-identical.
  • test_empty_stringencode("") == [], decode([]) == "".
  • test_single_byteencode("a") returns a length-1 list.
  • test_special_tokens_atomic — special tokens are never merged across.
  • test_vocab_size_target_reached — train with vocab_size=300 produces exactly 300 entries (256 bytes + specials + merges).
  • test_out_of_bounds_iddecode([99999]) raises a clear error.
  • test_mypy_strict — implicit (the CI step catches this).

TODOs

Block A — src/minitoken/vocab.py

  • Define Vocab as a frozen dataclass with id_to_bytes: tuple[bytes, ...], bytes_to_id: dict[bytes, int], id_to_display: tuple[str, ...].
  • Add Vocab.save(path) → writes vocab.json (id ↔ display ↔ hex(bytes) for human inspection).
  • Add Vocab.load(path) → reverse.
  • Unit tests: save → load → equality.

Block B — src/minitoken/bpe.py: training

  • class BPETokenizer with __init__(self).
  • _reserve_specials_and_bytes(special_tokens) — populate IDs 0..|S|-1 + 0..255 of base bytes.
  • _corpus_to_pretokens(corpus) — convert each string to a tuple of bytes-singletons; deduplicate to a Counter.
  • _count_pairs(pretokens) — return dict[(bytes, bytes), int].
  • _pick_winner(pair_counts) — implement min(..., key=lambda p: (-count, p)) (max count, then lexicographically smaller pair). Document the tie-break.
  • _apply_merge(pretokens, winner) — return new pretokens dict with the pair replaced.
  • train(corpus, vocab_size, special_tokens, verbose) — the main loop.
  • Log a progress bar if verbose=True.

Block C — src/minitoken/bpe.py: encode/decode

  • encode(text: str) -> list[int]:
  • UTF-8 encode the text.
  • Split into byte-singletons.
  • For each merge in order, replace (a, b) with ab.
  • Map symbols to IDs.
  • decode(ids: list[int]) -> str:
  • Map IDs to bytes; concatenate.
  • UTF-8 decode (errors='replace').

Block D — save/load

  • save(path: Path):
  • vocab.json (via Vocab.save).
  • merges.txt (one merge per line, bytes(a).hex() bytes(b).hex()).
  • config.json ({vocab_size, special_tokens, version}).
  • load(path: Path):
  • Read all three files.
  • Reconstruct BPETokenizer.

Block E — tests (tests/test_bpe.py)

  • Implement every test from the list above.
  • Use hypothesis for the property test.
  • All tests pass.
  • pytest --cov src/minitoken ≥ 90%.

Block F — sanity check on toy corpus

  • At the bottom of bpe.py or in experiments/11-toy-train/, run BPETokenizer().train(["I work"]*2 + ["I worked", "he works", "he worked"], vocab_size=261) (256 bytes + 5 merges).
  • Print the merges; verify they match your lab 00 hand-trace exactly.

Constraints

  • Pure Python + NumPy only. No tiktoken, no transformers, no sentencepiece — per CLAUDE.md §0 hard rule 4.
  • mypy --strict clean.
  • ruff clean.
  • bandit clean — no eval, no subprocess shells without shell=False, etc.
  • Determinism enforced via the tests/conftest.py seed fixture.
  • No regex pre-tokenization in v1. Raw bytes only.
  • No multibyte symbols in merges.txt lines. Serialize as hex.

Stop conditions

Done when:

  1. All four files committed.
  2. All tests pass; pytest -q is green.
  3. mypy --strict src/minitoken/ clean.
  4. The toy-corpus trace from lab 00 reproduces exactly via your trainer.
  5. pytest -k "property" --hypothesis-seed=42 passes on 1000 random utf-8 inputs.

Pitfalls

  • Hash-iteration order changing your tie-break. Python 3.7+ has insertion-ordered dicts, but max(dict, key=...) iterates in insertion order if multiple keys tie. Make sure the tie-break is on the value (count, pair) tuple, not relying on iteration order.
  • The merge loop is too slow. Naive recount-each-iteration is fine for our 300 KiB corpus but agonizing for testing if you bump vocab_size to 16k. Optimize: incremental pair-count updates (subtract counts of the merged pair's neighbors, add counts of the new pair's neighbors). Optional; OK to leave for the BLUEPRINT's "open question".
  • Round-trip fails on \xff byte. Likely an encoding issue at save/load — make sure you serialize bytes as hex, not as a Python repr.
  • hypothesis finds a counterexample. The most common: a Unicode surrogate pair (a U+D800–U+DFFF code point in isolation). Valid UTF-16 has these in pairs; valid UTF-8 doesn't allow them. Decision: filter them from the property-test strategy, or use hypothesis.strategies.text(alphabet=...) to restrict to valid UTF-8.
  • merges.txt line format wrong. Common: spaces in the byte hex break the parser. Use a fixed separator like \t.

Hint of last resort

If 4 hours in and the round-trip property test fails: print the first 5 failures hypothesis finds. Read the bytes. The bug is almost always in encoding (the merge application step), not decoding.

When to consult solutions/

After all four files committed and tests green. Solution: solutions/01-implement-bpe-ref.md (phase open) contains a reference implementation with line-by-line annotations.


Next lab: lab/02-bpe-on-verb-corpus.md.