Lab 02: Train BPE on the English-verb corpus
Goal: use the BPE trainer (lab 01) on the Phase 12 corpus. Three vocab sizes. Zipf plot. Top-30 merges sanity check with morphology annotations.
Estimated time: 90–120 minutes.
Prereq: lab 01 committed; Phase 12 corpus generated (forward-ref — see Phase 11 plan §7.a for the bootstrap path).
What you produce
A directory experiments/11-bpe-on-verb-corpus/ containing:
- train.py: orchestrates training at three vocab sizes.
- vocabs/{384,512,1024}/ (384 = 256 + 128): three trained vocabularies (each with vocab.json, merges.txt, config.json).
- zipf.png: log-log plot of token frequency rank vs frequency, for the 512 vocab.
- top_merges.png: bar chart of the top 30 merges' counts (512 vocab).
- top_merges.md: the top 30 merges as a readable list with morphology annotations (suffix / prefix / stem / function-word / punctuation / Spanish-specific).
- manifest.json.
- README.md.
The setup
Load the Phase 12 corpus (data/processed/train.jsonl — the text field of each row, which contains (English_sentence, Spanish_translation) pairs, one or both per row). Concatenate or feed line-by-line as input to BPETokenizer.train.
If Phase 12 isn't done yet, use the bootstrap corpus at data/raw/bootstrap-en.txt (30 hand-typed English-verb sentences with Spanish pairs) per Phase 11 plan §7.a. Re-run this lab at the close of Phase 12 with the full corpus.
Train three vocabs:
- vocab_size = 384 (256 bytes + 4 specials + 124 merges): small; expect base bytes + ASCII printable + a few morphology merges (-s, -ed).
- vocab_size = 512: canonical. The one we use in Phase 13+.
- vocab_size = 1024: large; expect a long tail of stem-level merges and Spanish-specific merges. Sanity check that the trainer scales.
For each, encode the full training corpus and count token frequencies. Save the counts.
TODOs
Block A: corpus loading
- Load data/processed/train.jsonl (or data/raw/bootstrap-en.txt for the bootstrap path).
- Extract the relevant text fields (English + Spanish, paired).
- Print total bytes; total sentences; mean/median sentence length in bytes.
- Print the proportion of non-ASCII bytes (bytes that belong to multi-byte UTF-8 sequences, a proxy for Spanish-specific characters such as á and ñ). For our corpus, expect ~3–6%.
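The Block A statistics could be computed along these lines. This is a minimal sketch, assuming each JSONL row is an object with a text field as described in the setup; the function name corpus_stats and the two demo rows are hypothetical, not part of the lab code.

```python
import json
import statistics
import unicodedata


def corpus_stats(lines):
    """Summarise an iterable of JSONL lines that each carry a "text" field."""
    sentences = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        text = json.loads(line)["text"]
        # NFC-normalise before measuring bytes, matching the Constraints section.
        sentences.append(unicodedata.normalize("NFC", text))

    lengths = [len(s.encode("utf-8")) for s in sentences]
    total_bytes = sum(lengths)
    # Every byte of a multi-byte UTF-8 sequence has its high bit set (>= 0x80).
    non_ascii = sum(b >= 0x80 for s in sentences for b in s.encode("utf-8"))
    return {
        "total_bytes": total_bytes,
        "total_sentences": len(sentences),
        "mean_len": statistics.mean(lengths),
        "median_len": statistics.median(lengths),
        "non_ascii_ratio": non_ascii / total_bytes,
    }


# Two made-up rows standing in for data/processed/train.jsonl.
rows = ['{"text": "He works."}', '{"text": "Él trabajó."}']
stats = corpus_stats(rows)
print(stats)
```

In the real script the lines would come from iterating over the opened JSONL file rather than from an in-memory list.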
Block B: three trainings
- For each vocab_size in [384, 512, 1024]:
  - Construct a fresh BPETokenizer.
  - Call .train(corpus, vocab_size, special_tokens=["<|pad|>", "<|endoftext|>", "<|unk|>", "<|sep|>"]).
  - Save to vocabs/<size>/.
  - Time the training.
- Print: training time per vocab size; final vocab size matches target.
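One way to structure the timing loop is sketched below. Only the .train(corpus, vocab_size, special_tokens=...) call is taken from this lab; StubTokenizer is a hypothetical stand-in so the sketch runs without the lab 01 code, and saving is left out because the save method of BPETokenizer is not specified here.

```python
import time

SPECIAL_TOKENS = ["<|pad|>", "<|endoftext|>", "<|unk|>", "<|sep|>"]


def time_trainings(make_tokenizer, corpus, vocab_sizes=(384, 512, 1024)):
    """Train a fresh tokenizer per vocab size; return {vocab_size: seconds}."""
    timings = {}
    for vocab_size in vocab_sizes:
        tokenizer = make_tokenizer()  # fresh instance, no state carried over
        start = time.perf_counter()
        tokenizer.train(corpus, vocab_size, special_tokens=SPECIAL_TOKENS)
        timings[vocab_size] = time.perf_counter() - start
    return timings


class StubTokenizer:
    """Hypothetical stand-in for the lab 01 BPETokenizer (does no real work)."""

    def train(self, corpus, vocab_size, special_tokens):
        self.vocab_size = vocab_size


timings = time_trainings(StubTokenizer, "he works. she worked.")
for size, seconds in timings.items():
    print(f"vocab_size={size}: {seconds:.4f}s")
```

Passing the class (or any zero-argument factory) rather than an instance is what guarantees a fresh tokenizer per vocab size.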
Block C: Zipf plot
- Pick the 512 vocab.
- Encode the full training corpus with it.
- Count occurrences of each token ID.
- Sort frequencies descending.
- Plot rank (x, log) vs frequency (y, log). A Zipf-like corpus shows a near-linear curve in log-log.
- Save zipf.png.
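The counting, sorting, and plotting steps above might look like the following sketch. The helper names are hypothetical, the toy ids are made up so that the counts fall exactly on a 1/rank curve, and save_zipf_plot assumes matplotlib is installed (it is imported lazily so the rest runs without it). The least-squares slope is included because Block E asks how close the curve is to slope −1.

```python
import math
from collections import Counter


def rank_frequency(token_ids):
    """Return token frequencies sorted descending (index 0 is rank 1)."""
    return sorted(Counter(token_ids).values(), reverse=True)


def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def save_zipf_plot(freqs, path="zipf.png"):
    """Write the log-log rank/frequency plot; needs matplotlib."""
    import matplotlib
    matplotlib.use("Agg")  # render to file, no display required
    import matplotlib.pyplot as plt

    ranks = range(1, len(freqs) + 1)
    fig, ax = plt.subplots()
    ax.loglog(ranks, freqs, marker=".")
    ax.set_xlabel("rank")
    ax.set_ylabel("frequency")
    fig.savefig(path)
    plt.close(fig)


# Toy ids whose counts are 120, 60, 40, 30, i.e. proportional to 1/rank.
ids = [0] * 120 + [1] * 60 + [2] * 40 + [3] * 30
freqs = rank_frequency(ids)
slope = loglog_slope(freqs)
print(freqs, round(slope, 3))
```

On the real corpus, ids would be the output of encoding the training text with the 512 vocab.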
Block D: top-30 merges with morphology annotations
- List the top 30 merges by training count (or by encoded-corpus token-frequency rank — pick one and document).
- For each, annotate the morphological role: suffix (e.g., -s, -ed, -ing), prefix (rare in English; un- if it appears), stem (e.g., work, play), function word (will, to, he, I, you), punctuation/whitespace (".", ",", "\n"), or Spanish-specific (trab, ió, ñ).
- Visual sanity check (DoD bar): does the top-30 contain at least these morphological wins?
  - -s (3rd-person singular present)
  - -ed (regular past)
  - will (simple future)
  - to (going to future + infinitive marker)
  - ing (present participle / going to)
  - At least 3 whole English verb stems as single tokens (work, play, walk, watch, etc.).
  - At least 2 Spanish-specific tokens (a Spanish stem fragment or a multi-byte accented character).
- If any expected morpheme is missing, flag in README — possibly need a larger bootstrap corpus or wait for Phase 12's full output.
- Save top_merges.md (markdown table with morphology column) and top_merges.png (bar chart).
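A possible shape for the top_merges.md table is sketched here. It assumes each merge is available as a pair of byte strings plus a count, which may not match how the lab 01 trainer stores merges; the function name and the two demo rows (including their counts) are made-up illustrations, not expected results.

```python
def merges_to_markdown(rows):
    """rows: (rank, left, right, count, role) tuples -> markdown table text."""
    lines = [
        "| Rank | Merge | Count | Role |",
        "| ---: | :--- | ---: | :--- |",
    ]
    for rank, left, right, count, role in rows:
        # Invalid UTF-8 fragments are shown as \xNN escapes instead of raising.
        merged = (left + right).decode("utf-8", errors="backslashreplace")
        # repr() keeps whitespace-only merges visible in the table.
        lines.append(f"| {rank} | {merged!r} | {count} | {role} |")
    return "\n".join(lines)


# Toy rows: the counts are placeholders, the roles are hand annotations.
rows = [
    (1, b"e", b"d", 42, "suffix"),
    (2, b"\xc3", b"\xb3", 17, "Spanish-specific"),
]
table = merges_to_markdown(rows)
print(table)
```

The role column is filled in by hand; the point of the exercise is the annotation, not automating it.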
Block E: interpret
In README.md:
- Did the morphological wins land? List each expected morpheme and the merge rank at which it appeared. The headline post-A13 sanity check.
- English vs Spanish merge balance. How many of the top 30 merges are English-specific, Spanish-specific, or shared (whitespace/punctuation)? Predict before counting; verify after.
- Does the Zipf plot look log-linear? Is the slope close to −1 (classic Zipf) or steeper / shallower? Short corpora often deviate from pure Zipf.
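- Training time at 1024 vocab vs 384 vocab. Does it scale roughly linearly in vocab_size as the naive complexity predicts? Compare against the merge count, not the raw vocab size: with 260 base entries (256 bytes + 4 specials), the two runs perform 764 and 124 merges, roughly a 6× gap.
- What changes between vocab 384 and 1024? Inspect the diff in top tokens. Do whole-word verbs (worked, works) emerge at 1024 that weren't single tokens at 384?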
Block F: manifest
Standard. Include all three vocab paths and their SHA256s.
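The SHA256 entries could be gathered as in the sketch below. The standard manifest schema is not reproduced in this lab, so the "vocabs" key and the helper names are assumptions; the demo hashes a throwaway directory containing one empty file rather than real vocab output.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 16):
    """Hash a file in chunks so large vocab files are never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def vocab_entries(vocab_root):
    """Map each file under vocab_root to its SHA256, keyed by relative path."""
    root = Path(vocab_root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


# Demo on a temporary directory standing in for vocabs/.
with tempfile.TemporaryDirectory() as tmp:
    vocab_dir = Path(tmp) / "512"
    vocab_dir.mkdir()
    (vocab_dir / "merges.txt").write_bytes(b"")
    entries = vocab_entries(tmp)

print(json.dumps({"vocabs": entries}, indent=2))
```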
Constraints
- Per CLAUDE.md §0 hard rule 5, every script calls seed_everything(seed) and writes the manifest with versions + seed + config.
- The corpus is read from Phase 12 output (or the bootstrap path during pre-A12-rerun).
- NFC normalize on input. Per theory 03, call unicodedata.normalize('NFC', s) before encoding bytes.
- No tiktoken for comparison in v1. That's a Phase 24 exercise.
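To see why the NFC constraint matters at the byte level, the short sketch below encodes the same visible word in both normalization forms. The helper name to_training_bytes is hypothetical; unicodedata.normalize is the standard-library call named above.

```python
import unicodedata

# The same visible word, built two ways.
nfc = unicodedata.normalize("NFC", "trabajo\u0301")  # o + combining acute
nfd = unicodedata.normalize("NFD", "trabaj\u00f3")   # precomposed ó

print(nfc.encode("utf-8"))  # b'trabaj\xc3\xb3'
print(nfd.encode("utf-8"))  # b'trabajo\xcc\x81'


def to_training_bytes(s):
    """Normalise to NFC, then encode: the byte view a byte-level BPE sees."""
    return unicodedata.normalize("NFC", s).encode("utf-8")
```

Routing both training text and later inputs through one such helper keeps the two byte views identical.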
Stop conditions
Done when:
- All seven files committed.
- Three vocabs trained; sizes verified.
- Zipf plot looks roughly log-linear (allowing for short-corpus deviation).
- At least 5 of the 6 morphological wins listed in Block D appear in vocab=512. Flag any missing ones.
- README answers all five Block E questions.
Pitfalls
- Bootstrap corpus too small. 30 sentences is small enough that some expected morphemes may not surface. If you see -s but not -ed, that's the corpus, not the trainer. Re-run after Phase 12 with the full 600-form corpus.
- Training is annoyingly slow at 1024 vocab. ~30 seconds to a few minutes is fine; hours means you have a bug in the merge-application step (probably an \(O(N^2)\) scan instead of \(O(N)\)).
- Some verbs get split. worked may tokenize as work + ed or as a single token at vocab 1024. Acceptable. Note in README.
- Top merges include weird whitespace. Things like " " (lone space) at rank 1 of the merges are normal: it's the most frequent byte. Document.
- Spanish merges rank high by count but render as \xC3\xB1 in plots. Use the id_to_display field of Vocab (per theory 03) to render properly.
- NFC mismatch. If the corpus was saved as NFD but input is NFC-normalized at encode time (or the reverse), an NFC á (one code point, U+00E1) won't tokenize the same as the NFD á seen in training (a followed by the combining acute U+0301). Both must use the same normalization. Document in config.json.
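For the slow-training pitfall, a single-pass merge application looks roughly like the sketch below. It is not the lab 01 implementation; the function name is hypothetical, and the id 260 assumes the 4 specials occupy ids 256 to 259 so that the first merge gets the next free id. The quadratic variants to watch for are deleting from the list in place or restarting the scan from index 0 after every replacement.

```python
def apply_merge(ids, pair, new_id):
    """Replace each non-overlapping occurrence of pair with new_id, left to right.

    One pass over ids, building a new list: O(N) per merge.
    """
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2  # skip both members of the merged pair
        else:
            out.append(ids[i])
            i += 1
    return out


ids = list(b"worked walked")
merged = apply_merge(ids, (ord("e"), ord("d")), 260)
print(len(ids), len(merged))  # 13 11
```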
Hint of last resort
If 90 minutes in and the top-30 merges don't contain -s or -ed: your trainer is broken (likely the pair-count update step) or the bootstrap corpus is too small/skewed. Verify the lab-00 toy reproduces first; that's the canonical sanity check.
When to consult solutions/
After all seven files. Solution: solutions/02-bpe-on-verb-corpus-ref.md (phase open). The reference contains:
- Expected training times per vocab size on Borja's i5-8250U.
- Expected top-30 merges (the canonical list for the bootstrap and full corpora).
- Discussion of which verbs remain split at each vocab size and why.
Phase 11 lab sequence complete. Next phase: docs/phase-12-corpus-design/.