Lab 02 — Train BPE on the English-verb corpus

Goal: use the BPE trainer (lab 01) on the Phase 12 corpus. Three vocab sizes. Zipf plot. Top-30 merges sanity check with morphology annotations.

Estimated time: 90–120 minutes.

Prereq: lab 01 committed; Phase 12 corpus generated (forward-ref — see Phase 11 plan §7.a for the bootstrap path).


What you produce

A directory experiments/11-bpe-on-verb-corpus/ containing:

  • train.py — orchestrates training at three vocab sizes.
  • vocabs/{384,512,1024}/ — three trained vocabularies (each with vocab.json, merges.txt, config.json).
  • zipf.png — log-log plot of token frequency rank vs frequency, for the 512 vocab.
  • top_merges.png — bar chart of the top 30 merges' counts (512 vocab).
  • top_merges.md — the top 30 merges as a readable list with morphology annotations (suffix / prefix / stem / function-word / punctuation / Spanish-specific).
  • manifest.json.
  • README.md.

The setup

Load the Phase 12 corpus from data/processed/train.jsonl. Use the text field of each row, which holds an (English_sentence, Spanish_translation) pair or just one side of it. Pass the text to BPETokenizer.train either concatenated or line by line.

If Phase 12 isn't done yet, use the bootstrap corpus at data/raw/bootstrap-en.txt (30 hand-typed English-verb sentences with Spanish pairs) per Phase 11 plan §7.a. Re-run this lab at the close of Phase 12 with the full corpus.

Train three vocabs:

  • vocab_size = 384 (256 bytes + 4 specials + 124 merges) — small; expect base bytes + ASCII printable + a few morphology merges (-s, -ed).
  • vocab_size = 512 — canonical. The one we use in Phase 13+.
  • vocab_size = 1024 — large; expect long tail of stem-level merges and Spanish-specific merges. Sanity check the trainer scales.
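
The budget arithmetic behind these three sizes can be checked with a short sketch. It assumes, as the 384 entry states, that every vocabulary reserves 256 byte tokens plus 4 special tokens; the name merge_budget is hypothetical and not part of lab 01.

```python
# Sketch: how many BPE merges each target vocab size leaves room for.
# ASSUMPTION (taken from the 384 case): 256 byte tokens + 4 specials are always reserved.
N_BYTES = 256
N_SPECIALS = 4


def merge_budget(vocab_size: int) -> int:
    """Number of merges the trainer must learn to reach vocab_size."""
    budget = vocab_size - N_BYTES - N_SPECIALS
    if budget < 0:
        raise ValueError(f"vocab_size {vocab_size} is smaller than the reserved tokens")
    return budget


for size in (384, 512, 1024):
    print(size, merge_budget(size))
```

Under that assumption the three runs learn 124, 252, and 764 merges respectively, which is the quantity a one-pass-per-merge trainer's runtime tracks.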

For each, encode the full training corpus and count token frequencies. Save the counts.

TODOs

Block A — corpus loading

  • Load data/processed/train.jsonl (or data/raw/bootstrap-en.txt for the bootstrap path).
  • Extract the relevant text fields (English + Spanish, paired).
  • Print total bytes; total sentences; mean/median sentence length in bytes.
  • Print the proportion of bytes that belong to non-ASCII (multi-byte UTF-8) sequences, a proxy for Spanish-specific characters such as ñ and accented vowels. For our corpus, expect ~3–6%.
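
The Block A checklist could be sketched roughly as follows. This is a minimal sketch, not the reference solution: it assumes the text field of each JSONL row is a plain string and that the bootstrap file holds one sentence per line; load_sentences and corpus_stats are hypothetical names.

```python
import json
import statistics
import unicodedata
from pathlib import Path


def load_sentences(path, text_field="text"):
    """Read sentences from a .jsonl file (text_field of each row) or a plain-text file.

    ASSUMPTION: the field is a plain string; adapt if rows store pairs differently.
    """
    path = Path(path)
    sentences = []
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            text = json.loads(line)[text_field] if path.suffix == ".jsonl" else line
            sentences.append(unicodedata.normalize("NFC", text))
    return sentences


def corpus_stats(sentences):
    """Byte-level statistics over a non-empty list of sentences."""
    if not sentences:
        raise ValueError("corpus is empty")
    encoded = [s.encode("utf-8") for s in sentences]
    lengths = [len(b) for b in encoded]
    total = sum(lengths)
    # Every byte of a multi-byte UTF-8 sequence has its high bit set (>= 0x80).
    non_ascii = sum(1 for b in encoded for byte in b if byte >= 0x80)
    return {
        "total_bytes": total,
        "sentences": len(sentences),
        "mean_bytes": statistics.mean(lengths),
        "median_bytes": statistics.median(lengths),
        "non_ascii_fraction": non_ascii / total,
    }
```

Calling corpus_stats(load_sentences("data/processed/train.jsonl")) and printing the resulting dictionary covers the three print items above.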

Block B — three trainings

  • For each vocab_size in [384, 512, 1024]:
      • Construct a fresh BPETokenizer.
      • Call .train(corpus, vocab_size, special_tokens=["<|pad|>", "<|endoftext|>", "<|unk|>", "<|sep|>"]).
      • Save to vocabs/<size>/.
      • Time the training.
  • Print: training time per vocab size; final vocab size matches target.
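
One way the Block B loop might look is sketched below. Only the .train(...) call shape comes from this lab; the .save(directory) method and the .vocab attribute are assumptions about the lab 01 class, so rename them to match your implementation. The tokenizer class is passed in as a factory so each run starts from a fresh instance.

```python
import time
from pathlib import Path

SPECIAL_TOKENS = ["<|pad|>", "<|endoftext|>", "<|unk|>", "<|sep|>"]


def train_all(make_tokenizer, corpus, sizes=(384, 512, 1024), out_root="vocabs"):
    """Train one fresh tokenizer per vocab size and report time and final size.

    make_tokenizer: zero-argument callable returning a new tokenizer (e.g. BPETokenizer).
    ASSUMPTION: the tokenizer has .save(directory) and a sized .vocab attribute.
    """
    results = {}
    for size in sizes:
        tok = make_tokenizer()  # fresh instance, no state shared between runs
        start = time.perf_counter()
        tok.train(corpus, size, special_tokens=SPECIAL_TOKENS)
        seconds = time.perf_counter() - start

        out_dir = Path(out_root) / str(size)
        out_dir.mkdir(parents=True, exist_ok=True)
        tok.save(out_dir)

        final_size = len(tok.vocab)
        results[size] = {"seconds": seconds, "final_size": final_size}
        print(f"vocab_size={size}: {seconds:.2f}s, final={final_size}, "
              f"matches_target={final_size == size}")
    return results
```

The sketch prints a mismatch instead of raising, because on the 30-sentence bootstrap corpus the trainer may run out of pairs before it reaches the larger targets.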

Block C — Zipf plot

  • Pick the 512 vocab.
  • Encode the full training corpus with it.
  • Count occurrences of each token ID.
  • Sort frequencies descending.
  • Plot rank (x, log) vs frequency (y, log). A Zipf-like corpus shows a near-linear curve in log-log.
  • Save zipf.png.
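
The counting and plotting steps above can be sketched as two small functions. This assumes the encoded corpus is available as a flat list of token IDs and that matplotlib is installed; both function names are hypothetical.

```python
from collections import Counter


def rank_frequency(token_ids):
    """Return (ranks, freqs): frequencies sorted descending, ranks starting at 1."""
    freqs = sorted(Counter(token_ids).values(), reverse=True)
    ranks = list(range(1, len(freqs) + 1))
    return ranks, freqs


def save_zipf_plot(ranks, freqs, out_path="zipf.png"):
    """Write a log-log rank/frequency scatter plot to out_path."""
    import matplotlib
    matplotlib.use("Agg")  # headless backend, no display required
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.loglog(ranks, freqs, marker=".", linestyle="none")
    ax.set_xlabel("rank")
    ax.set_ylabel("frequency")
    ax.set_title("Token rank vs frequency (vocab 512)")
    fig.savefig(out_path, dpi=150)
    plt.close(fig)
```

Tokens that never occur in the encoded corpus are simply absent from the counts, which is what a log-log plot needs since zero has no logarithm.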

Block D — top-30 merges with morphology annotations

  • List the top 30 merges by training count (or by encoded-corpus token-frequency rank — pick one and document).
  • For each, annotate the morphological role: suffix (e.g., -s, -ed, -ing), prefix (rare in English; un- if it appears), stem (e.g., work, play), function word (will, to, he, I, you), punctuation/whitespace (., ,, \n), or Spanish-specific (e.g., trab, ñ).
  • Visual sanity check (DoD bar): does the top-30 contain at least these morphological wins?
      • -s (3rd-person singular present)
      • -ed (regular past)
      • will (simple future)
      • to (going to future + infinitive marker)
      • -ing (present participle / going to)
      • At least 3 whole English verb stems as single tokens (work, play, walk, watch, etc.).
      • At least 2 Spanish-specific tokens (a Spanish stem fragment or a multi-byte accented character).
  • If any expected morpheme is missing, flag in README — possibly need a larger bootstrap corpus or wait for Phase 12's full output.
  • Save top_merges.md (markdown table with morphology column) and top_merges.png (bar chart).
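
A first-pass annotator for the morphology column might look like the sketch below. The category sets are seeded only with the examples named in this block, so the function is a starting heuristic rather than a classifier: ASCII Spanish fragments such as trab come out as "unclassified" and need a manual pass. All names here are hypothetical.

```python
# Seed lists taken from the examples in Block D; extend them for your corpus.
SUFFIXES = {"s", "ed", "ing"}
PREFIXES = {"un"}
FUNCTION_WORDS = {"will", "to", "he", "i", "you"}
STEMS = {"work", "play", "walk", "watch"}


def annotate(token: str) -> str:
    """Guess the morphological role of a merged token (heuristic, review by hand)."""
    core = token.strip()
    if core == "":
        return "punctuation/whitespace"
    if any(ord(ch) > 127 for ch in core):
        return "Spanish-specific"  # non-ASCII character, e.g. an accented vowel
    if not any(ch.isalnum() for ch in core):
        return "punctuation/whitespace"
    low = core.lower()
    if low in FUNCTION_WORDS:
        return "function word"
    if low in STEMS:
        return "stem"
    if low in SUFFIXES:
        return "suffix"
    if low in PREFIXES:
        return "prefix"
    return "unclassified"


def merges_table(merges):
    """Markdown table from (token_text, count) pairs already sorted by count."""
    lines = ["| rank | merge | count | morphology |", "|---|---|---|---|"]
    for rank, (token, count) in enumerate(merges, start=1):
        shown = repr(token).replace("|", "\\|")  # repr keeps spaces and \n visible
        lines.append(f"| {rank} | {shown} | {count} | {annotate(token)} |")
    return "\n".join(lines)
```

Writing merges_table(top_30) to top_merges.md gives the table with its morphology column; the "unclassified" rows are the ones to annotate manually.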

Block E — interpret

In README.md:

  1. Did the morphological wins land? List each expected morpheme and the merge rank at which it appeared. The headline post-A13 sanity check.
  2. English vs Spanish merge balance. How many of the top 30 merges are English-specific, Spanish-specific, or shared (whitespace/punctuation)? Predict before counting; verify after.
  3. Does the Zipf plot look log-linear? Is the slope close to −1 (classic Zipf) or steeper / shallower? Short corpora often deviate from pure Zipf.
  4. Training time at 1024 vocab vs 384 vocab. Does it scale roughly linearly in the number of merges (764 vs 124, given 256 bytes + 4 specials), as the naive one-pass-per-merge complexity predicts?
  5. What changes between vocab 384 and 1024? Inspect the diff in top tokens. Do whole-word verbs (worked, works) emerge at 1024 that weren't single tokens at 384?
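
For question 3, the slope can be estimated rather than eyeballed. The sketch below fits an ordinary least-squares line to log(frequency) against log(rank) using only the standard library; it is one reasonable estimator, not a method this lab prescribes, and loglog_slope is a hypothetical name.

```python
import math


def loglog_slope(freqs):
    """Least-squares slope of log(freq) vs log(rank).

    freqs: positive frequencies sorted descending; rank 1 is the most frequent token.
    """
    n = len(freqs)
    if n < 2:
        raise ValueError("need at least two frequencies to fit a slope")
    xs = [math.log(rank) for rank in range(1, n + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

A value near −1 matches classic Zipf. On a short corpus the low-frequency tail (many tokens with the same small count) tends to dominate the fit, so it can help to report the slope over the top ranks only.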

Block F — manifest

Standard. Include all three vocab paths and their SHA256s.
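
A minimal sketch of the hashing part follows. The manifest layout shown here is an assumption, since the repo's standard manifest schema is defined elsewhere; only the SHA256-per-vocab-file requirement comes from this block.

```python
import hashlib
import platform
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Hex SHA256 of a file, read in chunks so large files are not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(vocab_root, seed, config):
    """Collect hashes for every file under vocab_root (hypothetical schema)."""
    vocab_root = Path(vocab_root)
    files = {
        path.relative_to(vocab_root).as_posix(): sha256_file(path)
        for path in sorted(vocab_root.rglob("*"))
        if path.is_file()
    }
    return {
        "seed": seed,
        "config": config,
        "versions": {"python": platform.python_version()},
        "vocab_files": files,
    }
```

Dumping the result with json.dump(manifest, fh, indent=2, sort_keys=True) keeps the file stable across runs, which makes diffs between the bootstrap run and the Phase 12 re-run easier to read.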

Constraints

  • Per CLAUDE.md §0 hard rule 5, every script calls seed_everything(seed) and writes the manifest with versions + seed + config.
  • The corpus is read from Phase 12 output (or the bootstrap path during pre-A12-rerun).
  • NFC normalize on input. Per theory 03, call unicodedata.normalize('NFC', s) before encoding bytes.
  • No tiktoken for comparison in v1. That's a Phase 24 exercise.
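
The NFC constraint is easiest to see on a single character. The sketch below uses only the standard-library call named above; to_training_bytes is a hypothetical helper name.

```python
import unicodedata


def to_training_bytes(text: str) -> bytes:
    """NFC-normalize first, then encode to UTF-8 bytes."""
    return unicodedata.normalize("NFC", text).encode("utf-8")


nfc = "\u00e1"    # á as one precomposed code point
nfd = "a\u0301"   # a followed by a combining acute accent

# Without normalization the two spellings are different byte sequences.
print(nfc.encode("utf-8"), nfd.encode("utf-8"))

# After NFC normalization they are identical, so BPE sees the same input.
assert to_training_bytes(nfc) == to_training_bytes(nfd)
```

The precomposed form is two bytes and the decomposed form is three, so a corpus that mixes both would teach the trainer two unrelated sets of merges for the same letter.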

Stop conditions

Done when:

  1. All seven files committed.
  2. Three vocabs trained; sizes verified.
  3. Zipf plot looks roughly log-linear (allowing for short-corpus deviation).
  4. At least 5 of the 7 sanity-check items listed in Block D appear in vocab=512. Flag any missing ones.
  5. README answers all five Block E questions.

Pitfalls

  • Bootstrap corpus too small. 30 sentences is small enough that some expected morphemes may not surface. If you see -s but not -ed, that's the corpus, not the trainer. Re-run after Phase 12 with the full 600-form corpus.
  • Training is annoyingly slow at 1024 vocab. Anywhere from about 30 seconds to a few minutes is fine; hours means you have a bug in the merge-application step (probably an \(O(N^2)\) scan per merge where a single \(O(N)\) pass would do).
  • Some verbs get split. worked may tokenize as work + ed or as a single token at vocab 1024. Acceptable. Note in README.
  • Top merges include weird whitespace. Merges involving a lone space at rank 1 are normal: the space is the most frequent byte. Document.
  • Spanish merges rank high by count but render as \xC3\xB1 in plots. Use the id_to_display field of Vocab (per theory 03) to render them properly.
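  • NFC mismatch. If the corpus was saved as NFD but the tokenizer normalizes its input to NFC (or the other way round), the same á becomes different byte sequences at training time and at encoding time, and it will not tokenize the same way. Both must use the same normalization. Document the chosen form in config.json.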
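
For the plot-label pitfall, a fallback helper along these lines can turn raw token bytes into readable labels when id_to_display is not yet populated. It is a sketch under that assumption; display_label is a hypothetical name, and the visible-space glyph is an arbitrary choice.

```python
def display_label(token_bytes: bytes) -> str:
    """Readable label: decoded UTF-8 where valid, hex escapes for stray bytes."""
    # backslashreplace keeps partial multi-byte sequences visible instead of raising.
    text = token_bytes.decode("utf-8", errors="backslashreplace")
    # Make whitespace visible so a leading space is not lost in a tick label.
    return text.replace(" ", "\u2423").replace("\n", "\\n")
```

A complete two-byte sequence decodes to the character itself, while a token that holds only the first byte of such a sequence stays visible as an escape, which is useful when a merge has split a multi-byte character.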

Hint of last resort

If 90 minutes in and the top-30 merges don't contain -s or -ed: your trainer is broken (likely the pair-count update step) or the bootstrap corpus is too small/skewed. Verify the lab-00 toy reproduces first; that's the canonical sanity check.
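
When debugging the trainer, it can help to compare it against a deliberately simple reference. The sketch below recounts pairs from scratch and applies one merge in a single left-to-right pass; it is not the lab 01 implementation, and both function names are hypothetical. If an incremental pair-count update disagrees with count_pairs after a merge, the update step is the likely culprit.

```python
from collections import Counter


def count_pairs(ids):
    """Count adjacent token-ID pairs (overlapping pairs are each counted)."""
    return Counter(zip(ids, ids[1:]))


def apply_merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of pair with new_id in one O(N) pass."""
    out = []
    i = 0
    n = len(ids)
    while i < n:
        if i + 1 < n and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            out.append(new_id)
            i += 2  # skip both members of the merged pair
        else:
            out.append(ids[i])
            i += 1
    return out
```

Building a new list avoids repeated in-place deletions, each of which shifts the rest of the list and is a common source of the quadratic slowdown mentioned under Pitfalls.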

When to consult solutions/

After all seven files. Solution: solutions/02-bpe-on-verb-corpus-ref.md (phase open). The reference contains:

  • Expected training times per vocab size on Borja's i5-8250U.
  • Expected top-30 merges (the canonical list for the bootstrap and full corpora).
  • Discussion of which verbs remain split at each vocab size and why.

Phase 11 lab sequence complete. Next phase: docs/phase-12-corpus-design/.