Skip to content

English · Español

Lab 00 — BPE by hand on a toy English-verb corpus

Goal: trace the BPE algorithm on paper before writing code. Five merges on a 5-sentence corpus drawn from the §A13 verb-grammar topic, with the pair-count table at each step.

Estimated time: 30–45 minutes.

Prereq: theory/02-bpe-algorithm.md read.


What you produce

A single committed file at experiments/11-bpe-by-hand/trace.md containing:

  • The starting corpus and pre-token-with-counts.
  • The pair-count table at each iteration.
  • The chosen merge winner at each iteration (with tie-breaking shown explicitly).
  • The vocabulary after each merge.
  • A final reflection (1 short paragraph): what the trace taught you that the algorithm description in theory 02 didn't.

This is paper-and-pencil work. No code. No Python. Markdown tables are fine.

The corpus

"I work"   × 2
"I worked" × 1
"he works" × 1
"he worked" × 1

(Same as theory 02's worked example. Use this so you can compare your trace against the reference.)

Initial pre-tokens (treating each character as a byte for the toy; the literal space is shown as _):

("I","_","w","o","r","k")             : 2
("I","_","w","o","r","k","e","d")     : 1
("h","e","_","w","o","r","k","s")     : 1
("h","e","_","w","o","r","k","e","d") : 1

The interesting morphology to watch for: -s (3rd-person singular works), -ed (regular past worked), and the shared stem work.

TODOs

Block A — count initial pairs

  • List every adjacent pair across the pre-tokens.
  • Sum counts (multiply by the pre-token's count).
  • Make a table. Sort by count descending; within ties, lexicographic ascending on the pair tuple.

Block B — pick the winner

  • Use the rule: highest count; lexicographic ascending tie-break.
  • Justify the choice in one sentence (e.g., "tied at 5 with ('o','r'), ('r','k'), ('w','o'); ('o','r') < ('r','k') < ('w','o') lex, so ('o','r') wins").

Block C — apply the merge

  • Write out the pre-tokens after the merge.
  • Add the new symbol to the vocab.
  • Update the merges list.

Block D — repeat 4 more times

  • Re-count pairs (with the new symbol).
  • Pick winner.
  • Apply.
  • Repeat until you have 5 merges total.

Block E — predict what comes next

Before reflecting, predict the next 3 merges (merges 6, 7, 8) you would do if you kept going. Write your guesses in trace.md. Then verify by running 3 more iterations on paper. This is a meta-exercise: BPE's order of merges is mechanically predictable from the pair counts; if you can predict it, you understand the algorithm.

Block F — reflect

  • One paragraph. Suggested prompts:
  • Did the morphology surface? (i.e., did work end up as a single symbol? Did -ed show up as a pair?)
  • Was tie-breaking ever ambiguous? How did the lexicographic rule resolve it?
  • What's the smallest change to the corpus that would produce a different sequence of merges?
  • If you re-ran the same trace with the corpus in a different order, would the answer be the same? Why?

Constraints

  • No code. Markdown tables and prose only.
  • Show all your work. Skipping the pair-count table for a step is forbidden — that's the step where bugs live.
  • Use one vocab table per merge step. A merge is meaningful only relative to a specific intermediate vocab; show the vocab evolving.

Stop conditions

You're done when:

  1. trace.md exists with all 5 merge iterations fully written out.
  2. Each iteration has a pair-count table.
  3. The tie-breaking rule is invoked at least once and shown explicitly.
  4. The Block E prediction is documented (and verified or revised).
  5. The final reflection paragraph addresses at least one of the prompts.

Pitfalls

  • Forgetting that merges combine into the new symbol for subsequent pair counts. After the first merge ("o","r") → "or", the next iteration's pair counting should see ("w","or") as a pair, not ("w","o","r") as three separate elements.
  • Counting only unique pre-tokens. Multiply by the count. "I work" × 2 contributes ("I","_") with count 2.
  • Tie-breaking direction wrong. Lexicographic ascending — smaller pair wins. The opposite convention exists in some tutorials.
  • Not actually doing 5 merges. Three or four leaves the vocab too small to illustrate. Push to 5.
  • Treating _ (space) as nothing. It's a byte. Count it like any other.

When to consult solutions/

After trace.md is committed. Solution: solutions/00-toy-bpe-by-hand-ref.md (phase open) contains the canonical 5-merge trace for this corpus and a side-by-side comparison of what should emerge as merge candidates if you extended the trace to 20 iterations.


Next lab: lab/01-implement-bpe.md.