English · Español

Lab 00 — BPE by hand on a toy English-verb corpus¶

Goal: trace the BPE algorithm on paper before writing code. Five merges on a 5-sentence corpus drawn from the §A13 verb-grammar topic, with the pair-count table at each step.

Estimated time: 30–45 minutes.

Prereq: theory/02-bpe-algorithm.md read.

What you produce¶

A single committed file at experiments/11-bpe-by-hand/trace.md containing:

The starting corpus and pre-token-with-counts.
The pair-count table at each iteration.
The chosen merge winner at each iteration (with tie-breaking shown explicitly).
The vocabulary after each merge.
A final reflection (1 short paragraph): what the trace taught you that the algorithm description in theory 02 didn't.

This is paper-and-pencil work. No code. No Python. Markdown tables are fine.

The corpus¶

"I work"   × 2
"I worked" × 1
"he works" × 1
"he worked" × 1

(Same as theory 02's worked example. Use this so you can compare your trace against the reference.)

Initial pre-tokens (treating each character as a byte for the toy; the literal space is shown as _):

("I","_","w","o","r","k")             : 2
("I","_","w","o","r","k","e","d")     : 1
("h","e","_","w","o","r","k","s")     : 1
("h","e","_","w","o","r","k","e","d") : 1

The interesting morphology to watch for: -s (3^rd-person singular works), -ed (regular past worked), and the shared stem work.

TODOs¶

Block A — count initial pairs¶

List every adjacent pair across the pre-tokens.
Sum counts (multiply by the pre-token's count).
Make a table. Sort by count descending; within ties, lexicographic ascending on the pair tuple.

Block B — pick the winner¶

Use the rule: highest count; lexicographic ascending tie-break.
Justify the choice in one sentence (e.g., "tied at 5 with ('o','r'), ('r','k'), ('w','o'); ('o','r') < ('r','k') < ('w','o') lex, so ('o','r') wins").

Block C — apply the merge¶

Write out the pre-tokens after the merge.
Add the new symbol to the vocab.
Update the merges list.

Block D — repeat 4 more times¶

Re-count pairs (with the new symbol).
Pick winner.
Apply.
Repeat until you have 5 merges total.

Block E — predict what comes next¶

Before reflecting, predict the next 3 merges (merges 6, 7, 8) you would do if you kept going. Write your guesses in trace.md. Then verify by running 3 more iterations on paper. This is a meta-exercise: BPE's order of merges is mechanically predictable from the pair counts; if you can predict it, you understand the algorithm.

Block F — reflect¶

One paragraph. Suggested prompts:
Did the morphology surface? (i.e., did work end up as a single symbol? Did -ed show up as a pair?)
Was tie-breaking ever ambiguous? How did the lexicographic rule resolve it?
What's the smallest change to the corpus that would produce a different sequence of merges?
If you re-ran the same trace with the corpus in a different order, would the answer be the same? Why?

Constraints¶

No code. Markdown tables and prose only.
Show all your work. Skipping the pair-count table for a step is forbidden — that's the step where bugs live.
Use one vocab table per merge step. A merge is meaningful only relative to a specific intermediate vocab; show the vocab evolving.

Stop conditions¶

You're done when:

trace.md exists with all 5 merge iterations fully written out.
Each iteration has a pair-count table.
The tie-breaking rule is invoked at least once and shown explicitly.
The Block E prediction is documented (and verified or revised).
The final reflection paragraph addresses at least one of the prompts.

Pitfalls¶

Forgetting that merges combine into the new symbol for subsequent pair counts. After the first merge ("o","r") → "or", the next iteration's pair counting should see ("w","or") as a pair, not ("w","o","r") as three separate elements.
Counting only unique pre-tokens. Multiply by the count. "I work" × 2 contributes ("I","_") with count 2.
Tie-breaking direction wrong. Lexicographic ascending — smaller pair wins. The opposite convention exists in some tutorials.
Not actually doing 5 merges. Three or four leaves the vocab too small to illustrate. Push to 5.
Treating _ (space) as nothing. It's a byte. Count it like any other.

When to consult `solutions/`¶

After trace.md is committed. Solution: solutions/00-toy-bpe-by-hand-ref.md (phase open) contains the canonical 5-merge trace for this corpus and a side-by-side comparison of what should emerge as merge candidates if you extended the trace to 20 iterations.

Next lab: lab/01-implement-bpe.md.