English · Español

02 — The BPE Algorithm: training and encoding, fully worked¶

🇪🇸 BPE entrena así: cuenta los pares adyacentes más frecuentes, fusiona el ganador en un símbolo nuevo, repite hasta llegar al tamaño del vocabulario. La codificación re-aplica esas fusiones, en orden, sobre nuevos textos. Sin trucos. La parte difícil es la contabilidad y el desempate.

Training: the algorithm¶

Input: a corpus C (list of strings); a target vocab size V; optional reserved special tokens S.

Output: a vocab vocab (list of token-id → bytes mapping); an ordered list of merges merges.

Step 0 — initialize.

Reserve IDs 0..|S|-1 for special tokens.
Add the 256 single bytes (IDs |S|..|S|+255). The base alphabet of byte-level BPE is the byte values, every one of them, always.
Split each string in \(C\) into a list of single bytes. So "hi" becomes [b'h', b'i'] becomes [(b'h',), (b'i',)] — a tuple per pre-token. (We use tuples of bytes so they're hashable for the pair-count dict.)

Step 1 — count pairs.

For each "pre-token" (a tuple of byte-symbols), look at all adjacent pairs. Count them across the corpus.

for pretoken_tuple, count in corpus_with_counts:
    for i in range(len(pretoken_tuple) - 1):
        pair = (pretoken_tuple[i], pretoken_tuple[i+1])
        pair_counts[pair] += count

(corpus_with_counts is the corpus with duplicate-pretoken consolidation — the same pre-token appearing \(k\) times gets count \(k\), not enumerated \(k\) times.)

Step 2 — pick the winner.

winner = min(pair_counts, key=lambda p: (-pair_counts[p], p))

The (-count, pair) tuple-key is the tie-breaking rule: among pairs with the maximum count, ties go to the lexicographically smaller pair (for determinism). The negation flips min into "max by count, min by pair". Document this.

Step 3 — merge the winner.

For each pre-token tuple containing the winning pair as adjacent elements, replace those two elements with one new combined symbol.

new_symbol = winner[0] + winner[1]   # bytes concatenation
for pretoken in corpus:
    pretoken = replace_pair(pretoken, winner, new_symbol)

Add new_symbol to the vocab (with the next available ID). Record the merge: merges.append(winner).

Step 4 — repeat steps 1–3.

Until len(vocab) == V. That's it.

The toy example, in full — five English verb sentences¶

Corpus (5 strings drawn from the §A13 topic):

"I work"
"I work"
"I worked"
"he works"
"he worked"

For brevity, we collapse leading-space normalization and ignore the space before the line break. We treat the pre-tokens as the byte sequences below (_ denotes a literal space byte 0x20):

Pre-token-with-counts:

("I","_","w","o","r","k")          : 2
("I","_","w","o","r","k","e","d")  : 1
("h","e","_","w","o","r","k","s")  : 1
("h","e","_","w","o","r","k","e","d"): 1

Iteration 1 — count pairs:

("w","o"): 5    ("o","r"): 5    ("r","k"): 5
("I","_"): 3    ("_","w"): 3
("h","e"): 2    ("e","_"): 2
("k","e"): 2    ("e","d"): 2
("k","s"): 1

Winner: ties at count 5 between ("w","o"), ("o","r"), ("r","k"). Lexicographic order on the pair tuple picks ("o","r") (o > w but the first element wins; actually o < w lexicographically, so ("o","r") < ("r","k") < ("w","o") — winner is ("o","r")).

Merge to "or".

Pre-tokens after merge:

("I","_","w","or","k")            : 2
("I","_","w","or","k","e","d")    : 1
("h","e","_","w","or","k","s")    : 1
("h","e","_","w","or","k","e","d"): 1

merges = [("o","r")]. Vocab now: specials + 256 bytes + "or".

Iteration 2 — count pairs:

("w","or"): 5   ("or","k"): 5
("I","_"): 3    ("_","w"): 3
("h","e"): 2    ("e","_"): 2
("k","e"): 2    ("e","d"): 2
("k","s"): 1

Winner: tie at 5; lex picks ("or","k") (since or > _ > I but or < w; among the 5-count pairs ("or","k") < ("w","or")). Merge to "ork".

Pre-tokens:

("I","_","w","ork")            : 2
("I","_","w","ork","e","d")    : 1
("h","e","_","w","ork","s")    : 1
("h","e","_","w","ork","e","d"): 1

merges = [("o","r"), ("or","k")].

Iteration 3 —

("w","ork"): 5
("I","_"): 3    ("_","w"): 3
("h","e"): 2    ("e","_"): 2
("ork","e"): 2  ("e","d"): 2
("ork","s"): 1

Winner: ("w","ork") at 5. Merge to "work".

Pre-tokens:

("I","_","work")            : 2
("I","_","work","e","d")    : 1
("h","e","_","work","s")    : 1
("h","e","_","work","e","d"): 1

merges = [("o","r"), ("or","k"), ("w","ork")].

…and so on. After a few more iterations you'd see work (with leading space) emerge as a single token, then worked (the past form merge of work + ed), then works (the 3^rd-person merge). These are the morphological wins — observable from the merge log.

This is the entire algorithm. Lab 00 has Borja do exactly this on a fresh toy corpus.

Encoding: applying the merges¶

Once trained, BPE encodes new strings by applying the merges in training order, greedily.

Algorithm:

Start with the byte-tuple of the input string.
For each merge rule (a, b) → ab in merges (in training order):
Scan the byte-tuple; replace every occurrence of a immediately followed by b with ab.
Output the IDs of the resulting tuple.

Crucial: the order matters. Applying ("w","ork") before ("o","r") would never trigger — there's no "ork" symbol in the byte stream of a new input until the ("o","r") merge runs first.

Faster encoding (production version): instead of scanning left-to-right for each merge, build a priority queue over candidate merges in the current sequence. At each step, apply the merge with the lowest training-rank. This avoids the linear scan and is \(O(L \log L)\) per input rather than \(O(L \times |\text{merges}|)\). We implement the simple version.

Decoding: trivially¶

decode(ids) = b''.join(id_to_bytes[id] for id in ids).decode('utf-8')

Map each ID back to its byte sequence, concatenate, decode UTF-8.

The round-trip decode(encode(s)) == s holds for any valid UTF-8 input if implemented correctly. Property-test this.

Edge cases that the algorithm above handles¶

Empty string: encode("") == [], decode([]) == "". Trivial.
Single character: encode("a") produces one ID (the byte 97 → token ID). No merges apply (no adjacent pairs in a length-1 tuple).
Multibyte UTF-8: encode("mañana") starts as [m, a, 0xC3, 0xB1, a, n, a] (7 bytes; ñ is 2 bytes 0xC3 0xB1). If mañana (or any of its subsequences) appears often in training, BPE may merge 0xC3 + 0xB1 into a ñ symbol, then a + ñ → añ, etc.
Emoji: encode("🦊") starts as 4 byte-singletons (UTF-8 of 🦊 is 0xF0 0x9F 0xA6 0x8A). Unless emojis are frequent in training (they are not in our verb corpus), no merges apply — the emoji decodes back exactly as 4 raw bytes.
Control characters, BOM: same treatment — they're just bytes.
Strings with multiple words: without pre-tokenization (our v1 choice), spaces are just byte 0x20 in the sequence. BPE will or won't merge them with neighbors based on frequency.

Edge cases that the algorithm above doesn't handle (and how to handle them)¶

Special tokens (<|endoftext|>, <|pad|>, <|sep|>): the trainer must not merge them or merge across them. Implementation: reserve their IDs at step 0, mark them as "atomic" — never participate in pair counting. In the encoder, pre-split the input on special-token boundaries before BPE, then re-stitch.
Very large corpora: the naive trainer is \(O(V \times N)\). For our small bootstrap corpus (~30 sentences ≈ 1 KiB) and \(V = 512\), that's V × N ≈ 5 × 10⁵ pair-count operations. On Borja's i5-8250U this is sub-second in Python. The full Phase 12 corpus (~600 forms ≈ 30 KiB) is V × N ≈ 1.5 × 10⁷ ops — still sub-second.

Complexity, more carefully¶

Op	Naive	Priority-queue
Training	\(O(V \times N)\) where \(N\) = corpus bytes	\(O(V \log N + N)\)
Encoding (per input of length \(L\))	\(O(L \times \\|\text{merges}\\|)\)	\(O(L \log L)\)
Decoding	\(O(L)\)	\(O(L)\)
Memory	\(O(V + N)\)	\(O(V + N)\)

We implement naive. Note priority-queue version exists.

Tie-breaking — the under-discussed detail¶

When two pairs tie on count, the trainer must pick one deterministically. Common conventions:

Lexicographic on the pair tuple (what we use). (b'a', b'b') < (b'a', b'c') — Python's tuple comparison handles this.
First-seen wins (order-dependent — bad for reproducibility unless you canonicalize the corpus traversal order).
Random tie-break with explicit seed (extra knob — we avoid it).

Without explicit tie-breaking, training is non-deterministic even with a fixed seed. The hash-iteration order of Python dicts has been stable since 3.7 but was a real bug in earlier versions. We use convention (1) and document it.

Implementation sketch (for lab 01)¶

class BPETokenizer:
    def __init__(self):
        self.vocab: list[bytes] = []
        self.bytes_to_id: dict[bytes, int] = {}
        self.merges: list[tuple[bytes, bytes]] = []

    def train(self, corpus: Iterable[str], vocab_size: int, ...):
        # Step 0
        self._reserve_specials_and_bytes(...)

        # Step 0.5: convert corpus to pretoken-tuples-with-counts
        pretokens: dict[tuple[bytes, ...], int] = collections.Counter(...)

        # Step 1-4: iterate
        while len(self.vocab) < vocab_size:
            pair_counts = self._count_pairs(pretokens)
            if not pair_counts:
                break
            winner = max(pair_counts, key=lambda p: (pair_counts[p], p))
            pretokens = self._apply_merge(pretokens, winner)
            self._record_merge(winner)

    def encode(self, text: str) -> list[int]:
        bs = text.encode('utf-8')
        symbols = [bytes([b]) for b in bs]
        for (a, b) in self.merges:
            symbols = self._merge_in_place(symbols, a, b)
        return [self.bytes_to_id[s] for s in symbols]

    def decode(self, ids: list[int]) -> str:
        return b''.join(self.vocab[i] for i in ids).decode('utf-8', errors='replace')

errors='replace' in decode is a safety net for invalid byte sequences in the output (which can happen if a learner constructs bad IDs by hand; encode-then-decode round-trip on valid UTF-8 input is exact).

Drill problems¶

Solutions in solutions/02-bpe-algorithm-ref.md (phase open).

For the corpus ["he eats", "he eats", "I eat", "you eat"] with target vocab = base bytes + 4 merges, do 4 BPE iterations by hand. Show the pair-count table at each step and the resulting vocab. Predict whether eat becomes a single token by iteration 4.
Argue why applying merges in training order during encoding (rather than alphabetical order, or some other order) is necessary for encode(s) ≡ retokenize(decode(encode(s))) to hold.
With a corpus of \(N\) bytes and target vocab \(V\), the naive trainer recomputes pair counts from scratch each iteration. Suggest the simplest "incremental update" that avoids the recount and analyze its complexity.
The toy example above has ("o","r") merging before ("w","or"). Suppose instead the corpus had every "or" precomposed as a single byte (hypothetical). What would the merge sequence be? (Hint: trace the new pair counts.)

What this theory page does NOT cover¶

Pre-tokenization regex (GPT-2 pat). Touched in theory 01; not implemented in our BPE.
Subword regularization (BPE-dropout). Phase 18+ if needed; not used here.
Vocabulary serialization format. Lab 01 specifies the on-disk format (merges.txt + vocab.json).
Multi-corpus / multi-domain training. Out of scope; one corpus, one BPE.

One-paragraph recap¶

BPE training: initialize with all 256 bytes; repeatedly count adjacent pairs, pick the most frequent (with deterministic tie-breaking), merge it into a new symbol, until vocab size is reached. Encoding: apply trained merges in order, greedily, on a byte-tuple of the input. Decoding: concatenate the bytes for each ID, UTF-8 decode. Naive complexity is \(O(V \times N)\) for training; \(O(L \times \|\text{merges}\|)\) for encoding; both are fine for our small bilingual verb corpus. The under-discussed implementation detail is deterministic tie-breaking — we use lexicographic on the pair tuple.

Next: theory/03-byte-level-and-unicode.md.