Skip to content

English · Español

03 — Byte-level BPE and Unicode pitfalls

Try it — text as raw bytes

🇪🇸 Operar sobre bytes (no caracteres Unicode) hace al tokenizer inmune a normalización, BOM, mojibake y entradas adversariales malformadas. El alfabeto base de 256 bytes nunca produce "out of vocab"; los emojis se tokenizan como sus bytes UTF-8 (cuatro bytes por carácter en el peor caso) y pueden o no fusionarse según el corpus. Para nuestro corpus bilingüe (inglés + español, §A13), los caracteres acentuados como á o ñ son 2 bytes cada uno antes de cualquier fusión.


Byte vs character: the surprisingly deep distinction

A "character" is a Unicode code point — an abstraction. A "byte" is a number 0–255 — concrete. The translation between them is an encoding (UTF-8, UTF-16, Latin-1, …).

UTF-8 maps Unicode code points to byte sequences of length 1 to 4:

  • ASCII (az, AZ, 09, common punctuation): 1 byte each. Same as the ASCII byte value.
  • Latin-1 supplement (á, ñ, é, ó, ú, ¿, ¡, …): 2 bytes each.
  • Latin Extended (ě, ą, …): 2 bytes each.
  • CJK (Chinese, Japanese, Korean): 3 bytes each.
  • Emoji, supplementary planes: 4 bytes each.

So "mañana" is 6 characters but 7 bytes: b'm', b'a', b'\xc3', b'\xb1', b'a', b'n', b'a' (the ñ is 0xC3 0xB1). And "🦊" is 1 character but 4 bytes: b'\xf0\x9f\xa6\x8a'.

Byte-level BPE: operate on the encoded bytes

Instead of starting with a "vocab of Unicode characters" (which is theoretically infinite — Unicode has ~150,000 assigned code points and growing), byte-level BPE starts with a fixed vocab of 256 bytes. Every input is first UTF-8 encoded to bytes, then BPE runs on the byte sequence.

Consequences:

  1. Never out of vocab. Every byte 0–255 is in the base vocab. Any input is some byte sequence; ergo, any input is tokenizable.
  2. Unicode-normalization-invariant. Two strings that render identically but have different code-point sequences (NFC vs NFD) tokenize as different byte sequences. We don't try to "fix" this; we tokenize the actual bytes the user gave us.
  3. Robust to malformed input. Garbage bytes (e.g., 0xFF, which can't appear in valid UTF-8) tokenize just fine — they're byte values like any other.

The cost: Spanish accents are 2 base tokens before any merges; emoji are 4. With BPE training on our verb corpus, the frequent Spanish multi-byte sequences (the ñ of mañana, the á of está) get merged into single tokens. The first iteration that picks up 0xC3 + 0xB1 is the one that births the ñ token. Observable in lab 02.

The Unicode normalization trap

Same-looking strings can be different byte sequences:

"mañana"   # NFC: 7 bytes (precomposed ñ): 6d 61 c3 b1 61 6e 61
"mañana"   # NFD: 8 bytes (n + combining tilde): 6d 61 6e cc 83 61 6e 61

(The visible glyphs are identical; the byte sequences differ.) A character-level tokenizer would have to decide which one is "canonical". A byte-level tokenizer just tokenizes whatever bytes it gets.

Why this matters:

  • Reproducibility: two corpora that are "the same text" can have different byte sequences if one was saved with macOS's HFS+ NFD normalization and the other with Linux's typical NFC. Byte-level BPE will train different merges on them.
  • Input pipeline consistency: when the Phase 32 grammar-tutor agent receives a learner's sentence pasted from an unknown source, NFC/NFD ambiguity could cause mañana to tokenize one way during training and another at inference.

For our corpus, the snippet generator (Phase 12 scripts/gen_corpus.py) emits all text using Python's default str.encode('utf-8'). Python strings are stored as Unicode and by default do not normalize. The corpus generator must explicitly call unicodedata.normalize('NFC', s) on every emitted line, and we encode that decision in the corpus BLUEPRINT. Inference inputs (Phase 32) go through the same NFC normalizer before tokenization.

Document this; consistency matters more than the specific choice.

Why GPT-2 went byte-level (the historical detail)

GPT-1 used a vocabulary of Unicode characters. To handle every Unicode character, you'd need a vocab in the millions. GPT-1 sidestepped this by training only on filtered English; everything else was <unk>.

GPT-2 wanted to handle multilingual text robustly. The team noticed: if BPE runs on bytes, not characters, the base vocab is always exactly 256 — regardless of how many languages or how exotic the text. The merges can then absorb common multi-byte patterns. Byte-level BPE was the trick that made GPT-2 multilingual.

We inherit this design. Our corpus is bilingual (English + Spanish per §A13), so byte-level isn't future-proofing — it is the right choice today. The Spanish accents (á, é, í, ó, ú, ñ, ¿, ¡) appear often enough across the conjugation matrix that BPE will merge them into single tokens, then merge those into common stem fragments (trab, habl, ).

Display: a token isn't always its bytes

A token's display form (what we write in plots and debug output) is not always the same as its byte representation. Examples:

  • The byte 0x20 (space) is displayed as ' '.
  • The bytes b'\xc3\xb1' (the ñ of mañana) are displayed as the character 'ñ' if the terminal can render UTF-8.
  • The byte b'\xff' (which never appears in valid UTF-8) is displayed as '\\xff' (escaped).

GPT-2's tokenizer has a notorious quirk: it remaps each byte to a printable Unicode character before display, using a fixed table. So byte 0x20 displays as a special 'Ġ' character (visible-leading-space marker). This makes top-merges plots more readable. We don't bother with this remapping; our corpus is small and we can show the literal bytes (with \\xNN escapes for non-ASCII) plus a side-by-side rendered glyph column in plots.

The Vocab dataclass (per src/minitoken/BLUEPRINT.md) has separate fields for id_to_bytes (canonical) and id_to_display (human-readable). Plots use id_to_display; encoding/decoding uses id_to_bytes.

What byte-level BPE handles and what it doesn't

What byte-level BPE handles:

  • Garbage bytes. b'\xff\xfe\xfd' tokenizes as three byte-tokens.
  • Mixed encodings. If a user pastes Latin-1 bytes that aren't valid UTF-8, byte-level treats them as bytes — no crash.
  • Adversarial whitespace. Tabs, non-breaking spaces, zero-width joiners all become their bytes.
  • BOM markers (\xef\xbb\xbf at the start of a file). Just bytes.
  • Spanish accents and emoji. Tokenized as their UTF-8 byte sequences, possibly merged after training.

What byte-level BPE does NOT handle (these are model-level concerns, not tokenizer-level):

  • Visual confusables. O (Latin O) vs Ο (Greek Omicron) vs 0 (digit zero). Different bytes, different tokens, identical or near-identical rendering. The model can't tell from the token alone. Mitigation requires character-class normalization on the input pipeline, not the tokenizer.
  • Steganographic encoding. Hidden data in whitespace lengths or in invisible characters (zero-width joiners). All tokenize; the model sees them as tokens. Mitigation is a model-input sanitizer.
  • Tokenization aliasing. Crafting input such that two visually-different sentences tokenize identically. Unlikely in practice for greedy BPE on our small corpus, but the lab's property tests should sanity-check.

We surface these in Phase 32, not Phase 11. Here, the takeaway is: byte-level BPE makes the tokenizer a non-issue.

NFC/NFD normalization decision for our project

Decision (recorded in BLUEPRINT):

  • Training corpus: scripts/gen_corpus.py (Phase 12) calls unicodedata.normalize('NFC', line) on every emitted line.
  • Inference inputs: the Phase 32 agent wrapper applies the same NFC normalization before tokenization.

Document this; consistency matters more than the specific choice.

Implementation note: file I/O modes

  • Always read training files in binary mode (open(path, 'rb')). Text mode (open(path, 'r')) does encoding conversion on Windows and may also normalize line endings. Binary preserves bytes exactly.
  • Always write tokenizer save files in binary or explicit UTF-8. A vocab.json saved on a Windows machine and loaded on Linux must round-trip byte-identically.

Lab 01's BPE implementation must do binary I/O. Common bug source.

Drill problems

Solutions in solutions/03-byte-level-and-unicode-ref.md (phase open).

  1. How many bytes does "¿cómo estás?" (a typical Spanish question) take in UTF-8? Hand-count the multi-byte characters; verify with len("¿cómo estás?".encode('utf-8')).
  2. Argue why byte-level BPE makes a tokenizer's behavior independent of the source file's encoding (assuming you read in binary). Why does this matter for reproducibility across operating systems?
  3. A malicious input is a 100-byte sequence containing only b'\xff'. The tokenizer trained on the verb corpus has never seen this byte. What does encode produce? (Hint: think about the base vocab and merges.)
  4. A learner pastes cafe (no accent) and café (with accent). Both are valid English/Spanish-borrowed words. Show how they tokenize differently and why that difference is information the model can use, not a bug.

What this theory page does NOT cover

  • Locale-aware case folding. Out of scope; we tokenize bytes as-is.
  • Compression-ratio analysis of byte-level vs char-level on our corpus. A lab 02 exercise, not a theory page.
  • Multi-script corpora beyond English+Spanish. Survey-mention only.
  • Pre-tokenization regex implementations (the GPT-2 pat). Theory 01 mentions; not implemented here.

One-paragraph recap

Byte-level BPE operates on UTF-8 bytes, with a fixed 256-byte base vocab. This eliminates out-of-vocab (every byte is in the vocab), Unicode-normalization headaches (we tokenize the bytes the user gave us), and adversarial-input crashes. The cost is that Spanish accents take 2 base tokens before merges, and emoji take 4 — but the merges absorb those into single tokens whenever they're frequent. For our bilingual English-Spanish verb corpus, byte-level is the correct (not just future-proof) choice. Tokenizer-level robustness is solid; visual confusables and aliasing remain model-level concerns for Phase 32.


Next: lab/00-toy-bpe-by-hand.md.