Phase 11 — Quiz (human-readable mirror)
Human-readable mirror of the canonical file
data/quizzes/phase-11-tokenization.yaml. The portal (Phase 41) consumes the YAML.
Source: data/quizzes/phase-11-tokenization.yaml.
q-11-01 — What is the core training objective of BPE? (single)
- A random pair of adjacent symbols
- The most-frequent pair of adjacent symbols in the corpus ✓
- The pair with the highest mutual information
- The pair that minimizes perplexity on a held-out set
Vanilla BPE (Sennrich et al. 2016) greedily merges the most frequent pair until the vocab cap is hit.
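The greedy loop described above can be sketched as a toy trainer. This is a minimal illustration, not the reference implementation: the function name `train_bpe`, the `num_merges` parameter (standing in for the vocab cap, since vocab size is the base alphabet plus the number of merges), and the sample word list are all assumptions made for the example.

```python
from collections import Counter


def train_bpe(words, num_merges):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent pair."""
    # Each distinct word starts as a tuple of single-character symbols.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge
        # Greedy choice: the most frequent pair (ties go to the first seen).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with that pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges


# Hypothetical sample corpus, chosen only to make the counts easy to follow.
merges = train_bpe(["low", "low", "lower", "lowest", "new", "newer"], num_merges=3)
print(merges)
```

On this sample, the pair `("l", "o")` occurs four times and is merged first, which is the frequency-based choice the correct answer describes; no mutual-information or perplexity term appears anywhere in the loop.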
q-11-02 — Which properties are typical of byte-level BPE? (multi)
- It can encode any UTF-8 string without unknown tokens ✓
- It needs a separate <UNK> token for unseen characters
- Its base alphabet has 256 tokens ✓
- Whitespace must always be a single token
Byte-level BPE starts from the 256 raw byte values, so every UTF-8 string is representable and no <UNK> token is required.
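A short check makes the "no unknown tokens" property concrete. The sketch below covers only the base-alphabet step (before any merges), and the helper name `to_byte_ids` and the sample string are assumptions made for illustration.

```python
def to_byte_ids(text: str) -> list[int]:
    """Map a string to its UTF-8 byte values, the base alphabet of byte-level BPE."""
    return list(text.encode("utf-8"))


sample = "niño 🙂"
ids = to_byte_ids(sample)

# Every id is one of the 256 possible byte values, so nothing is out of vocabulary.
in_range = all(0 <= i < 256 for i in ids)

# The mapping is lossless: decoding the bytes gives back the original text.
round_trip = bytes(ids).decode("utf-8")

print(len(ids), in_range, round_trip == sample)
```

Even the emoji, which a character-level vocabulary might never have seen, decomposes into ordinary byte values.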
q-11-03 — Why pre-tokenize before BPE? (free)
Expected to contain: word.
Pre-tokenization keeps merges from crossing word/punctuation boundaries, preserving linguistic units and stabilizing vocab growth.
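To illustrate the boundary effect, the sketch below splits text into chunks with a deliberately simplified regular expression and then counts adjacent pairs inside each chunk only. The pattern `PRETOKEN_RE` and the helper names are assumptions for this example; they are not the pattern used by any particular tokenizer.

```python
import re
from collections import Counter

# Simplified, hypothetical pattern: an optional leading space plus a run of
# word characters, or an optional leading space plus a run of punctuation,
# or a run of whitespace.
PRETOKEN_RE = re.compile(r"\s?\w+|\s?[^\w\s]+|\s+")


def pre_tokenize(text: str) -> list[str]:
    """Split text into chunks; BPE merges are then learned within chunks only."""
    return PRETOKEN_RE.findall(text)


def adjacent_pairs(chunks: list[str]) -> Counter:
    """Count adjacent character pairs without crossing chunk boundaries."""
    pairs = Counter()
    for chunk in chunks:
        pairs.update(zip(chunk, chunk[1:]))
    return pairs


text = "Hola, mundo!"
chunks = pre_tokenize(text)
with_pretok = adjacent_pairs(chunks)
without_pretok = adjacent_pairs([text])  # whole string treated as one chunk

print(chunks)
print(("a", ",") in with_pretok, ("a", ",") in without_pretok)
```

With pre-tokenization the pair `("a", ",")` is never counted, so it can never become a merge; without it, the letter and the comma are merge candidates like any other pair.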
q-11-04 — Bytes vs codepoints on the §A13 bilingual corpus (single)
A learner trains byte-level BPE with vocab=256 on the bilingual §A13 corpus. Mean-tokens-per-English-word is 1.4; mean-tokens-per-Spanish-word is 2.1. What explains the gap?
- Spanish words are intrinsically longer than English words
- Accented characters cost 2 bytes each, consuming merges for parity ✓
- The corpus has more English than Spanish examples
- BPE is fundamentally biased toward Germanic languages
ñ á é í ó ú ü are each 2 UTF-8 bytes. Each needs one merge to become a single token, so 7 merges are spent on alphabet recovery. Vocab=512 restores parity.
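The byte arithmetic behind this explanation can be checked directly. The sketch assumes the characters are in precomposed (NFC) form, which the code enforces with `unicodedata.normalize`; the variable names and the sample words are illustrative only.

```python
import unicodedata

# Normalize to NFC so each accented letter is a single code point
# (an assumption about how the corpus is stored).
ACCENTED = unicodedata.normalize("NFC", "ñáéíóúü")

# UTF-8 length of each character: ASCII letters take 1 byte, these take 2.
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in ACCENTED}

# A 2-byte character needs one merge to become a single token.
merges_needed = sum(n - 1 for n in byte_lengths.values())

# Before any merges, a word costs one token per byte.
spanish_word = unicodedata.normalize("NFC", "niño")
english_word = "child"
print(byte_lengths)
print(merges_needed)
print(len(spanish_word), len(spanish_word.encode("utf-8")))  # characters vs bytes
print(len(english_word), len(english_word.encode("utf-8")))
```

The English word has as many bytes as characters, while the Spanish word carries one extra byte for its accented letter, which is the overhead the merge budget has to absorb.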
q-11-05 — Find the bug: tokenizer trained on wrong corpus (free)
Expected to contain: spanish.
Spanish text fragments into far more tokens than English (the merge schedule never learned Spanish bigrams), giving longer sequences and higher attention cost. Cross-link: break/00-break-train-english-only-bpe.md.