Phase 11 — Quiz (human-readable mirror)

Human-readable mirror of the canonical file data/quizzes/phase-11-tokenization.yaml. The portal (Phase 41) consumes the YAML.

Source: data/quizzes/phase-11-tokenization.yaml.


q-11-01 — What is the core training objective of BPE? (single)

  • A random pair of adjacent symbols
  • The most-frequent pair of adjacent symbols in the corpus
  • The pair with the highest mutual information
  • The pair that minimizes perplexity on a held-out set

Vanilla BPE (Sennrich et al. 2016) greedily merges the most frequent pair until the vocab cap is hit.
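The greedy loop described above can be sketched in a few lines. This is a minimal illustration, not the course's reference implementation: the function name `train_bpe`, the toy word counts, and the use of a merge budget (`num_merges`) in place of an explicit vocab cap are all assumptions made for the example.

```python
from collections import Counter

def train_bpe(word_freqs, num_merges):
    """Greedy BPE sketch: repeatedly merge the most frequent adjacent pair.

    word_freqs maps each word to its corpus count (hypothetical toy input).
    num_merges stands in for the vocab cap: final vocab = base symbols + merges.
    """
    # Start with every word split into single-character symbols.
    vocab = Counter({tuple(word): freq for word, freq in word_freqs.items()})
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # nothing left to merge
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        # Rewrite every word, replacing the chosen pair with one merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

toy_counts = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges = train_bpe(toy_counts, num_merges=3)
print(merges)
```

On this toy input the first merge is `("u", "g")`, the pair with the highest count (20). No frequency-independent criterion such as mutual information or held-out perplexity is consulted.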


q-11-02 — Which properties are typical of byte-level BPE? (multi)

  • It can encode any UTF-8 string without unknown tokens
  • It needs a separate <UNK> token for unseen characters
  • Its base alphabet has 256 tokens
  • Whitespace must always be a single token

Byte-level BPE starts from the 256 raw bytes, so every UTF-8 string is representable; no <UNK> required.
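As a quick check of the claim above, the base alphabet can be sketched as "one token id per UTF-8 byte". The helper names `to_byte_ids` and `from_byte_ids` are hypothetical, and the sketch covers only the base alphabet before any merges are learned.

```python
def to_byte_ids(text):
    """Map text to base token ids: one id per UTF-8 byte, each in range(256)."""
    return list(text.encode("utf-8"))

def from_byte_ids(ids):
    """Invert to_byte_ids by decoding the raw bytes back to text."""
    return bytes(ids).decode("utf-8")

sample = "niño 🙂"  # includes a 2-byte letter and a 4-byte emoji
ids = to_byte_ids(sample)
print(len(ids), ids)
print(from_byte_ids(ids) == sample)
```

Every id falls in the range 0 to 255 and the round trip is lossless, so no input can fall outside the vocabulary.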


q-11-03 — Why pre-tokenize before BPE? (free)

Expected to contain: word.

Pre-tokenization keeps merges from crossing word/punctuation boundaries, preserving linguistic units and stabilizing vocab growth.
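The effect on pair counting can be illustrated with a deliberately simple splitter. The regex below is an assumption for illustration only (production tokenizers use more elaborate patterns, often keeping the leading space attached to each word); `pretokenize` and `count_pairs` are hypothetical helper names.

```python
import re
from collections import Counter

# Simplified pattern: runs of word characters, or runs of punctuation.
PRETOKEN_RE = re.compile(r"\w+|[^\w\s]+")

def pretokenize(text):
    """Split text into word and punctuation chunks, dropping whitespace."""
    return PRETOKEN_RE.findall(text)

def count_pairs(chunks):
    """Count adjacent character pairs inside each chunk, never across chunks."""
    pairs = Counter()
    for chunk in chunks:
        pairs.update(zip(chunk, chunk[1:]))
    return pairs

text = "Hello, world! ¿Qué tal?"
chunks = pretokenize(text)
pairs_with = count_pairs(chunks)      # merges confined to chunks
pairs_without = count_pairs([text])   # raw text: pairs can span boundaries
print(chunks)
print(("o", ",") in pairs_without, ("o", ",") in pairs_with)
```

Without pre-tokenization the pair `("o", ",")` is a merge candidate, so a token like `o,` could enter the vocabulary. With it, that pair is never counted.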


q-11-04 — Bytes vs codepoints on the §A13 bilingual corpus (single)

A learner trains byte-level BPE with vocab=256 on the bilingual §A13 corpus. The mean number of tokens per English word is 1.4; per Spanish word it is 2.1. What explains the gap?

  • Spanish words are intrinsically longer than English words
  • Accented characters cost 2 bytes each, so merges are spent just to reach parity with unaccented letters
  • The corpus has more English than Spanish examples
  • BPE is fundamentally biased toward Germanic languages

ñ, á, é, í, ó, ú, and ü are each 2 UTF-8 bytes. Each needs one merge to become a single token, so 7 merges are spent on alphabet recovery. Vocab=512 restores parity.
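The byte arithmetic can be verified directly. This sketch assumes the characters are in precomposed (NFC) form, which the `normalize` call enforces; the variable names are illustrative.

```python
import unicodedata

# Normalize to NFC so each accented letter is a single code point.
accented = unicodedata.normalize("NFC", "ñáéíóúü")

# UTF-8 length of each character in bytes.
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in accented}

# A k-byte character needs k - 1 merges to collapse into one token.
merges_for_alphabet = sum(n - 1 for n in byte_lengths.values())

print(byte_lengths)
print(merges_for_alphabet)
print(len("nino".encode("utf-8")), len("niño".encode("utf-8")))
```

Before any merges are learned, "niño" costs 5 base tokens where an unaccented 4-letter word costs 4.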


q-11-05 — Find the bug: tokenizer trained on wrong corpus (free)

Expected to contain: spanish.

Spanish text fragments into far more tokens than English (the merge schedule never learned Spanish bigrams), giving longer sequences and higher attention cost. Cross-link: break/00-break-train-english-only-bpe.md.
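The failure mode can be reproduced on a toy scale. The sketch below is an assumption-laden illustration, not the linked exercise: the English sentence, the merge budget, and the helpers `train`, `encode`, and `merge_pair` are all made up for the example.

```python
from collections import Counter

def merge_pair(seq, pair):
    """Replace each left-to-right occurrence of pair with one merged symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def to_bytes(word):
    """Split a word into single-byte symbols (the byte-level base alphabet)."""
    return [bytes([b]) for b in word.encode("utf-8")]

def train(words, num_merges):
    """Learn up to num_merges byte-level merges from a list of words."""
    seqs = [to_bytes(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(p for s in seqs for p in zip(s, s[1:]))
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        seqs = [merge_pair(s, best) for s in seqs]
    return merges

def encode(word, merges):
    """Apply the learned merges in training order."""
    seq = to_bytes(word)
    for pair in merges:
        seq = merge_pair(seq, pair)
    return seq

# Hypothetical English-only training text.
english_only = "the cat sat on the mat the cat ate the rat".split()
merges = train(english_only, num_merges=10)

en_len = len(encode("the", merges))
es_len = len(encode("niño", merges))
print(en_len, es_len)
```

The frequent English word collapses to a single token. The Spanish word stays at its raw byte count because none of the learned merges apply to it.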