Phase 11 — Quiz (human-readable mirror)

Human-readable mirror of the canonical file data/quizzes/phase-11-tokenization.yaml. The portal (Phase 41) consumes the YAML.

Source: data/quizzes/phase-11-tokenization.yaml.

q-11-01 — What is the core training objective of BPE? (single)

A random pair of adjacent symbols
The most-frequent pair of adjacent symbols in the corpus ✓
The pair with the highest mutual information
The pair that minimizes perplexity on a held-out set

Vanilla BPE (Sennrich et al. 2016) greedily merges the most frequent adjacent pair, repeating until the vocabulary cap is reached.
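
The greedy loop described above can be sketched in a few lines. This is a minimal illustration, not the course's reference implementation: the names `train_bpe`, `merge_pair`, and `vocab_cap` are hypothetical, symbols are plain characters rather than bytes, and ties are broken by first-seen order, which is an assumption.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if symbols[i:i + 2] == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(words, vocab_cap):
    """Greedy BPE: merge the most frequent adjacent pair until the cap is hit."""
    # Each distinct word is a tuple of single-character symbols with a count.
    corpus = Counter(tuple(w) for w in words)
    vocab = {ch for w in corpus for ch in w}
    merges = []
    while len(vocab) < vocab_cap:
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair; ties go to the pair seen first (an assumption).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab.add(best[0] + best[1])
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges

# Toy usage: "ab" is the most frequent pair, so it is merged first.
print(train_bpe(["aab"] * 3 + ["ab"] * 2, vocab_cap=4))
```

Note that the selection step only counts frequencies; it never evaluates mutual information or held-out perplexity, which is why the other answer options are wrong.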

q-11-02 — Which properties are typical of byte-level BPE? (multi)

It can encode any UTF-8 string without unknown tokens ✓
It needs a separate <UNK> token for unseen characters
Its base alphabet has 256 tokens ✓
Whitespace must always be a single token

Byte-level BPE starts from the 256 raw bytes, so every UTF-8 string is representable; no <UNK> required.
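
As a quick check of the claim above, the sketch below maps strings to their base byte-level token ids and back. The helper name `to_byte_tokens` is hypothetical; the only assumption is that the base alphabet is exactly the 256 byte values.

```python
def to_byte_tokens(text):
    """Map a string to base token ids: one id (0..255) per UTF-8 byte."""
    return list(text.encode("utf-8"))

# English, Spanish, Japanese, and an emoji all fit in the same 256-id alphabet.
for sample in ["hello", "mañana", "日本語", "🙂"]:
    ids = to_byte_tokens(sample)
    assert all(0 <= i < 256 for i in ids)        # never outside the alphabet
    assert bytes(ids).decode("utf-8") == sample  # lossless round trip
    print(sample, "chars:", len(sample), "byte tokens:", len(ids))
```

Because every id falls in the base alphabet, there is nothing for an <UNK> token to cover; merges only shorten sequences, they never extend coverage.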

q-11-03 — Why pre-tokenize before BPE? (free)

Expected to contain: word.

Pre-tokenization keeps merges from crossing word/punctuation boundaries. This preserves linguistic units and stops the vocabulary from filling up with near-duplicates of the same word such as "dog.", "dog!", and "dog?".
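
The effect on pair counting can be illustrated with a small sketch. The pattern below is a simplified assumption (runs of word characters, punctuation, or whitespace), not the regex used by any particular production tokenizer, and `pre_tokenize` and `count_pairs` are hypothetical helper names.

```python
import re
from collections import Counter

# Simplified splitter (an assumption): word runs, punctuation runs, whitespace runs.
PRETOKEN_RE = re.compile(r"\w+|[^\w\s]+|\s+")

def pre_tokenize(text):
    """Split text into chunks; BPE merges are then confined to each chunk."""
    return PRETOKEN_RE.findall(text)

def count_pairs(chunks):
    """Count adjacent character pairs, never pairing across chunk boundaries."""
    counts = Counter()
    for chunk in chunks:
        counts.update(zip(chunk, chunk[1:]))
    return counts

text = "the dog. the dog! the dog?"
raw = count_pairs([text])                  # whole string treated as one chunk
bounded = count_pairs(pre_tokenize(text))  # pairs confined to each chunk

print(("g", ".") in raw, ("g", ".") in bounded)
print(bounded[("d", "o")])
```

Without the split, pairs such as ("g", ".") and (" ", "t") are merge candidates; with it, only pairs inside a word remain.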

q-11-04 — Bytes vs codepoints on the §A13 bilingual corpus (single)

A learner trains byte-level BPE with vocab=256 on the bilingual §A13 corpus. Mean-tokens-per-English-word is 1.4; mean-tokens-per-Spanish-word is 2.1. What explains the gap?

Spanish words are intrinsically longer than English words
Accented characters cost 2 bytes each, so merges are spent just to reach parity ✓
The corpus has more English than Spanish examples
BPE is fundamentally biased toward Germanic languages

ñ á é í ó ú ü are each 2 UTF-8 bytes. Each needs one merge to become a single token, so 7 merges go to alphabet recovery instead of to frequent Spanish bigrams. Vocab=512 restores parity.
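
The byte arithmetic can be verified directly. The sketch below assumes the characters are in precomposed (NFC) form, which is enforced explicitly; the variable names are illustrative only.

```python
import unicodedata

# Normalize to NFC so each accented letter is a single code point (assumption:
# the corpus is NFC; decomposed text would cost even more bytes per letter).
ACCENTED = unicodedata.normalize("NFC", "ñáéíóúü")

byte_lengths = {ch: len(ch.encode("utf-8")) for ch in ACCENTED}

# A character of n bytes needs n - 1 merges to become one token.
merges_needed = sum(n - 1 for n in byte_lengths.values())

print(byte_lengths)
print("merges spent on alphabet recovery:", merges_needed)

# The same word costs one extra base token once ñ replaces n.
print(len("manana".encode("utf-8")), len("mañana".encode("utf-8")))
```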

q-11-05 — Find the bug: tokenizer trained on wrong corpus (free)

Expected to contain: spanish.

Spanish text fragments into far more tokens than English (the merge schedule never learned Spanish bigrams), giving longer sequences and higher attention cost. Cross-link: break/00-break-train-english-only-bpe.md.
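
A toy reproduction of the bug is sketched below: a byte-level BPE is trained on English words only, then used to encode Spanish. Everything here is an illustrative assumption (the helper names, the tiny word lists, and measuring English on its own training words), so the exact numbers mean little; only the direction of the gap matters.

```python
from collections import Counter

def to_bytes(word):
    """Split a word into single-byte symbols."""
    return [bytes([b]) for b in word.encode("utf-8")]

def merge(symbols, pair):
    """Replace each adjacent occurrence of `pair` with its concatenation."""
    out, i = [], 0
    while i < len(symbols):
        if tuple(symbols[i:i + 2]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def train(words, num_merges):
    """Learn up to `num_merges` greedy merges from `words`."""
    corpus = [to_bytes(w) for w in words]
    merges = []
    for _ in range(num_merges):
        counts = Counter(p for seq in corpus for p in zip(seq, seq[1:]))
        if not counts:
            break
        best = max(counts, key=counts.get)
        merges.append(best)
        corpus = [merge(seq, best) for seq in corpus]
    return merges

def encode(word, merges):
    """Replay the learned merges, in order, on a new word."""
    seq = to_bytes(word)
    for pair in merges:
        seq = merge(seq, pair)
    return seq

def mean_tokens(words, merges):
    return sum(len(encode(w, merges)) for w in words) / len(words)

english = ["the", "then", "there", "other", "weather"]
spanish = ["niño", "mañana", "corazón"]

merges = train(english, num_merges=50)  # the bug: no Spanish in training data
mean_en = mean_tokens(english, merges)
mean_es = mean_tokens(spanish, merges)
print("English:", mean_en, "Spanish:", mean_es)
```

The two bytes of each accented letter never appear in the English corpus, so no learned merge can join them and every Spanish word stays fragmented.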