
04 — Why perplexity overstates progress on a tiny corpus

Perplexity is the favorite metric for a reason: it is cheap and comparable. But at §A13 scale (240 training sentences, ~600 total forms), PPL systematically misleads about progress. Here we look at how, why, and what to measure instead.

This page extends theory/01-metrics-catalog.md with the §A13-specific reasons perplexity is misleading, and lists the metrics we use instead (or alongside) to get an honest picture.

The setup

§A13 corpus: 20 verbs × 5 tenses × 3 persons = 300 (verb, tense, person) cells, each producing one English form and its Spanish translation. Total ~600 forms. Training set: 240 sentences (each embeds one form into a small frame). Vocabulary: ~512 BPE tokens. Mini-GPT: ~103k parameters.

A "well-trained" §A13 model achieves training PPL \(\approx 2.5\) and held-out PPL \(\approx 4.0\). These numbers look great compared to GPT-2 (which has training PPL \(\approx 20\) on its 40GB corpus). The numbers are not comparable, and confusing them is the first failure mode.

Why PPL is misleading here — three reasons

1. The vocabulary is small, so the starting point is already low

PPL is bounded below by 1 (a perfect model) and equals \(V\) for a uniform-random model, which is the practical ceiling for any model that has learned anything. For our vocab \(V = 512\), an uninformed model has PPL = 512. A model that knows just the unigram frequencies (the most common BPE tokens are the language separators and articles) gets to PPL \(\approx 30\) for "free", with no grammar learned at all. Most of the gap from 512 to 4.0 is easy learning (unigram, bigram); the hard part, verb conjugation, is the last factor of 2 from 8 to 4.

A 10% improvement in PPL at this scale (4.0 → 3.6) might correspond to a 50% improvement in the grammar task or a 5% improvement, depending on which tokens were learned better. PPL doesn't tell you.
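As a minimal sketch of the two reference points above, the snippet below computes perplexity as the exponential of the mean negative log-probability and checks that a uniform model over \(V = 512\) tokens lands at 512. The Zipf-like frequency profile is a hypothetical stand-in, not a measurement of the §A13 corpus, so its unigram PPL will not match the \(\approx 30\) quoted above; it only illustrates that any skewed unigram distribution falls below \(V\) without any grammar.

```python
import math

def perplexity(token_probs):
    """PPL = exp(mean negative log-probability of the observed tokens)."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

def unigram_perplexity(freqs):
    """PPL of a unigram model scored on text drawn from its own distribution.

    This equals exp(entropy) of the normalized frequencies.
    """
    total = sum(freqs)
    probs = [f / total for f in freqs if f > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)

V = 512  # vocabulary size from the setup section

# Uniform model: every observed token gets probability 1/V, so PPL equals V.
uniform_ppl = perplexity([1.0 / V] * 1000)

# ASSUMPTION: hypothetical Zipf-like frequencies (1/rank), not the real corpus.
zipf_freqs = [1.0 / rank for rank in range(1, V + 1)]
zipf_ppl = unigram_perplexity(zipf_freqs)

print(f"uniform PPL: {uniform_ppl:.1f}")
print(f"unigram PPL under the assumed Zipf profile: {zipf_ppl:.1f}")
```

The more skewed the real token frequencies are, the lower the unigram PPL, which is why the first large drop in PPL says little about conjugation.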

2. Conjugation lives on a small subset of tokens

Of the ~512 BPE tokens, only ~30 are "conjugation-critical": the verb-stem fragments (e.g., work, wri, tten), the morphological suffixes (-s, -ed, -ing), and the irregular forms (is, was, had, gone, done). The other ~480 tokens are filler — articles, pronouns, punctuation, spaces. A model that nails the 480 filler tokens and randomly guesses the 30 conjugation tokens has PPL that's 6%–10% higher than a model that does the opposite. Both are useless. PPL collapses the difference.
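One way to see why the aggregate barely moves: log-PPL is a mean over positions, so the aggregate PPL is a position-weighted geometric mean of the per-class PPLs. The sketch below assumes, purely for illustration, that 10% of token positions are conjugation-critical and that filler PPL is 3.5; neither number comes from the §A13 corpus.

```python
import math

def aggregate_ppl(ppl_conj, ppl_filler, conj_fraction):
    """Aggregate PPL as the position-weighted geometric mean of per-class PPLs.

    conj_fraction is the share of token positions that are conjugation-critical.
    """
    log_ppl = (conj_fraction * math.log(ppl_conj)
               + (1.0 - conj_fraction) * math.log(ppl_filler))
    return math.exp(log_ppl)

# ASSUMPTION: illustrative values, not measured on the §A13 corpus.
CONJ_FRACTION = 0.10
FILLER_PPL = 3.5

good = aggregate_ppl(ppl_conj=2.0, ppl_filler=FILLER_PPL, conj_fraction=CONJ_FRACTION)
bad = aggregate_ppl(ppl_conj=4.0, ppl_filler=FILLER_PPL, conj_fraction=CONJ_FRACTION)

# Doubling the conjugation PPL multiplies the aggregate by 2 ** 0.10.
relative_change = bad / good - 1.0
print(f"aggregate PPL: {good:.3f} -> {bad:.3f} ({relative_change:.1%} worse)")
```

Under these assumed numbers, a model that is twice as uncertain on every conjugation token shows up as a single-digit percentage change in the headline PPL.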

3. PPL doesn't separate generalization from memorization

With 240 training sentences and 60 held-out sentences, a model with 103k parameters can memorize a substantial fraction of the train set. Train PPL = 1.5 against val PPL = 4.0 looks like "overfitting by about 2.7×". But what is actually happening is that the model memorized the surface forms it saw, and on held-out forms it is guessing morphology rules from the wrong abstraction. The held-out PPL of 4.0 hides whether the model is

(a) correctly conjugating new sentences with familiar verbs (good generalization),
(b) failing on unfamiliar verbs but getting filler right (selective generalization),
(c) failing on unfamiliar morphology even with familiar verbs (no rule learned).

The aggregate number cannot distinguish these three regimes.

What to measure instead — three companion metrics

A. Token-level conditional accuracy on the conjugation-critical alphabet

For each held-out sentence with a verb form, compute the model's argmax prediction at the position where the conjugation token would land. Restrict the alphabet to the conjugation-critical tokens (e.g., when the position should be -s, score whether argmax over {ø, -s, -ed, -ing} is -s).

Formula:

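\[\text{ConjAcc} = \frac{1}{|D_\text{conj}|} \sum_{(x_{<i}, y_i^*) \in D_\text{conj}} \mathbb{1}\left[ \arg\max_{v \in V_\text{conj}} p_\theta(v \mid x_{<i}) = y_i^* \right]\]

Where \(D_\text{conj}\) is the set of held-out conjugation positions, each a pair of context \(x_{<i}\) and gold token \(y_i^*\), and \(V_\text{conj}\) is the conjugation-critical alphabet at position \(i\) (determined by the grammar context).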

This is much harder to game. Random over a 4-element alphabet gives 25% accuracy. A model that perfectly knows the unigram distribution gets ~30%. A model that has learned the rule gets to 80%+. The gap from 30% → 80% is exactly the grammar signal we care about.
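A minimal sketch of the ConjAcc computation follows. The input layout (a score dictionary, a candidate alphabet, and a gold token per position) is an assumption made for illustration; a real evaluation would read these scores from the model's output distribution at each conjugation position.

```python
def conj_acc(examples):
    """Fraction of conjugation positions where the restricted argmax is correct.

    Each example is a tuple (scores, candidates, gold):
      scores     -- dict mapping token -> model probability (or logit)
      candidates -- the conjugation-critical alphabet for that position
      gold       -- the correct token at that position
    """
    if not examples:
        raise ValueError("need at least one conjugation position")
    correct = 0
    for scores, candidates, gold in examples:
        # Argmax restricted to the candidate alphabet; tokens missing from
        # `scores` are treated as having the lowest possible score.
        predicted = max(candidates, key=lambda tok: scores.get(tok, float("-inf")))
        correct += predicted == gold
    return correct / len(examples)

# ASSUMPTION: hand-written toy scores, not real model output.
SUFFIXES = ["ø", "-s", "-ed", "-ing"]
demo = [
    # Unrestricted argmax would be "the", but within the alphabet it is "-s".
    ({"the": 0.50, "-s": 0.20, "ø": 0.15, "-ed": 0.10, "-ing": 0.05}, SUFFIXES, "-s"),
    # Restricted argmax is "ø", gold is "-ed": counted as wrong.
    ({"ø": 0.40, "-s": 0.30, "-ed": 0.20, "-ing": 0.10}, SUFFIXES, "-ed"),
]
demo_acc = conj_acc(demo)
print(f"ConjAcc on the toy examples: {demo_acc:.0%}")
```

Restricting the argmax to the candidate alphabet is what keeps filler tokens from masking the result: the first toy example is scored as correct even though a filler token has the highest overall probability.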

B. Conjugation-correctness rate (CCR) over the (verb × tense × person) grid

For each of the 300 cells in the §A13 grid, generate the form (from English prompt or by free generation), and check whether the generated form matches the canonical one. CCR is the fraction of correct cells.

Crucially, CCR is reported with slices:

CCR-regular (12 verbs × 5 × 3 = 180 cells): the model that learned the regular rule should get \(\geq 90\%\).
CCR-irregular (8 verbs × 5 × 3 = 120 cells): irregulars are memorization; expect \(\geq 75\%\) after enough training.
CCR-rare-tense (past participle only, reported as CCR-past-participle in the table below): the rarest form in the corpus; expect 60-80% after enough training.

The cross-tabulation reveals which slices the model is failing on, which PPL hides.
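The slicing described above can be sketched as a single grouping function. The record fields (`verb`, `tense`, `person`, `regular`, `generated`, `canonical`) are hypothetical names chosen for this example, and the four records are toy data rather than §A13 results.

```python
from collections import defaultdict

def ccr_by_slice(cells, slice_key):
    """Return {slice_name: share of cells whose generated form is canonical}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for cell in cells:
        name = slice_key(cell)
        totals[name] += 1
        hits[name] += cell["generated"] == cell["canonical"]
    return {name: hits[name] / totals[name] for name in totals}

# ASSUMPTION: toy records with illustrative field names.
cells = [
    {"verb": "work", "tense": "past", "person": 3, "regular": True,
     "generated": "worked", "canonical": "worked"},
    {"verb": "work", "tense": "present", "person": 3, "regular": True,
     "generated": "works", "canonical": "works"},
    {"verb": "go", "tense": "past", "person": 3, "regular": False,
     "generated": "goed", "canonical": "went"},
    {"verb": "go", "tense": "past participle", "person": 3, "regular": False,
     "generated": "gone", "canonical": "gone"},
]

by_regularity = ccr_by_slice(cells, lambda c: "regular" if c["regular"] else "irregular")
by_tense = ccr_by_slice(cells, lambda c: c["tense"])
overall = ccr_by_slice(cells, lambda c: "all")["all"]

print("overall CCR:", overall)
print("by regularity:", by_regularity)
print("by tense:", by_tense)
```

Passing the slice as a function keeps one code path for every cut (regularity, tense, person, or a combination), so the slices are always computed over the same cells.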

C. Bilingual alignment accuracy

For each (English form, Spanish form) pair in the §A13 grid, prompt the model to translate one way and check exact match. This is the §A2 bilingual signal — if the model has a real cross-lingual representation, alignment accuracy correlates with grammatical reasoning. If alignment is high but conjugation is low, the model memorized the dictionary without grammar.
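A minimal sketch of the exact-match check, assuming a `translate` callable that wraps whatever prompting scheme is used. Here it is replaced by a lookup table so the example runs on its own; the table and the four pairs are illustrative, not model output.

```python
def alignment_accuracy(pairs, translate):
    """Exact-match rate of translate(source) against the reference target form."""
    if not pairs:
        raise ValueError("need at least one (source, target) pair")
    matches = sum(translate(src).strip() == tgt.strip() for src, tgt in pairs)
    return matches / len(pairs)

# ASSUMPTION: a lookup table stands in for the model; it is NOT the mini-GPT.
toy_table = {"works": "trabaja", "worked": "trabajó", "went": "fue"}

def toy_translate(form):
    """Hypothetical stand-in for prompting the model to translate one form."""
    return toy_table.get(form, "")

pairs = [("works", "trabaja"), ("worked", "trabajó"), ("went", "fue"), ("gone", "ido")]
en_to_es = alignment_accuracy(pairs, toy_translate)
print(f"EN->ES alignment accuracy: {en_to_es:.0%}")
```

Running the same function with the pairs reversed and a Spanish-to-English `translate` gives the other direction, and the two can be reported side by side.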

Numerical example — what the three metrics look like for a healthy run

After 2000 training steps with the recommended Phase 18 config, the §A13 mini-GPT reaches roughly:

Metric	Train	Val
PPL (aggregate)	1.9	3.8
PPL (English)	1.7	3.6
PPL (Spanish)	2.0	4.0
ConjAcc	88%	76%
CCR-regular	95%	82%
CCR-irregular	87%	64%
CCR-past-participle	73%	55%
Bilingual alignment	81%	68%

The PPL gap is 2.0× (val 3.8 vs train 1.9), which would scream "overfit" in the GPT-2 setting. The CCR tells a more nuanced story: the model has learned the regular rule almost perfectly on train (95%), generalizes to held-out regulars (82%) at a healthy margin, and is partway through learning the irregulars. The past-participle row is the real weak spot: held-out past participles at 55% suggest the model has seen too few examples to generalize.
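The arithmetic behind this reading can be checked directly from the table. The sketch below copies the table values, computes the val/train PPL ratio, and lists the train-minus-val gap for each accuracy metric; the dictionary layout is just a convenient encoding for this example.

```python
# Values copied from the table above: (train, val).
ppl = {"aggregate": (1.9, 3.8), "English": (1.7, 3.6), "Spanish": (2.0, 4.0)}
accuracy = {  # percentages
    "ConjAcc": (88, 76),
    "CCR-regular": (95, 82),
    "CCR-irregular": (87, 64),
    "CCR-past-participle": (73, 55),
    "Bilingual alignment": (81, 68),
}

# PPL is compared as a ratio, accuracies as a difference in percentage points.
ppl_ratio = ppl["aggregate"][1] / ppl["aggregate"][0]
gaps = {name: train - val for name, (train, val) in accuracy.items()}

lowest_val = min(accuracy, key=lambda name: accuracy[name][1])
largest_gap = max(gaps, key=gaps.get)

print(f"aggregate val/train PPL ratio: {ppl_ratio:.1f}x")
for name, gap in gaps.items():
    print(f"{name}: train-val gap = {gap} points, val = {accuracy[name][1]}%")
print("lowest held-out score:", lowest_val)
print("largest train-val gap:", largest_gap)
```

On these numbers the past-participle slice has the lowest held-out score, while the irregular slice shows the largest train-to-val drop, so both rows are worth tracking separately.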

A naive "improve PPL" mindset would push for tighter training (more steps, lower LR floor). The CCR view suggests a better intervention: augment the corpus with more past-participle examples (within §A13 scope), since that's where the signal-to-data ratio is worst.

When PPL is still useful

Cross-checkpoint comparison within the same run: if PPL drops from 4.0 → 3.5 from checkpoint 1500 to 2000, something improved.
Quick "is training broken" signal: if PPL spikes, training is broken regardless of the conjugation metrics.
Aggregating across many model variants for a high-level sweep.

PPL just shouldn't be the only thing you report.

Citation

Goldberg, Y. (2017). Neural Network Methods for Natural Language Processing. Chapter 6, sections on intrinsic vs extrinsic evaluation. The argument that "intrinsic metrics like PPL are weak proxies for extrinsic task quality" is the textbook source for the framing on this page.

One-paragraph recap

At §A13 scale, perplexity overstates progress for three reasons: the small vocabulary makes the "easy" learning cheap, the conjugation-critical token subset is small relative to the filler tokens, and the aggregate PPL collapses memorization and generalization. The fixes are companion metrics: token-level conditional accuracy on the conjugation-critical alphabet, conjugation-correctness rate sliced by verb / tense / person / regularity, and bilingual alignment accuracy. A healthy §A13 run has train-PPL ≈ 1.9 and val-PPL ≈ 3.8, but the CCR breakdown is what tells you whether the model learned the rule or memorized the surface.

Cross-refs: theory/01-metrics-catalog.md (the full metrics catalog), theory/02-metrics-math.md (the formulas), Phase 32 — the grammar tutor's success criterion is CCR, not PPL.