Phase 20 — Quizzes (mirror)

The canonical questions live in data/quizzes/phase-20-evaluation-harness.yaml.

q-20-01 — Why PPL alone is misleading at §A13 scale

Prompt (EN): Which of the following is the strongest reason perplexity alone is a weak progress indicator on the §A13 corpus?

A. The corpus is bilingual, and PPL is not defined for bilingual data.
B. The conjugation-critical tokens are a small subset of the vocabulary; PPL averages over all tokens and dilutes the signal.
C. PPL is unbounded above; small models cannot achieve low PPL.
D. The model is too small to assign meaningful probabilities.

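Correct: B. PPL is the exponential of the mean negative log-probability over all tokens (equivalently, the geometric mean of the inverse token probabilities), and the conjugation tokens are a minority of those tokens. A model that nails filler tokens and fails on conjugation can still report a "good" PPL.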
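The dilution effect can be illustrated with a minimal sketch. The per-token probabilities and the 95/5 split between filler and conjugation-critical tokens below are made-up numbers chosen for illustration; they are not measurements from the §A13 corpus.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability over the given tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical sequence: 95 filler tokens and 5 conjugation-critical tokens.
filler = [0.9] * 95
good_conj = [0.9] * 5   # model gets the conjugation tokens right
bad_conj = [0.2] * 5    # model is mostly wrong on the conjugation tokens

ppl_good = perplexity(filler + good_conj)    # about 1.11
ppl_bad = perplexity(filler + bad_conj)      # about 1.20
ppl_bad_conj_only = perplexity(bad_conj)     # 5.0 on the critical tokens alone

print(f"all tokens, good conjugation: {ppl_good:.2f}")
print(f"all tokens, bad conjugation:  {ppl_bad:.2f}")
print(f"conjugation tokens only:      {ppl_bad_conj_only:.2f}")
```

In this toy setup the all-token PPL barely moves when conjugation quality collapses, while the PPL restricted to the critical tokens changes by a large factor.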

q-20-02 — Train-as-eval failure

Prompt (EN): A learner reports CCR-irregular = 87% on the §A13 grammar tutor. Investigation shows the eval was run on data/train/train.jsonl. What conclusion is appropriate?

A. The model has generalized excellently to irregular conjugation.
B. The reported number is a measure of memorization, not generalization, and is uninformative for ship/no-ship decisions.
C. The model overfitted to held-out data.
D. The eval harness is broken and the model's true CCR is unknown.

Correct: B. Evaluating on train measures memorization. The eval harness still functions correctly; the number is meaningful in a narrow sense (memorization capacity), just not for the deployment decision.
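One way to catch this failure before a number is reported is to measure how much of the eval set appears verbatim in train. The sketch below is an assumption about how such a guard could look, not part of the Phase 20 harness; the function name `exact_overlap_fraction` and the record schema are hypothetical.

```python
def exact_overlap_fraction(train_records, eval_records):
    """Fraction of eval records that also appear verbatim in train."""
    train_set = {r.strip() for r in train_records}
    eval_list = [r.strip() for r in eval_records]
    if not eval_list:
        raise ValueError("eval set is empty")
    hits = sum(1 for r in eval_list if r in train_set)
    return hits / len(eval_list)

# Hypothetical JSONL lines; the real schema of train.jsonl is not shown in this page.
train = [
    '{"prompt": "yo (ir) ayer", "target": "fui"}',
    '{"prompt": "tú (tener) hoy", "target": "tienes"}',
]
val = ['{"prompt": "ella (poner) ayer", "target": "puso"}']

frac_train_as_eval = exact_overlap_fraction(train, train)  # eval run on train
frac_val = exact_overlap_fraction(train, val)              # eval run on held-out data

print(frac_train_as_eval, frac_val)
```

An overlap fraction near 1.0 signals that the reported metric reflects memorization. A fraction of 0.0 only rules out exact duplicates; it does not rule out subtler leaks.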

q-20-03 — Companion metrics for PPL

Prompt (EN): Select every metric that is a useful companion to PPL for the §A13 grammar tutor.

A. Conjugation-correctness rate (CCR) sliced by regularity.
B. Token-level conditional accuracy restricted to the conjugation-critical alphabet.
C. Bilingual alignment accuracy.
D. Wall-clock training time.

Correct: A, B, C. Wall-clock time is an operational metric, not a quality one; it is useful for budgeting, not for evaluating model output.
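Option A, CCR sliced by regularity, can be sketched as follows. The input format (pairs of slice name and correctness flag) and the counts are assumptions made for illustration only.

```python
from collections import defaultdict

def ccr_by_slice(results):
    """Compute the conjugation-correctness rate per slice.

    results: iterable of (slice_name, is_correct) pairs.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for name, ok in results:
        totals[name] += 1
        correct[name] += int(ok)
    return {name: correct[name] / totals[name] for name in totals}

# Hypothetical probe outcomes: 10 regular verbs, 5 irregular verbs.
results = (
    [("regular", True)] * 9 + [("regular", False)]
    + [("irregular", True)] * 2 + [("irregular", False)] * 3
)

rates = ccr_by_slice(results)
overall = sum(ok for _, ok in results) / len(results)

print(rates)               # per-slice CCR
print(round(overall, 3))   # pooled CCR hides the weak irregular slice
```

Here the pooled rate looks moderate while the irregular slice is far weaker than the regular one, which is the reason for reporting the slices separately.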

q-20-04 — Probe leak

Prompt (EN): In one or two sentences, describe a probe-level data leak that is not caught by simply ensuring train.jsonl and val.jsonl are disjoint files.

Free response. Expected mentions: a probe's prompt prefix also appears in train; surface text overlaps between probes and training sentences; the target verb form was previously seen in training; probe construction must avoid prefixes that appear in train.
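A prefix-level check of the kind described above might look like the following sketch. The function name `leaked_probes`, the `min_chars` threshold, and the Spanish example sentences are all hypothetical; a real check would likely operate on tokens and on the actual probe format.

```python
def leaked_probes(probe_prompts, train_texts, min_chars=12):
    """Return probes whose normalized prompt appears verbatim inside any training text."""
    def norm(s):
        # Lowercase and collapse whitespace so trivial formatting differences do not hide a leak.
        return " ".join(s.lower().split())

    corpus = [norm(t) for t in train_texts]
    leaked = []
    for prompt in probe_prompts:
        p = norm(prompt)
        # Very short prompts match almost anything, so skip them.
        if len(p) >= min_chars and any(p in t for t in corpus):
            leaked.append(prompt)
    return leaked

# Hypothetical data: the files are disjoint, but one probe prefix occurs inside a training sentence.
train_texts = ["Ayer yo fui al mercado con mi hermana."]
probes = ["Ayer yo fui al", "Mañana nosotros iremos a"]

leaks = leaked_probes(probes, train_texts)
print(leaks)
```

The first probe is flagged even though no whole record is shared between the two sets, which is the kind of leak a file-level disjointness check misses.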

q-20-05 — Calibration vs accuracy

Prompt (EN): A model reports CCR = 80% but the average confidence (softmax max-prob) on its predictions is 99%. What kind of calibration does this indicate?

A. Well-calibrated.
B. Under-confident.
C. Over-confident.
D. Cannot determine from this data.

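Correct: C. Over-confident: the model assigns 99% to its argmax but is only correct 80% of the time. The 19-point gap between mean confidence and accuracy means the expected calibration error is large (at least about 0.19 when both numbers are computed over the same predictions). Phase 20's 02-calibration-and-adversarial.md covers the mitigation.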
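The gap in this question can be turned into a number with a minimal expected calibration error (ECE) sketch. It uses equal-width confidence bins, which is one common convention and an assumption here; the Phase 20 material may define the bins differently. The 100 predictions are synthetic and simply mirror the 80% / 99% figures from the prompt.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-weighted average of |mean confidence - accuracy| over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece

# Synthetic predictions: every prediction has confidence 0.99, and 80 of 100 are correct.
confidences = [0.99] * 100
correct = [True] * 80 + [False] * 20

ece = expected_calibration_error(confidences, correct)
print(round(ece, 2))  # about 0.19
```

A well-calibrated model with the same accuracy would report confidences near 0.80 and an ECE near zero.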