Phase 20 — Quizzes (mirror)
The canonical questions live in
data/quizzes/phase-20-evaluation-harness.yaml.
q-20-01 — Why PPL alone is misleading at §A13 scale
Prompt (EN): Which of the following is the strongest reason perplexity alone is a weak progress indicator on the §A13 corpus?
- A. The corpus is bilingual, and PPL is not defined for bilingual data.
- B. The conjugation-critical tokens are a small subset of the vocabulary; PPL averages over all tokens and dilutes the signal.
- C. PPL is unbounded above; small models cannot achieve low PPL.
- D. The model is too small to assign meaningful probabilities.
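Correct: B. PPL is the exponential of the mean negative log-probability over all tokens (equivalently, the geometric mean of the inverse token probabilities), and the conjugation-critical tokens are a minority of those tokens. A model that nails filler tokens and fails on conjugation can still report a "good" PPL.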
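The dilution effect can be illustrated with a minimal sketch. The token counts and probabilities below are hypothetical numbers chosen for illustration, not measurements from the §A13 corpus, and `perplexity` is an illustrative helper rather than part of any harness.

```python
import math


def perplexity(log_probs):
    """Perplexity = exp of the mean negative log-probability (natural log)."""
    return math.exp(-sum(log_probs) / len(log_probs))


# Hypothetical sequence: 95 filler tokens predicted confidently (p = 0.9)
# and 5 conjugation-critical tokens predicted badly (p = 0.05).
filler = [math.log(0.9)] * 95
critical = [math.log(0.05)] * 5

overall_ppl = perplexity(filler + critical)
critical_ppl = perplexity(critical)

print(f"overall PPL:  {overall_ppl:.2f}")   # about 1.28
print(f"critical PPL: {critical_ppl:.2f}")  # 20.00
```

The overall figure looks healthy even though the model is close to guessing on the tokens that matter, which is why a PPL restricted to the critical tokens (or a task metric such as CCR) is reported alongside it.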
q-20-02 — Train-as-eval failure
Prompt (EN): A learner reports CCR-irregular = 87% on the §A13 grammar tutor. Investigation shows the eval was run on data/train/train.jsonl. What conclusion is appropriate?
- A. The model has generalized excellently to irregular conjugation.
- B. The reported number is a measure of memorization, not generalization, and is uninformative for ship/no-ship decisions.
- C. The model overfitted to held-out data.
- D. The eval harness is broken and the model's true CCR is unknown.
Correct: B. Evaluating on the training set measures memorization. The eval harness still functions correctly, which rules out D; the number is meaningful in a narrow sense (memorization capacity), just not for the deployment decision.
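One cheap guard against this failure is to measure how many eval records also appear verbatim in train before trusting any score. The sketch below works on in-memory lines; `exact_overlap_fraction` and the sample records are hypothetical, and the real record schema of train.jsonl is not assumed.

```python
def exact_overlap_fraction(train_records, eval_records):
    """Return the fraction of eval records that appear verbatim in train."""
    if not eval_records:
        return 0.0
    train_set = {record.strip() for record in train_records}
    hits = sum(1 for record in eval_records if record.strip() in train_set)
    return hits / len(eval_records)


# Hypothetical JSONL lines, for illustration only.
train = ['{"text": "yo tengo"}', '{"text": "ella va"}']
eval_on_train = list(train)                 # the mistake from the prompt
held_out = ['{"text": "nosotros somos"}']   # a disjoint eval set

train_as_eval_overlap = exact_overlap_fraction(train, eval_on_train)
held_out_overlap = exact_overlap_fraction(train, held_out)
```

An overlap of 1.0 means the reported CCR is a memorization score. An overlap of 0.0 only rules out verbatim duplication, not subtler leaks.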
q-20-03 — Companion metrics for PPL
Prompt (EN): Select every metric that is a useful companion to PPL for the §A13 grammar tutor.
- A. Conjugation-correctness rate (CCR) sliced by regularity.
- B. Token-level conditional accuracy restricted to the conjugation-critical alphabet.
- C. Bilingual alignment accuracy.
- D. Wall-clock training time.
Correct: A, B, C. Wall-clock time is an operational metric, not a quality one; it is useful for budgeting, not for evaluating model output.
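As a rough sketch of option A, CCR can be computed per regularity slice so that a weak irregular slice is not hidden by a strong regular one. The record fields (`regularity`, `expected`, `predicted`) and the sample data are assumptions made for illustration; the actual harness may store results differently.

```python
from collections import defaultdict


def ccr_by_slice(records):
    """Conjugation-correctness rate per regularity slice (exact match)."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        key = record["regularity"]
        totals[key] += 1
        correct[key] += int(record["predicted"] == record["expected"])
    return {key: correct[key] / totals[key] for key in totals}


# Hypothetical eval results, for illustration only.
results = [
    {"regularity": "regular", "expected": "hablo", "predicted": "hablo"},
    {"regularity": "regular", "expected": "comes", "predicted": "comes"},
    {"regularity": "irregular", "expected": "tengo", "predicted": "tengo"},
    {"regularity": "irregular", "expected": "voy", "predicted": "iro"},
]

ccr = ccr_by_slice(results)
```

Here the pooled rate would be 75%, while the sliced view shows the regular slice at 100% and the irregular slice at 50%.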
q-20-04 — Probe leak
Prompt (EN): In one or two sentences, describe a probe-level data leak that is not caught by simply ensuring train.jsonl and val.jsonl are disjoint files.
Free response. Expected mentions: prompt prefix in train; surface text overlap; verb form previously seen; probe construction must avoid prefixes that appear in train.
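A probe-level check along these lines might look like the sketch below, which flags any probe whose prompt prefix occurs inside a training text even when the files themselves are disjoint. The function name `leaked_probes` and the sample strings are hypothetical, and case-insensitive substring matching is just one possible definition of surface overlap.

```python
def leaked_probes(probe_prefixes, train_texts):
    """Return probe prefixes that occur as substrings of any train text."""
    normalized_train = [text.lower() for text in train_texts]
    return [
        prefix
        for prefix in probe_prefixes
        if any(prefix.lower() in text for text in normalized_train)
    ]


# Hypothetical data, for illustration only.
train_texts = ["Ayer yo tuve que salir temprano.", "Ella va al mercado."]
probes = ["Ayer yo", "Mañana nosotros"]

leaks = leaked_probes(probes, train_texts)
```

A non-empty result suggests the probe can be completed from memory of the training text, so the probe should be rebuilt with a prefix that does not appear in train.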
q-20-05 — Calibration vs accuracy
Prompt (EN): A model reports CCR = 80% but the average confidence (softmax max-prob) on its predictions is 99%. What kind of calibration does this indicate?
- A. Well-calibrated.
- B. Under-confident.
- C. Over-confident.
- D. Cannot determine from this data.
Correct: C. Over-confident: the model assigns 99% to its argmax but is only correct 80% of the time. The gap between mean confidence and accuracy is about 19 percentage points, so the expected calibration error is large. Phase 20's 02-calibration-and-adversarial.md covers the mitigation.
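The size of that gap can be checked with a small expected calibration error (ECE) sketch. This uses equal-width confidence bins, which is one common formulation and not necessarily the one used in 02-calibration-and-adversarial.md; the ten predictions below are hypothetical and simply mirror the numbers in the prompt.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: weighted mean of |accuracy - confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(accuracy - avg_conf)
    return ece


# Hypothetical: 10 predictions at 99% confidence, 8 of them correct.
confidences = [0.99] * 10
correct = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

ece = expected_calibration_error(confidences, correct)
```

With every prediction in the same bin, the ECE reduces to the plain gap between mean confidence (0.99) and accuracy (0.80).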