01 — Metrics catalog: what each metric measures, what it doesn't
Each metric answers a different question. Knowing the question is what keeps you from reporting high accuracy on a leaked test set.
Perplexity (PPL)
Question answered: how well does the model predict the next token under the data distribution?
Formula: \(\text{PPL} = \exp(\bar{\mathcal{L}}) = \exp(-\frac{1}{N}\sum_i \log p_\theta(y_i \mid x_{<i}))\) on a held-out set.
Interpretation: the "effective branching factor" the model is uncertain over. PPL = 7 means the model is, on average, as uncertain as if it were choosing among 7 equally likely next tokens. Lower is better; the lower bound is 1 (perfect prediction).
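As a minimal sketch of the formula above, assuming the per-token natural-log probabilities have already been collected from the model on a held-out set (the function name `perplexity` and the toy input are illustrative, not part of the Phase 20 harness):

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities log p(y_i | x_<i)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    # Mean negative log-likelihood, i.e. the average cross-entropy loss.
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# Toy check of the "branching factor" reading: a model that assigns
# probability 1/7 to every observed token has perplexity of about 7.
uniform_7 = [math.log(1 / 7)] * 50
print(perplexity(uniform_7))

# A model that assigns probability 1 to every observed token hits the lower bound.
print(perplexity([0.0, 0.0, 0.0]))
```

Running the same function separately on the EN and ES subsets of the held-out tokens gives the per-language split.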
Strength: cheap, comparable across runs, well-understood.
Weakness: measures the proxy (CE loss), not the deployment task. A model that's good at predicting articles and pronouns but bad at deciding whether "She work" is correct can still have low PPL.
When to report: always. PPL is the lingua franca; every eval report includes it. Report on train, val, and test splits separately. Also report PPL split by language (EN vs ES) — a model that's much worse on Spanish has a real bilingual deficit that the aggregate would hide.
When to NOT make decisions based on PPL alone: when the downstream task is classification, RAG, or agent-driven action. In those cases, PPL is a sanity check, not a verdict.
Per-slice classification accuracy
Question answered: for each grammatical slice — verb, tense, person, language, regularity — can the model correctly classify a sentence's verb form as correct / incorrect / ambiguous?
Formula: \(\text{Acc}(s) = \frac{1}{|D_s|}\sum_{(x,y) \in D_s} \mathbb{1}[\hat y = y]\)
Where \(D_s\) is the slice of probes matching the slicing criterion \(s\). The slicing dimensions used in Phase 20:
- By verb (20 cells): `work, play, walk, talk, listen, watch, study, finish, start, look, want, like, be, have, do, go, come, see, eat, write`.
- By tense (5 cells): infinitive, present simple, past simple, past participle, future (`will/going to`).
- By person (3 cells): 1sg `I`, 2sg `you`, 3sg `he/she/it`.
- By language (2 cells): EN, ES.
- By regularity (2 cells): regular (12 verbs), irregular (8 verbs).
Interpretation: per-slice accuracy tells you which combinations the model knows and which it doesn't. The two highest-leverage slices are (a) regularity (does the model handle irregular forms?) and (b) tense × person (does present-simple 3sg -s work? does past-participle work?).
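The slicing formula can be sketched as a group-by over probe records. The record layout (`regularity`, `label`, `pred` keys) and the four toy probes are assumptions made for illustration; the real probe schema may differ.

```python
from collections import defaultdict

def accuracy_by_slice(probes, key):
    """Return {slice_value: accuracy}, grouping probes by probe[key]."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for p in probes:
        totals[p[key]] += 1
        hits[p[key]] += int(p["pred"] == p["label"])
    return {s: hits[s] / totals[s] for s in totals}

# Hypothetical probe records; field names are illustrative.
probes = [
    {"verb": "work", "regularity": "regular", "label": "correct", "pred": "correct"},
    {"verb": "play", "regularity": "regular", "label": "incorrect", "pred": "incorrect"},
    {"verb": "go", "regularity": "irregular", "label": "correct", "pred": "incorrect"},
    {"verb": "eat", "regularity": "irregular", "label": "incorrect", "pred": "incorrect"},
]

by_regularity = accuracy_by_slice(probes, "regularity")
aggregate = sum(p["pred"] == p["label"] for p in probes) / len(probes)

print(by_regularity)  # {'regular': 1.0, 'irregular': 0.5}
print(aggregate)      # 0.75
```

The same function works for any slicing dimension by changing `key`; a tense × person slice could use a tuple-valued field.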
How the model is queried for classification: the model is a language model, not a classifier. To get a class prediction:
- Constrained-prompt method: prepend a fixed prompt like `// Task: classify the grammaticality of the verb form below. Categories: CORRECT | INCORRECT | AMBIGUOUS.\nSentence: <text>\nClassification:`, then restrict the next-token distribution to the constrained alphabet `{CORRECT, INCORRECT, AMBIGUOUS}`. The token with the highest probability is the prediction.
- Multiple-choice method: for a sentence with a verb-form blank, score each candidate (e.g., `work / works / worked / working`) by its conditional likelihood under the model; pick the argmax.
- Confidence: the softmax probability assigned to the predicted class.
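A rough sketch of the two query methods and the confidence computation, assuming the model's raw scores are already in hand. It assumes each label corresponds to a single next-token logit and that each multiple-choice candidate has a summed log-likelihood; both the function names and the example numbers are hypothetical.

```python
import math

LABELS = ("CORRECT", "INCORRECT", "AMBIGUOUS")

def classify_from_label_logits(label_logits):
    """Softmax restricted to the three label tokens.

    Returns (predicted_label, confidence), where confidence is the
    renormalized probability of the predicted label.
    """
    m = max(label_logits[l] for l in LABELS)  # subtract max for numerical stability
    exps = {l: math.exp(label_logits[l] - m) for l in LABELS}
    z = sum(exps.values())
    probs = {l: e / z for l, e in exps.items()}
    pred = max(probs, key=probs.get)
    return pred, probs[pred]

def pick_candidate(candidate_logprobs):
    """Multiple-choice method: argmax over candidate log-likelihoods."""
    return max(candidate_logprobs, key=candidate_logprobs.get)

# Made-up logits for the token following "Classification:".
pred, conf = classify_from_label_logits(
    {"CORRECT": 2.0, "INCORRECT": 0.5, "AMBIGUOUS": -1.0}
)
print(pred, round(conf, 3))

# Made-up log-likelihoods for the blank in "She ___ every day".
best = pick_candidate({"work": -4.1, "works": -1.3, "worked": -3.0, "working": -5.2})
print(best)  # works
```

Renormalizing over only the three labels means the confidence ignores any probability mass the model puts on other tokens, which is worth keeping in mind when reading calibration numbers.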
Strength: directly task-aligned, granular by slice.
Weakness: depends on the prompt format. If you change the prompt mid-evaluation, results aren't comparable. Pin the prompt in data/eval/probe_prompt.txt.
When to report: always. Per-slice tables and bar charts are the most-read panels of the eval report.
Aggregate classification accuracy
Question answered: overall, what fraction of probes does the model classify correctly?
Formula: \(\text{Acc} = \frac{1}{|D|}\sum_{(x,y) \in D} \mathbb{1}[\hat y = y]\) (no slicing).
Interpretation: a single number summarizing the per-slice set.
Strength: easy to compare across runs.
Weakness: hides slice disparities. A model with 90% on EN-regular-present and 30% on ES-irregular-past lands at an aggregate of about 70% when the first slice is twice the size of the second, a number that describes neither slice. Always report alongside per-slice tables and the confusion matrix.
When to report: always, but never alone.
Confusion matrix
Question answered: when the model is wrong, what is it confusing for what?
Formula: a 3×3 matrix where row = true label, column = predicted label. Entries are counts. Classes: correct, incorrect, ambiguous.
Interpretation: the matrix diagonal is correct; off-diagonal entries reveal systematic confusions. For example, if the (true=correct, predicted=incorrect) entry is high on the irregular slice, the model is over-applying the regular -ed rule. If the (true=incorrect, predicted=correct) entry is high on the wrong-person slice, the model isn't enforcing 3sg -s.
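For concreteness, a small sketch of how such a matrix could be built from gold and predicted labels. The class order and the five toy examples are illustrative assumptions.

```python
CLASSES = ("correct", "incorrect", "ambiguous")

def confusion_matrix(labels, preds):
    """3x3 count matrix: row index = true label, column index = predicted label."""
    idx = {c: i for i, c in enumerate(CLASSES)}
    matrix = [[0] * len(CLASSES) for _ in CLASSES]
    for y, y_hat in zip(labels, preds):
        matrix[idx[y]][idx[y_hat]] += 1
    return matrix

labels = ["correct", "correct", "incorrect", "incorrect", "ambiguous"]
preds = ["correct", "incorrect", "incorrect", "correct", "ambiguous"]
cm = confusion_matrix(labels, preds)

for name, row in zip(CLASSES, cm):
    print(f"{name:>10}: {row}")
# cm[0][1] counts (true=correct, predicted=incorrect);
# cm[1][0] counts (true=incorrect, predicted=correct).
```

Building one matrix per slice (for example, irregular verbs only) is what makes the directional readings above possible.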
Strength: shows the direction of errors, not just their count.
Weakness: doesn't scale to many classes. Three classes is comfortable; per-verb (20) or per-(tense, person) (15) would be unreadable. Use slicing tables instead.
When to report: always. Embed as a small table or a small PNG.
Pass@k (for generation sub-eval)
Question answered: if the model generates \(k\) samples completing a verb-form blank, what is the probability that at least one is the correct conjugation?
Formula: an unbiased estimator from \(n \ge k\) samples with \(c\) correct:
\[\widehat{\text{pass@}k} = 1 - \binom{n - c}{k} / \binom{n}{k}\]
For \(k=1\): \(\widehat{\text{pass@}1} = c/n\) (just accuracy). For \(k=10\): probability that any of 10 samples is correct.
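The estimator translates almost directly into code. This is a sketch using Python's standard `math.comb`; the function name `pass_at_k` and the example counts are illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not (0 <= c <= n):
        raise ValueError("c must satisfy 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples for one prompt, 3 of them the correct conjugation.
print(pass_at_k(10, 3, 1))   # equals c/n up to floating-point rounding
print(pass_at_k(10, 3, 5))
print(pass_at_k(10, 3, 10))  # 1.0: any draw of all 10 samples contains a correct one
```

The per-prompt estimates are then averaged over the prompts in the generation sub-eval.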
Interpretation: at temperature T=0.7, the model's "best-of-k" performance. Used in code-generation literature because for generative tasks, occasionally-correct is still useful.
Strength: measures the model's diverse capability, not just its single best guess.
Weakness: for our task (picking the right verb form), pass@k with k > 1 mostly tells you that the right answer appeared somewhere among the model's k samples. The Phase 32 tutor wants pass@1 to be high (one confident, correct suggestion); pass@10 is more of a "did it know the answer at all" signal.
When to report: when the eval includes a generation component (free-form completion of a verb-form blank). Phase 20's generation sub-eval is small (10 prompts), so pass@k is a sanity check, not the headline.
Expected Calibration Error (ECE)
Question answered: when the model says "x% confident this verb form is correct", is it right x% of the time?
Formula:
\[\text{ECE} = \sum_{m=1}^M \frac{|B_m|}{N} \left| \text{acc}(B_m) - \text{conf}(B_m) \right|\]
where \(B_m\) is the bin of predictions with confidence in \(((m-1)/M,\, m/M]\), \(\text{acc}(B_m)\) is the empirical accuracy of predictions in that bin, and \(\text{conf}(B_m)\) is the average confidence in that bin.
Default \(M = 10\).
Interpretation: ECE = 0 means perfect calibration. ECE = 0.1 means, on average, the model's confidence is off by 10 percentage points. ECE = 0.3 is poorly calibrated.
Companion visualization: the reliability diagram, plotting each bin's average confidence (x) against its empirical accuracy (y). The diagonal is perfect calibration; points below it mean over-confident, points above it mean under-confident.
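A minimal sketch of the ECE computation with equal-width bins. The tie-breaking at bin edges (a boundary value goes to the lower bin) is a choice made for this sketch, and the toy confidences are made up.

```python
import math

def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE with equal-width bins over [0, 1].

    confidences: predicted-class probabilities in [0, 1]
    correct:     1 if the prediction was right, else 0
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, hit in zip(confidences, correct):
        # Bins are treated as half-open ((m-1)/M, m/M]; confidence 0.0
        # is clamped into the first bin.
        m = min(num_bins - 1, max(0, math.ceil(conf * num_bins) - 1))
        bins[m].append((conf, hit))
    ece = 0.0
    for b in bins:
        if not b:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(h for _, h in b) / len(b)
        ece += (len(b) / n) * abs(acc - avg_conf)
    return ece

# Toy data: four predictions at 0.95 (three right), two at 0.65 (one right).
confs = [0.95, 0.95, 0.95, 0.95, 0.65, 0.65]
hits = [1, 1, 1, 0, 1, 0]
ece = expected_calibration_error(confs, hits)
print(round(ece, 4))  # 0.1833
```

In both occupied bins the accuracy falls below the average confidence, so this toy model would sit below the diagonal of the reliability diagram (over-confident).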
Strength: measures a property pure accuracy cannot.
Weakness: sensitive to bin choice. With few samples per bin, ECE estimates are noisy. Phase 20's 60-100 probes leave each of the 10 bins with 6-10 examples on average, which is borderline; report alongside Brier as a second opinion.
When to report: always.
Brier score
Question answered: how close to the true label-as-probability is the model's confidence?
Formula: \(\text{Brier} = \frac{1}{N}\sum_i (p_i - y_i)^2\) where \(p_i \in [0,1]\) is the predicted probability of the positive class and \(y_i \in \{0, 1\}\) is the ground truth.
Interpretation: lower is better. Decomposes into a calibration term (close to ECE) and a refinement term (how sharp the distribution is). Bin-free.
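The binary formula above is short enough to sketch directly. The function name and the example probabilities are illustrative; how the three-way label is reduced to a binary outcome is left as in the formula's definition.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

probs = [0.9, 0.8, 0.3, 0.6]
outcomes = [1, 1, 0, 0]
score = brier_score(probs, outcomes)
print(round(score, 3))  # 0.125

# Reference point: always predicting 0.5 scores 0.25 on any binary data,
# so a score near or above 0.25 means the confidences carry little information.
baseline = brier_score([0.5] * 4, outcomes)
print(baseline)  # 0.25
```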
Strength: bin-free, so robust to small sample sizes.
Weakness: less intuitive than ECE. Reported alongside ECE, not instead of.
When to report: always.
Adversarial slice score
Question answered: how does the model handle the known-hard grammatical cases?
Formula: classification accuracy restricted to \(D_\text{adv}\), the adversarial probe subset, broken down by trick category.
Trick categories (per theory/00-motivation.md):
- Over-regularization: `goed`, `eated`, `writed`. Tests whether the model has memorized irregular forms or just over-applied the regular rule.
- Wrong-person agreement: `She work`, `He go`. Tests 3sg `-s` enforcement.
- Wrong-tense for time marker: `Yesterday I work` (should be `worked`). Tests temporal-context tracking.
- Auxiliary mismatch: `She have eat` (should be `She has eaten`). Tests perfect-aspect chain.
- EN↔ES form mismatch: English prompt, Spanish candidate. Tests language separation.
- Plural / out-of-scope: `We work`. Per §A13, plurals are out of scope. A probe asking the model to classify this is unfair, but the response tells us whether the model fails gracefully (predicts ambiguous) or is confidently wrong.
Interpretation: a model with 90% accuracy on the clean set and 50% on adversarial has the typical failure profile — fine on average, broken on edge cases. The tutor agent (Phase 32) will surface those edge cases; this score predicts how often.
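One possible way to tabulate this score, sketched under assumptions: the category keys (`over_regularization`, `out_of_scope`, and so on), the record layout, and the choice to give out-of-scope probes no gold label and instead measure how often the model answers "ambiguous" are all hypothetical, not taken from the Phase 20 harness.

```python
from collections import Counter

def accuracy_by_trick(probes):
    """Accuracy per trick category, skipping probes with no gold label."""
    totals, hits = Counter(), Counter()
    for p in probes:
        if p["label"] is None:
            continue
        totals[p["trick"]] += 1
        hits[p["trick"]] += int(p["pred"] == p["label"])
    return {t: hits[t] / totals[t] for t in totals}

def graceful_failure_rate(probes, category="out_of_scope"):
    """Share of out-of-scope probes on which the model predicts 'ambiguous'."""
    subset = [p for p in probes if p["trick"] == category]
    if not subset:
        return None
    return sum(p["pred"] == "ambiguous" for p in subset) / len(subset)

# Hypothetical adversarial probes and predictions.
adv = [
    {"trick": "over_regularization", "text": "I goed home", "label": "incorrect", "pred": "incorrect"},
    {"trick": "over_regularization", "text": "She eated", "label": "incorrect", "pred": "correct"},
    {"trick": "wrong_person", "text": "She work", "label": "incorrect", "pred": "incorrect"},
    {"trick": "out_of_scope", "text": "We work", "label": None, "pred": "ambiguous"},
    {"trick": "out_of_scope", "text": "We work", "label": None, "pred": "correct"},
]

per_trick = accuracy_by_trick(adv)
graceful = graceful_failure_rate(adv)
print(per_trick)  # {'over_regularization': 0.5, 'wrong_person': 1.0}
print(graceful)   # 0.5
```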
Strength: maps directly to deployment risk.
Weakness: adversarial probes are hand-crafted; the harness inherits the bias of whoever crafted them. We seed the construction so the adversarial set itself is reproducible, but its representativeness is a known limitation.
When to report: always. By trick category, not just aggregate.
What we DO NOT report (and why)
- F1 score — for 3-class classification, macro-F1 vs micro-F1 vs weighted-F1 has subtle interpretation differences. The combination of accuracy + confusion matrix + per-slice table conveys everything F1 would, more clearly.
- AUROC — needs binary classes. We have three. Plot per-class one-vs-rest if curious, but not a default.
- BLEU / chrF — would apply if the task were free-form translation. The §A13 scope is much narrower (pick the right form), so per-slice accuracy is the right metric.
- Loss landscape visualizations — not actionable.
- Attention heatmaps — those are Phase-15 / Phase-19 artifacts, not Phase-20 metrics.
One-paragraph recap
Eight metrics total, falling into four buckets: PPL (proxy); per-slice accuracy, aggregate accuracy, confusion matrix, and pass@k (task); ECE and Brier, with the reliability diagram as ECE's companion plot (calibration); adversarial-slice score (worst-case). Every eval REPORT.md has all eight. No metric stands alone; the interpretation paragraph at the end synthesizes them into a sentence about what the model can and can't do.
Next: theory/02-metrics-math.md.