01 - Metrics catalog: what each metric measures, what it doesn't

Each metric answers a different question. Knowing the question is what keeps you from reporting high accuracy on a leaked test set.

Perplexity (PPL)

Question answered: how well does the model predict the next token under the data distribution?

Formula: $\text{PPL} = \exp(\bar{\mathcal{L}}) = \exp(-\frac{1}{N}\sum_i \log p_\theta(y_i \mid x_{<i}))$ on a held-out set.

Interpretation: the "effective branching factor", i.e. the number of equally likely options the model is effectively choosing among. PPL = 7 means the model is, on average, as uncertain as if it were choosing uniformly among 7 next tokens. Lower is better; the lower bound is 1 (perfect prediction).
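The formula and the branching-factor reading can be checked with a minimal sketch. It assumes the per-token natural-log probabilities have already been extracted from the model; the function name `perplexity` and the toy input are illustrative, not part of the harness.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities log p(y_i | x_<i)."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    # Mean negative log-likelihood is the cross-entropy loss in nats.
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# A model that spreads probability uniformly over 7 tokens at every
# position should come out at a perplexity of 7.
uniform_7 = [math.log(1 / 7)] * 50
print(round(perplexity(uniform_7), 6))

# A model that assigns probability 1 to every observed token hits the lower bound.
print(perplexity([0.0, 0.0, 0.0]))
```

The same function can be called once per split or per language by filtering the token log-probabilities before passing them in.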

Strength: cheap, comparable across runs, well-understood.

Weakness: measures the proxy (cross-entropy loss), not the deployment task. A model that is good at predicting articles and pronouns but bad at deciding whether "She work" is correct can still have low PPL.

When to report: always. PPL is the lingua franca; every eval report includes it. Report on train, val, and test splits separately. Also report PPL split by language (EN vs ES) — a model that's much worse on Spanish has a real bilingual deficit that the aggregate would hide.

When to NOT make decisions based on PPL alone: when the downstream task is classification, RAG, or agent-driven action. In those cases, PPL is a sanity check, not a verdict.

Per-slice classification accuracy

Question answered: for each grammatical slice — verb, tense, person, language, regularity — can the model correctly classify a sentence's verb form as correct / incorrect / ambiguous?

Formula: $\text{Acc}(s) = \frac{1}{|D_s|}\sum_{(x,y) \in D_s} \mathbb{1}[\hat y = y]$

Where $D_s$ is the slice of probes matching the slicing criterion $s$. The slicing dimensions used in Phase 20:

By verb (20 cells): work, play, walk, talk, listen, watch, study, finish, start, look, want, like, be, have, do, go, come, see, eat, write.
By tense (5 cells): infinitive, present simple, past simple, past participle, future (will / going to).
By person (3 cells): 1sg I, 2sg you, 3sg he/she/it.
By language (2 cells): EN, ES.
By regularity (2 cells): regular (12 verbs), irregular (8 verbs).

Interpretation: per-slice accuracy tells you which combinations the model knows and which it doesn't. The two highest-leverage slices are (a) regularity (does the model handle irregular forms?) and (b) tense × person (does present-simple 3sg -s work? does past-participle work?).
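As a rough sketch of $\text{Acc}(s)$, the function below groups probes by one slicing dimension and computes accuracy per cell. The record layout (dicts with `gold`, `pred`, and one key per slicing dimension) is an assumption made for illustration; the actual probe schema may differ.

```python
from collections import defaultdict

def accuracy_by_slice(records, key):
    """Return {slice_value: accuracy} for one slicing dimension."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        hits[r[key]] += int(r["pred"] == r["gold"])
    return {s: hits[s] / totals[s] for s in totals}

# Hypothetical probes, for illustration only.
probes = [
    {"verb": "work", "regularity": "regular", "gold": "correct", "pred": "correct"},
    {"verb": "play", "regularity": "regular", "gold": "incorrect", "pred": "incorrect"},
    {"verb": "go", "regularity": "irregular", "gold": "incorrect", "pred": "correct"},
    {"verb": "eat", "regularity": "irregular", "gold": "correct", "pred": "correct"},
]

print(accuracy_by_slice(probes, "regularity"))
```

Crossed slices such as tense × person can reuse the same function by storing a tuple-valued key on each record.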

How the model is queried for classification: the model is a language model, not a classifier. To get a class prediction:

Constrained-prompt method: prepend a fixed prompt like // Task: classify the grammaticality of the verb form below. Categories: CORRECT | INCORRECT | AMBIGUOUS.\nSentence: <text>\nClassification:, then read the next-token distribution restricted to the label set {CORRECT, INCORRECT, AMBIGUOUS}. The label with the highest probability is the prediction; no sampling is involved.
Multiple-choice method: for a sentence with a verb-form blank, score each candidate (e.g., work / works / worked / working) by its conditional likelihood under the model; pick argmax.
Confidence: the softmax probability assigned to the predicted class.
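The prediction and confidence steps above can be sketched as a softmax over just the label logits. This assumes each label is scored by a single next-token logit and that those logits have already been read from the model; how they are obtained (and how multi-token labels are handled) is outside this sketch, and `classify_from_label_logits` is a hypothetical helper name.

```python
import math

LABELS = ("CORRECT", "INCORRECT", "AMBIGUOUS")

def classify_from_label_logits(label_logits):
    """Softmax restricted to the label set; returns (prediction, confidence, probs)."""
    # Subtract the max logit for numerical stability before exponentiating.
    top = max(label_logits[label] for label in LABELS)
    exps = {label: math.exp(label_logits[label] - top) for label in LABELS}
    z = sum(exps.values())
    probs = {label: exps[label] / z for label in LABELS}
    pred = max(LABELS, key=lambda label: probs[label])
    return pred, probs[pred], probs

# Hypothetical logits for one probe.
pred, conf, probs = classify_from_label_logits(
    {"CORRECT": 2.0, "INCORRECT": 0.0, "AMBIGUOUS": 0.0}
)
print(pred, round(conf, 3))
```

Renormalizing over the three labels means the confidence is always at least 1/3, which matters when reading calibration plots.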

Strength: directly task-aligned, granular by slice.

Weakness: depends on the prompt format. If you change the prompt mid-evaluation, results aren't comparable. Pin the prompt in data/eval/probe_prompt.txt.

When to report: always. Per-slice tables and bar charts are the most-read panels of the eval report.

Aggregate classification accuracy

Question answered: overall, what fraction of probes does the model classify correctly?

Formula: $\text{Acc} = \frac{1}{|D|}\sum_{(x,y) \in D} \mathbb{1}[\hat y = y]$ (no slicing).

Interpretation: a single number summarizing the per-slice set.

Strength: easy to compare across runs.

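Weakness: hides slice disparities. A model with 90% on EN-regular-present and 30% on ES-irregular-past has an aggregate of 60% if the two slices are the same size, and about 70% if the easy slice is twice as large. Either way the single number hides the failing slice. Always report alongside per-slice tables and the confusion matrix.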
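A short sketch makes the dependence on slice sizes concrete. The probe counts are invented for illustration; only the 90% and 30% accuracies come from the example above.

```python
def aggregate_accuracy(slice_counts):
    """slice_counts: {slice_name: (n_correct, n_total)} -> pooled accuracy."""
    correct = sum(c for c, _ in slice_counts.values())
    total = sum(n for _, n in slice_counts.values())
    return correct / total

# Equal-sized slices: 18/20 = 90% and 6/20 = 30%.
balanced = {"EN-regular-present": (18, 20), "ES-irregular-past": (6, 20)}
# Easy slice twice as large: 36/40 = 90% and 6/20 = 30%.
skewed = {"EN-regular-present": (36, 40), "ES-irregular-past": (6, 20)}

print(aggregate_accuracy(balanced))  # 24 of 40 probes correct
print(aggregate_accuracy(skewed))    # 42 of 60 probes correct
```

The aggregate moves by ten points between the two cases even though neither slice changed, which is why it is never reported alone.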

When to report: always, but never alone.

Confusion matrix

Question answered: when the model is wrong, what is it confusing for what?

Formula: a 3×3 matrix where row = true label, column = predicted label. Entries are counts. Classes: correct, incorrect, ambiguous.

Interpretation: the diagonal entries count correct predictions; off-diagonal entries reveal systematic confusions. For example, if the (true=correct, predicted=incorrect) entry is high on the irregular slice, the model is rejecting valid irregular forms, which suggests it is over-applying the regular -ed rule. If the (true=incorrect, predicted=correct) entry is high on the wrong-person slice, the model isn't enforcing 3sg -s.
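A minimal sketch of building the 3×3 count matrix with rows as true labels and columns as predicted labels. The lowercase class strings and the toy labels are assumptions for illustration.

```python
CLASSES = ("correct", "incorrect", "ambiguous")

def confusion_matrix(gold, pred):
    """Rows are true labels, columns are predicted labels, entries are counts."""
    index = {c: i for i, c in enumerate(CLASSES)}
    matrix = [[0] * len(CLASSES) for _ in CLASSES]
    for g, p in zip(gold, pred):
        matrix[index[g]][index[p]] += 1
    return matrix

# Hypothetical labels for five probes.
gold = ["correct", "correct", "incorrect", "incorrect", "ambiguous"]
pred = ["correct", "incorrect", "correct", "incorrect", "ambiguous"]

for label, row in zip(CLASSES, confusion_matrix(gold, pred)):
    print(f"{label:>9}: {row}")
```

Running the function on a single slice (for example, only irregular verbs) gives the per-slice matrices that the interpretation above refers to.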

Strength: shows the direction of errors, not just their count.

Weakness: doesn't scale to many classes. Three classes is comfortable; per-verb (20) or per-(tense, person) (15) would be unreadable. Use slicing tables instead.

When to report: always. Embed as a small table or a small PNG.

Pass@k (for generation sub-eval)

Question answered: if the model generates $k$ samples completing a verb-form blank, what is the probability that at least one is the correct conjugation?

Formula: an unbiased estimator from $n \ge k$ samples with $c$ correct: $$\widehat{\text{pass@}k} = 1 - \binom{n - c}{k} / \binom{n}{k}$$

For $k=1$: $\widehat{\text{pass@}1} = c/n$ (just accuracy). For $k=10$: the probability that at least one of 10 samples is correct.
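The estimator translates almost directly into code. This sketch uses `math.comb` from the standard library; the sample counts in the example are made up.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of them correct (requires n >= k)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical prompt: 20 samples drawn, 5 are the correct conjugation.
print(pass_at_k(20, 5, 1))             # equals c / n
print(round(pass_at_k(20, 5, 10), 4))
```

In practice the estimate is computed per prompt and then averaged over prompts.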

Interpretation: at temperature T=0.7, the model's "best-of-k" performance. Used in code-generation literature because for generative tasks, occasionally-correct is still useful.

Strength: measures the model's diverse capability, not just its single best guess.

Weakness: for our task (picking the right verb form), pass@k with k > 1 mostly tells you that the right answer showed up somewhere among the model's k samples. The Phase 32 tutor wants pass@1 to be high (one confident, correct suggestion); pass@10 is more of a "did it know the answer at all" signal.

When to report: when the eval includes a generation component (free-form completion of a verb-form blank). Phase 20's generation sub-eval is small (10 prompts), so pass@k is a sanity check, not the headline.

Expected Calibration Error (ECE)

Question answered: when the model says "x% confident this verb form is correct", is it right x% of the time?

Formula: $$\text{ECE} = \sum_{m=1}^M \frac{|B_m|}{N} \left| \text{acc}(B_m) - \text{conf}(B_m) \right|$$

where $B_m$ is the bin of predictions with confidence in $((m-1)/M, m/M]$ (half-open, so each prediction falls in exactly one bin), $N$ is the total number of predictions, $\text{acc}(B_m)$ is the empirical accuracy of predictions in that bin, and $\text{conf}(B_m)$ is the average confidence in that bin.

Default $M = 10$.

Interpretation: ECE = 0 means perfect calibration. ECE = 0.1 means, on average, the model's confidence is off by 10 percentage points. ECE = 0.3 is poorly calibrated.

Companion visualization: the reliability diagram, plotting each bin's average confidence (x) against its empirical accuracy (y). The diagonal is perfect calibration; points below it mean the model is over-confident, points above mean under-confident.
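A minimal sketch of ECE that also returns the per-bin points a reliability diagram would plot. It assumes equal-width, half-open bins so that each prediction lands in exactly one bin; the function name and the toy inputs are illustrative.

```python
import math

def ece_and_bins(confidences, correct, n_bins=10):
    """Return (ECE, rows) where rows are (avg_confidence, accuracy, count) per non-empty bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Bin m (1-indexed) covers ((m-1)/M, m/M]; a confidence of 0 goes to the first bin.
        m = min(n_bins - 1, max(0, math.ceil(conf * n_bins) - 1))
        bins[m].append((conf, hit))
    ece, rows = 0.0, []
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        acc = sum(h for _, h in members) / len(members)
        ece += len(members) / n * abs(acc - avg_conf)
        rows.append((avg_conf, acc, len(members)))
    return ece, rows

# Hypothetical predictions: two at 0.95 confidence (both right),
# two at 0.65 confidence (one right).
ece, rows = ece_and_bins([0.95, 0.95, 0.65, 0.65], [1, 1, 1, 0])
print(round(ece, 3))
for avg_conf, acc, count in rows:
    side = "over-confident" if acc < avg_conf else "under-confident or calibrated"
    print(round(avg_conf, 2), round(acc, 2), count, side)
```

The `count` column is worth printing next to every bin, since bins with only a handful of predictions dominate the noise in the estimate.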

Strength: measures a property pure accuracy cannot.

Weakness: sensitive to bin choice. With few samples per bin, ECE estimates are noisy. Phase 20's 60-100 probes average only 6-10 examples per bin with $M = 10$, and fewer in bins where confidences are sparse. That is borderline, so report ECE alongside Brier as a second opinion.

When to report: always.

Brier score

Question answered: how close to the true label-as-probability is the model's confidence?

Formula: $\text{Brier} = \frac{1}{N}\sum_i (p_i - y_i)^2$ where $p_i \in [0,1]$ is the predicted probability of the positive class and $y_i \in \{0, 1\}$ is the ground truth.

Interpretation: lower is better. Decomposes into a calibration term (close to ECE) and a refinement term (how sharp the distribution is). Bin-free.
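A minimal sketch of the binary form given above, assuming $p_i$ is the predicted probability of the positive class and $y_i$ is 0 or 1. How the three-way label is reduced to a binary outcome is not specified here, so the toy inputs are purely illustrative.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and the same length")
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

# Hypothetical predictions: squared errors are 0.01, 0.36 and 0.04.
print(round(brier_score([0.9, 0.6, 0.2], [1, 0, 0]), 4))

# Reference point: always predicting 0.5 scores 0.25 regardless of the labels.
print(brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))
```

The 0.25 baseline is a useful anchor: a binary Brier score above it is worse than an uninformative constant guess.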

Strength: bin-free, so less sensitive to small sample sizes than ECE.

Weakness: less intuitive than ECE. Reported alongside ECE, not instead of.

When to report: always.

Adversarial slice score

Question answered: how does the model handle the known-hard grammatical cases?

Formula: classification accuracy restricted to $D_\text{adv}$, the adversarial probe subset, broken down by trick category.

Trick categories (per theory/00-motivation.md):

Over-regularization: goed, eated, writed. Tests whether the model has memorized irregular forms or just over-applied the regular rule.
Wrong-person agreement: She work, He go. Tests 3sg -s enforcement.
Wrong-tense for time marker: Yesterday I work (should be worked). Tests temporal-context tracking.
Auxiliary mismatch: She have eat (should be She has eaten). Tests perfect-aspect chain.
EN↔ES form mismatch: English prompt, Spanish candidate. Tests language separation.
Plural / out-of-scope: We work. Per §A13, plurals are out of scope. A probe asking the model to classify this is unfair, but the response tells us whether the model fails gracefully (predicts ambiguous) or is confidently wrong.

Interpretation: a model with 90% accuracy on the clean set and 50% on adversarial has the typical failure profile — fine on average, broken on edge cases. The tutor agent (Phase 32) will surface those edge cases; this score predicts how often.
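One way to report the adversarial set is accuracy per trick category, plus a separate "graceful failure" rate for the out-of-scope probes. The sketch below assumes each record carries a `trick` tag and that out-of-scope probes are excluded from accuracy because they have no fair gold label; both the tag names and that exclusion are assumptions for illustration, not the harness's confirmed behavior.

```python
from collections import Counter

OUT_OF_SCOPE = "plural_out_of_scope"  # hypothetical category tag

def adversarial_scores(records):
    """Accuracy per trick category, skipping the out-of-scope probes."""
    totals, hits = Counter(), Counter()
    for r in records:
        if r["trick"] == OUT_OF_SCOPE:
            continue
        totals[r["trick"]] += 1
        hits[r["trick"]] += int(r["pred"] == r["gold"])
    return {t: hits[t] / totals[t] for t in totals}

def graceful_failure_rate(records):
    """Share of out-of-scope probes that the model labels 'ambiguous'."""
    preds = [r["pred"] for r in records if r["trick"] == OUT_OF_SCOPE]
    return preds.count("ambiguous") / len(preds) if preds else float("nan")

# Hypothetical adversarial probes.
adv = [
    {"trick": "over_regularization", "gold": "incorrect", "pred": "incorrect"},
    {"trick": "over_regularization", "gold": "incorrect", "pred": "correct"},
    {"trick": "wrong_person", "gold": "incorrect", "pred": "incorrect"},
    {"trick": OUT_OF_SCOPE, "gold": None, "pred": "ambiguous"},
    {"trick": OUT_OF_SCOPE, "gold": None, "pred": "correct"},
]

print(adversarial_scores(adv))
print(graceful_failure_rate(adv))
```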

Strength: maps directly to deployment risk.

Weakness: adversarial probes are hand-crafted; the harness inherits the bias of whoever crafted them. We seed the construction so the adversarial set itself is reproducible, but its representativeness is a known limitation.

When to report: always. By trick category, not just aggregate.

What we DO NOT report (and why)

F1 score — for 3-class classification, macro-F1 vs micro-F1 vs weighted-F1 has subtle interpretation differences. The combination of accuracy + confusion matrix + per-slice table conveys everything F1 would, more clearly.
AUROC — needs binary classes. We have three. Plot per-class one-vs-rest if curious, but not a default.
BLEU / chrF — would apply if the task were free-form translation. The §A13 scope is much narrower (pick the right form), so per-slice accuracy is the right metric.
Loss landscape visualizations — not actionable.
Attention heatmaps — those are Phase-15 / Phase-19 artifacts, not Phase-20 metrics.

One-paragraph recap

Eight metrics total, falling into four buckets: PPL (proxy); per-slice accuracy, aggregate accuracy, the confusion matrix, and pass@k (task); ECE and Brier, with the reliability diagram as ECE's companion plot (calibration); adversarial-slice score (worst-case). Every eval REPORT.md has all eight. No metric stands alone; the interpretation paragraph at the end synthesizes them into a sentence about what the model can and can't do.

Next: theory/02-metrics-math.md.