
00 - Motivation: a single metric is a lie

A metric that is not the task you want to solve is noise. Phase 20 moves the model from "I minimize cross-entropy" to "I get correct-vs-incorrect right across the 20 verbs × 5 tenses × 3 persons, in English and Spanish, calibrated and robust to tricky cases".

Why perplexity isn't enough

Phase 18's training optimized cross-entropy. Phase 18's report says "val PPL 7.2, beat baseline 41.6." That's a real result. It is not, however, the result Borja built MiniGPT to produce.

The actual deployment task (per LYNX_CORTEX_ADDENDUM.md §A13) is English verb grammar tutoring: given an English sentence with a verb in it, decide whether the conjugation is correct for the sentence's person and tense, and (in Phase 32) propose the right form. A model with low perplexity can still fail any of those sub-tasks if its loss-floor improvements came from confidently predicting common tokens (the, she, I, sentence-final periods) and not from learning the verb-form distinctions.

This is the proxy-metric trap: optimizing X because X is differentiable, then declaring success without checking whether X aligns with Y, the actual goal.

For a language model on a tiny corpus:

Perplexity says: "can the model predict the next token."
Grammar accuracy says: "does the model produce the right verb form for the right person and tense."

These are correlated but not identical. A model can have perplexity 7 and grammar accuracy of 50%, which is chance level on a balanced correct-vs-incorrect task. The reverse (high grammar accuracy at high perplexity) is unlikely, but the correlation isn't 1.0.
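A toy calculation makes the gap concrete. The sketch below uses made-up per-token probabilities (not output from MiniGPT) for a hypothetical 10-token sentence: the model is confident on nine common tokens and flips a coin on the one verb slot. Perplexity is computed as the exponential of the mean negative log-likelihood.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood of the true tokens)."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical numbers: nine "easy" tokens predicted with probability 0.9,
# and one verb slot where the model splits 50/50 between "work" and "works"
# (that is, it has learned nothing about agreement).
easy_tokens = [0.9] * 9
verb_token = [0.5]

ppl = perplexity(easy_tokens + verb_token)
verb_accuracy = 0.5  # a coin flip between the two candidate forms

print(f"perplexity: {ppl:.2f}, verb-form accuracy: {verb_accuracy:.0%}")
```

The perplexity stays well under 2 because the nine easy tokens dominate the average, while accuracy on the one token the task cares about is at chance.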

The four metric buckets this phase ships

LYNX_CORTEX.md §4 PHASE 20 names: perplexity, task accuracy, pass@k, calibration, adversarial eval. Phase 20 builds all five (treating "pass@k" and "task accuracy" as one bucket since this curriculum's task is classification of a finite set of candidate forms, not free-form generation).

Conceptually, they break into four:

Perplexity — language-modeling proxy. Cheap, comparable across runs, but indirect for our task.
Per-slice classification accuracy — the task metric. For each slice (verb, tense, person, language, regularity): given a sentence with a verb form, classify as correct / incorrect / ambiguous. Per-slice reporting is essential — overall accuracy can hide a model that's great on regular present-simple and useless on irregular past tense.
Calibration — when the model says "90% confident this is correct", is it right 90% of the time? Over-confidence is a downstream failure mode that pure accuracy doesn't measure, and it matters because Phase 32's tutor agent acts on the model's confidence.
Adversarial slice: the hard examples (over-regularization like goed, wrong-person like She work, EN↔ES form mismatch). A model that scores 99% on easy cases and 50% on hard ones is a 75% model in the wild only if real inputs are an even mix of the two; the true figure depends on that mix, so report both.
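As a minimal sketch of why per-slice reporting matters, the snippet below groups classification results by one slice tag and sorts the weakest slice first. The record fields (`tense`, `gold`, `predicted`) and the data are illustrative assumptions, not the harness's actual schema or real model output.

```python
from collections import defaultdict

def accuracy_by_slice(results, key):
    """Group results by one slice tag; return (slice, accuracy, n), weakest first."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r[key]] += 1
        hits[r[key]] += int(r["predicted"] == r["gold"])
    rows = [(s, hits[s] / totals[s], totals[s]) for s in totals]
    return sorted(rows, key=lambda row: row[1])

# Made-up results for illustration only.
results = [
    {"tense": "present-simple", "gold": "correct", "predicted": "correct"},
    {"tense": "present-simple", "gold": "incorrect", "predicted": "incorrect"},
    {"tense": "present-simple", "gold": "correct", "predicted": "correct"},
    {"tense": "past-simple", "gold": "incorrect", "predicted": "correct"},
]

overall = sum(r["predicted"] == r["gold"] for r in results) / len(results)
by_tense = accuracy_by_slice(results, "tense")

print(f"overall accuracy: {overall:.2f}")
for name, acc, n in by_tense:
    print(f"{name:15s} acc={acc:.2f} n={n}")
```

Here the overall number looks acceptable, but the past-simple slice is entirely wrong; the aggregate hides it because that slice has only one example.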

The eval REPORT.md is structured so you can read each metric in 5 seconds and the integrated story in 1 minute.

Why labeled probes (and not just a held-out test set)

Phase 12's corpus has paired sentences (English ↔ Spanish) with grammatical metadata. Why don't we just compute accuracy on the held-out test split?

Because the held-out test set is a random sample of the corpus distribution, and the corpus distribution is what gen_corpus.py produced. If the generator emits a lot of present-simple-3sg work sentences, the test set is mostly those, and overall accuracy is dominated by them. The model could be ignorant of past-participle written and the aggregate number wouldn't move.

A probe set is engineered to cover the task uniformly:

Equal counts per (tense, person) cell (so per-cell accuracy is comparable).
Equal balance of regular vs irregular verbs (the regular/irregular ratio in natural language is skewed; the probe set isn't).
Equal counts in English and Spanish (so neither language dominates the aggregate).
Inclusion of known-tricky cases that the random generator might not produce by chance.

Probe sets are common in the research literature (HumanEval, MMLU, BIG-bench, and BLiMP for grammaticality). For Phase 20, we hand-write ~60 probes that sample the 20 verbs × 5 tenses × 3 persons grid (300 cells, so coverage is sampled, not exhaustive), with explicit slice tagging.
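The balance requirements above can be checked mechanically. The sketch below assumes a hypothetical JSONL probe format with slice tags; the field names (`sentence`, `label`, `lang`, `verb`, `tense`, `person`, `regular`) are illustrative and may not match the project's real probe files.

```python
import json
from collections import Counter

# Hypothetical probe records; the schema is an assumption for illustration.
PROBES_JSONL = """\
{"sentence": "She works hard.", "label": "correct", "lang": "en", "verb": "work", "tense": "present-simple", "person": "3sg", "regular": true}
{"sentence": "I works hard.", "label": "incorrect", "lang": "en", "verb": "work", "tense": "present-simple", "person": "1sg", "regular": true}
{"sentence": "Ella trabaja mucho.", "label": "correct", "lang": "es", "verb": "trabajar", "tense": "present-simple", "person": "3sg", "regular": true}
{"sentence": "Yo trabaja mucho.", "label": "incorrect", "lang": "es", "verb": "trabajar", "tense": "present-simple", "person": "1sg", "regular": true}
"""

probes = [json.loads(line) for line in PROBES_JSONL.splitlines() if line.strip()]

def cell_counts(probes, keys):
    """Count probes per combination of the given slice tags."""
    return Counter(tuple(p[k] for k in keys) for p in probes)

def is_balanced(counts):
    """True if every observed cell has the same count (missing cells are not detected)."""
    return len(set(counts.values())) <= 1

by_cell = cell_counts(probes, ("tense", "person"))
by_lang = cell_counts(probes, ("lang",))

print("per (tense, person):", dict(by_cell), "balanced:", is_balanced(by_cell))
print("per language:", dict(by_lang), "balanced:", is_balanced(by_lang))
```

Note that this check only compares cells that appear at least once. Detecting cells that are missing entirely would need the full list of expected (tense, person) combinations.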

Why calibration matters separately

A model that's right 80% of the time can be:

Well-calibrated: when it says 80% confident, it's right 80%. When it says 99%, it's right 99%.
Over-confident: when it says 99%, it's actually only right 80%. Dangerous when downstream code trusts the model's confidence.
Under-confident: when it says 60%, it's actually right 80%. Wastes information.

For a grammar tutor agent (Phase 32), calibration is the deciding property when the model's output triggers a correction shown to a learner. A confident-wrong model that says "She works is incorrect — use She work" with 95% confidence is the worst possible failure mode: it actively misteaches.

Phase 20 measures calibration with Expected Calibration Error (ECE) and reliability diagrams. Both are derived in theory/02-metrics-math.md.
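For orientation, here is a minimal sketch of the usual binned ECE computation: bucket predictions by confidence, then take the sample-weighted average gap between accuracy and mean confidence in each bucket. The equal-width binning and the default of 10 bins are assumptions; the derivation in theory/02-metrics-math.md is the reference for what Phase 20 actually uses.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| over bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; confidence 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece

# Made-up example: the model claims 95% confidence but is right 4 times out of 5.
over_confident = expected_calibration_error([0.95] * 5, [1, 1, 1, 1, 0])
# Same accuracy, but the stated confidence matches it.
well_calibrated = expected_calibration_error([0.8] * 5, [1, 1, 1, 1, 0])

print(f"over-confident ECE:  {over_confident:.2f}")
print(f"well-calibrated ECE: {well_calibrated:.2f}")
```

Both toy models have the same 80% accuracy; only the calibration metric separates them, which is the point of reporting it separately.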

Why adversarial eval is its own slice

Tricky cases are the discriminators in this task. Consider:

Yesterday I goed to the store.

This is a classic over-regularization error — the learner (or model) applied the regular -ed rule to an irregular verb whose past is went. A model that just memorized "verb + ed = past" will accept goed; a model that has actually learned the irregular forms will flag it.

Adversarial probes:

Over-regularization: goed, eated, writed, comed, seed. Trivially detectable by a model that knows irregular forms; trap for a model that overfit the regular rule.
Wrong-person agreement: She work hard. (missing -s). He go to school. It have a name.
Wrong-tense for time marker: Yesterday I work. (present-simple instead of past-simple). Tomorrow she went. (past instead of future).
Auxiliary mismatch: She have eat instead of She has eaten. Tests the perfect-aspect chain.
EN↔ES form mismatch: an English prompt expecting worked, but the candidate set contains trabajó (the Spanish form). A model that conflates the languages will mispick.
Plural confusion: We work (correct — but the model trained only on singular per §A13 might reject it as "wrong-person"). This is an out-of-scope probe used to surface generalization gaps.

Phase 20 carries ~20 such probes in data/eval/adversarial.jsonl. Scores on this set, broken down by trick category, are the real test.
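To illustrate how the over-regularization category can be constructed, the sketch below applies a naive "add -ed" rule to irregular verbs and pairs each wrong form with the correct one. The record fields and the sentence template are hypothetical; the actual schema of data/eval/adversarial.jsonl is not shown here, and the real probes are hand-written.

```python
# Irregular verbs from the over-regularization examples, with their true past forms.
IRREGULAR_PAST = {
    "go": "went",
    "eat": "ate",
    "write": "wrote",
    "come": "came",
    "see": "saw",
}

def naive_regular_past(verb):
    """Apply the regular rule blindly: add -d after a final e, otherwise -ed."""
    return verb + "d" if verb.endswith("e") else verb + "ed"

def overregularization_probes(irregulars):
    """Build one incorrect and one correct probe per irregular verb (hypothetical schema)."""
    probes = []
    for verb, past in irregulars.items():
        wrong = naive_regular_past(verb)
        probes.append({"sentence": f"Yesterday I {wrong}.", "label": "incorrect",
                       "category": "over-regularization", "verb": verb})
        probes.append({"sentence": f"Yesterday I {past}.", "label": "correct",
                       "category": "over-regularization", "verb": verb})
    return probes

adversarial = overregularization_probes(IRREGULAR_PAST)
for probe in adversarial[:2]:
    print(probe["label"], "->", probe["sentence"])
```

Pairing each trap with its correct counterpart keeps the category balanced, so a model that rejects every past-tense sentence cannot score well on it.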

How the report flows

Each checkpoint produces one REPORT.md. Its sections:

1. Header: checkpoint hash, training step, parent manifest hash, eval date, eval set version hash.
2. Headline numbers: PPL (train, val, test), overall classification accuracy, ECE, adversarial accuracy. Three to five numbers.
3. Per-slice tables: accuracy by (language, tense, person, regularity). Sorted by accuracy ascending; the model's weakest slice is at the top.
4. Reliability diagram: embedded PNG.
5. Adversarial slice: table of trick categories × accuracy.
6. Confusion matrix: correct / incorrect / ambiguous (or correct vs. each error type).
7. Pass@k for the generation sub-eval (if applicable: free-form conjugation completion).
8. One-paragraph interpretation: Borja writes this. What does this model do well; what does it not.

The interpretation in §8 is the point. The numbers exist to support the sentence Borja writes about what the model can and can't do.
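For the optional pass@k section, one widely used formulation is the unbiased estimator 1 - C(n-c, k) / C(n, k), where n completions are sampled per prompt and c of them are correct. The source does not say which estimator Phase 20 uses, so treat the sketch below as an assumption.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples is correct,
    given that c of n generated samples were correct."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        return 1.0  # too few wrong samples to fill k draws, so one must be correct
    return 1.0 - comb(n - c, k) / comb(n, k)

# Made-up example: 10 sampled conjugations for one prompt, 3 of them correct.
print(f"pass@1 = {pass_at_k(10, 3, 1):.2f}")
print(f"pass@5 = {pass_at_k(10, 3, 5):.2f}")
```

With k = 1 the estimator reduces to plain accuracy over the samples (c / n), which matches the document's choice to treat pass@k and task accuracy as one bucket for a classification-style task.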

What's NOT in Phase 20

No fine-tuning to fix what's broken. Phase 28's LoRA work targets specific weaknesses surfaced here.
No agent-level eval. That is Phase 32, where the tutor agent's behavior is tested with this model + a prompt + tools; Phase 20 only tests the base model.
No RAG eval. That is Phase 29. The base model has no retrieval here.
No hyperparameter search informed by metrics. Just measurement.

What "done" feels like

You'll know Phase 20 is over when:

experiments/20-eval-report/REPORT.md exists with all eight sections filled.
You can name the model's three worst-performing slices (e.g., "irregular past-simple 1sg", "Spanish past-participle 3sg", "future-tense going to 2sg") and articulate why you suspect each (without retraining).
The reliability diagram is in the report and you can read it: is the model well-calibrated, over, or under?
You have at least one adversarial example where you can demonstrate the model fails in a predictable way, related to a corpus gap or a missing pattern (e.g., the corpus over-represented work, and now the model over-regularizes go).

If you can do those four things, the harness is built and the model is characterized. Phase 21 then uses this characterization to decide which sampling strategy to use for generation.

Phase 20 is the mirror: it does not change the model, it only describes it honestly. Changing the model is the job of other phases.

Next: theory/01-metrics-catalog.md.