
Phase 20: Evaluation Harness

Requires: Phase 19 (Training Dynamics & Debugging)

Teaches: evaluation · accuracy-probes · calibration · adversarial-eval · bootstrap-ci

Jump to any chapter from the phase reference index.

Chapter map

Pre-written per A12. Theory and lab problem statements are stable drafts; solutions are written just-in-time at phase open.

After training, measure. Perplexity alone is not enough: the model must show that the 20 verbs × 5 tenses × 3 persons are its native universe, in both English and Spanish. Here we build the domain-specific test bench.

Goal

Move from "the loss went down" to "the model is good at the task we built it for." Measure perplexity, per-slice classification accuracy, calibration, and adversarial robustness on a hand-labeled probe set. The probe set covers the microscopic scope of English verb grammar defined in LYNX_CORTEX_ADDENDUM.md §A13: 20 verbs × 5 tenses × 3 singular persons (300 verb/tense/person cells), with paired Spanish forms.

By the end of Phase 20, Borja can point at a specific checkpoint and say, with numbers: "This model is 87% accurate on correct-vs-incorrect overall, 95% on regular verbs in present simple, only 71% on irregular past tense, well-calibrated up to confidence 0.8 and over-confident above, and robust to most adversarial tricks except hyper-regularization (e.g., goed, eated) where it falls to 40%."
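A statement like that rests on two small computations: accuracy grouped by slice, and an uncertainty estimate for each number, since a single slice may hold only a handful of probes. The sketch below is a minimal, standard-library illustration, not the harness API (which is not settled yet). The record fields (`tense`, `pred`, `label`) and both function names are hypothetical, and the interval is a plain percentile bootstrap.

```python
import random
from collections import defaultdict


def slice_outcomes(records, key):
    """Group 0/1 correctness outcomes by the value of `key` (hypothetical field)."""
    outcomes = defaultdict(list)
    for r in records:
        outcomes[r[key]].append(1 if r["pred"] == r["label"] else 0)
    return dict(outcomes)


def slice_accuracy(records, key):
    """Return {slice_value: (accuracy, n_examples)}."""
    return {k: (sum(v) / len(v), len(v)) for k, v in slice_outcomes(records, key).items()}


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)  # fixed seed so reports are reproducible
    n = len(outcomes)
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Toy records, invented for illustration only (not real model output).
records = [
    {"tense": "present", "pred": 1, "label": 1},
    {"tense": "present", "pred": 0, "label": 0},
    {"tense": "past", "pred": 1, "label": 0},
    {"tense": "past", "pred": 1, "label": 1},
]
for name, outcomes in slice_outcomes(records, "tense").items():
    lo, hi = bootstrap_ci(outcomes)
    print(f"{name}: acc={sum(outcomes) / len(outcomes):.2f} n={len(outcomes)} CI=[{lo:.2f}, {hi:.2f}]")
```

With slices this small the interval is very wide, which is the point: a per-slice number should be reported together with its `n` and its interval.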

Read order

theory/00-motivation.md — why a single metric is a lie.
theory/01-metrics-catalog.md — perplexity, classification accuracy, pass@k, calibration, adversarial — what each measures, what it doesn't.
theory/02-metrics-math.md — derivations: PPL from CE, pass@k estimator, ECE binning, Brier score.
theory/03-probe-construction.md — how to build a fair probe set, eval-hygiene rules, leak prevention.
lab/00-probe-schema.md — define the probe-set schema and load 50 labeled examples.
lab/01-harness-perplexity-accuracy.md — wire the basic harness; compute PPL + per-slice accuracy.
lab/02-calibration-and-adversarial.md — add calibration metrics and an adversarial slice.
lab/03-report-and-checkpoint-compare.md — produce REPORT.md per checkpoint; compare two checkpoints side-by-side.

solutions/ is empty during pre-write; it is populated at phase open, once Borja's harness API is finalized.
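As a preview of what theory/02-metrics-math.md derives, the four quantities it names can each be written in a few lines. This is a hedged sketch using only the standard library; the function names and argument shapes are assumptions for illustration, not the lab's API. It assumes per-token negative log-likelihoods in nats, confidences in [0, 1], and equal-width bins for ECE.

```python
import math


def perplexity(token_nlls):
    """PPL = exp(mean per-token cross-entropy), with NLLs measured in nats."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def brier_score(probs, labels):
    """Mean squared difference between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of (bin_size / N) * |bin accuracy - bin mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue  # empty bins contribute nothing
        mean_conf = sum(conf for conf, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / total) * abs(accuracy - mean_conf)
    return ece
```

The per-bin `(mean_conf, accuracy)` pairs computed inside `expected_calibration_error` are the same points a reliability diagram plots, so a harness could return them alongside the scalar.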

Definition of Done

See PHASE_20_PLAN.md §6. Briefly:

data/eval/probes.jsonl with ≥ 60 labeled examples covering all 20 verbs across both languages.
data/eval/adversarial.jsonl with ≥ 20 tricky cases (hyper-regularization, wrong-person, wrong-tense, EN/ES form mismatch).
experiments/20-eval-report/REPORT.md for the Phase-18 final checkpoint AND the best-val (Phase-19 overfit) checkpoint.
Per-slice bar charts + reliability diagram + adversarial scores committed.
A written statement of which slices the model handles well or poorly — broken down by language (EN vs ES), regularity (regular vs irregular), tense, and person.
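The first item can be checked mechanically before any model is evaluated. The sketch below assumes a hypothetical probe record with `verb` (English lemma), `lang` (`"en"` or `"es"`), `tense`, `person`, `form`, and `label` fields; the real schema is defined in lab/00-probe-schema.md and may differ. It verifies only the two stated conditions: a minimum probe count and every verb present in both languages.

```python
import json
from collections import defaultdict


def load_probes(lines):
    """Parse JSONL text lines into dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def coverage_gaps(probes, verbs, langs=("en", "es"), min_total=60):
    """Return a list of human-readable problems; an empty list means coverage is met."""
    problems = []
    if len(probes) < min_total:
        problems.append(f"only {len(probes)} probes, need >= {min_total}")
    seen = defaultdict(set)
    for p in probes:
        seen[p["verb"]].add(p["lang"])
    for verb in verbs:
        missing = [lang for lang in langs if lang not in seen[verb]]
        if missing:
            problems.append(f"verb {verb!r} missing languages: {', '.join(missing)}")
    return problems


# Two invented example records in the assumed (hypothetical) schema.
sample = [
    '{"verb": "go", "lang": "en", "tense": "past", "person": 3, "form": "she goed", "label": "incorrect"}',
    '{"verb": "go", "lang": "es", "tense": "past", "person": 3, "form": "ella fue", "label": "correct"}',
]
probes = load_probes(sample)
print(coverage_gaps(probes, verbs=["go", "eat"], min_total=2))
```

In practice the lines would come from `open("data/eval/probes.jsonl", encoding="utf-8")`, with `verbs` set to the curriculum's 20 lemmas.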

What this phase intentionally does NOT cover

Training. Phase 18-19. Phase 20 reads existing checkpoints; it does not train.
Sampling / generation. Phase 21. Here we use only greedy (temperature-0) outputs, and only for classification; full generation evaluation is the next phase.
Eval against a much bigger corpus. The whole point of the probe set is that it is labeled and hand-curated. Scaling probes to thousands is a different exercise.
RAG eval. Phase 29.
Agent eval. Phase 32, with its own grammar-tutor-driven probes.
PyTorch. Phase 24.

Phase 20's scope is measuring the Phase-18-trained model against the curriculum's actual goal: English verb conjugation, with Spanish pairs. Nothing more.