06. The capstone evaluation rubric: how to read a score breakdown
The capstone rubric is the same one that module A4 of the Phase 41 portal uses to evaluate quiz and exam responses: four dimensions, each scored as binary or on a short scale, weighted to produce a 0-1 score. If you read a breakdown, you know which dimension carries the debt; you do not need to inspect the outputs one by one.
Why one rubric for both capstone and portal
The Phase 39 capstone exam (data/exams/phase-39-capstone.yaml and the optional phase-39-capstone-extra.yaml) uses the same shape as the Phase 41 portal's quiz/exam evaluator (see docs/phase-41-learner-portal/theory/04-quizzes-and-exams.md; do not modify). Sharing the rubric is deliberate:
- One source of truth. A learner's portal exam score is computed the same way as the capstone gate's score. No surprises at the gate.
- Composability. The portal can replay capstone items in spaced-repetition mode (Phase 41's SM-2 scheduler) and the same correctness signal applies.
- Debuggability. When a score is unexpectedly low, the dimension breakdown tells you whether it's a correctness, a normalization, or a Spanish-gloss issue.
The four dimensions
The rubric scores every tutor output on four axes. Each axis scores at most 1.0, and the axis weights sum to 1.0:
| Axis | Weight | What it measures | Type |
|---|---|---|---|
| correctness | 0.60 | Does the English correction match the §A13 ground-truth table? | Binary (0 or 1) |
| spanish_gloss | 0.25 | Does the Spanish translation include the load-bearing inflected form? | Binary (0 or 1) |
| schema | 0.10 | Does the response conform to the JSON schema (Phase 30) — required keys, types? | Binary (0 or 1) |
| conciseness | 0.05 | Is the explanation under the word budget (≤ 30 words)? | Binary (0 or 1) |
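Per-item score: \(s = 0.60 \cdot c + 0.25 \cdot g + 0.10 \cdot \sigma + 0.05 \cdot \kappa \in [0, 1]\), where \(c\), \(g\), \(\sigma\), and \(\kappa\) are the correctness, spanish_gloss, schema, and conciseness sub-scores.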
Exam aggregate: arithmetic mean of per-item scores.
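The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the runner's actual code: the names `WEIGHTS`, `item_score`, and `exam_score` are hypothetical, and only the weights themselves come from the table.

```python
from statistics import mean

# Weights copied from the rubric table; the dict name is an assumption.
WEIGHTS = {
    "correctness": 0.60,
    "spanish_gloss": 0.25,
    "schema": 0.10,
    "conciseness": 0.05,
}


def item_score(breakdown: dict[str, float]) -> float:
    """Weighted sum of the four sub-scores (each sub-score is 0 or 1)."""
    return sum(WEIGHTS[axis] * breakdown[axis] for axis in WEIGHTS)


def exam_score(breakdowns: list[dict[str, float]]) -> float:
    """Arithmetic mean of the per-item scores."""
    return mean(item_score(b) for b in breakdowns)


if __name__ == "__main__":
    perfect = {"correctness": 1, "spanish_gloss": 1, "schema": 1, "conciseness": 1}
    no_gloss = {**perfect, "spanish_gloss": 0}
    print(round(item_score(perfect), 2))                # 1.0
    print(round(item_score(no_gloss), 2))               # 0.75
    print(round(exam_score([perfect, no_gloss]), 3))    # 0.875
```

Because every axis is binary, an item can only land on one of sixteen possible scores; a missing Spanish gloss alone always costs exactly 0.25.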
Why this weighting
- Correctness dominates (0.60). The tutor exists to correct grammar. A grammatically wrong answer can't be compensated for by a beautiful Spanish gloss.
- Spanish gloss is heavy (0.25). It is the second pillar of §A13's bilingual policy (A2). A correct-English-only answer is incomplete.
- Schema is small but nonzero (0.10). Phase 30 structured output is enforced by the schema validator; if the response gets past the validator, schema is usually 1. The weight exists so unstructured-fallback answers cost something.
- Conciseness is a tiebreaker (0.05). Explanations that ramble are penalized to encourage tight outputs without dominating the score.
The weights sum to 1.0 by construction. Changing them is a one-way door — re-calibrate against historical exam scores if you do.
How to read a score breakdown
The portal and capstone runner both emit per-item JSON:
    {
      "item": "e-39-01",
      "prompt": "Conjugate 'eat' — past simple, 3rd singular.",
      "expected": ["ate"],
      "got": "Ate.",
      "score": 0.75,
      "breakdown": {
        "correctness": 1.0,
        "spanish_gloss": 0.0,
        "schema": 1.0,
        "conciseness": 1.0
      },
      "weighted": {
        "correctness": 0.60,
        "spanish_gloss": 0.00,
        "schema": 0.10,
        "conciseness": 0.05
      }
    }

Reading this:
- correctness = 1.0 → the English answer matched (after normalization).
- spanish_gloss = 0.0 → no Spanish was returned for this item (or it didn't contain the expected substring).
- schema = 1.0 and conciseness = 1.0 → format fine.
- Aggregate = 0.60 + 0.00 + 0.10 + 0.05 = 0.75. The 0.25 gap is entirely the missing Spanish.
That is the level of insight a breakdown should give. If you only see a single number, you cannot fix the system; with the breakdown, the action item is "the tutor stops emitting Spanish on past-simple items, debug why".
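Reading a breakdown this way can be automated. The sketch below is an assumption-laden illustration: `lost_points` and `check_consistency` are hypothetical helpers, and the only things taken from the source are the rubric weights and the `score` and `breakdown` field names of the per-item JSON.

```python
# Rubric weights from the table; helper names below are hypothetical.
WEIGHTS = {
    "correctness": 0.60,
    "spanish_gloss": 0.25,
    "schema": 0.10,
    "conciseness": 0.05,
}


def lost_points(breakdown: dict[str, float]) -> dict[str, float]:
    """Return {axis: weight lost}, largest loss first, skipping axes at full score."""
    losses = {axis: w * (1.0 - breakdown[axis]) for axis, w in WEIGHTS.items()}
    ranked = sorted(
        ((axis, loss) for axis, loss in losses.items() if loss > 0),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return dict(ranked)


def check_consistency(item: dict, tol: float = 1e-9) -> bool:
    """True when the reported score equals the weighted sum of the breakdown."""
    expected = sum(WEIGHTS[axis] * item["breakdown"][axis] for axis in WEIGHTS)
    return abs(item["score"] - expected) <= tol


if __name__ == "__main__":
    # Hypothetical item: correct and glossed, but unstructured and too long.
    item = {
        "score": 0.85,
        "breakdown": {"correctness": 1.0, "spanish_gloss": 1.0,
                      "schema": 0.0, "conciseness": 0.0},
    }
    print(list(lost_points(item["breakdown"])))  # ['schema', 'conciseness']
    print(check_consistency(item))               # True
```

A consistency check like this is cheap to run on every emitted item and catches reports whose headline score disagrees with its own breakdown.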
Reading patterns across items (the dashboard view)
For an exam of N items, plot the four sub-scores as a stacked bar per item. Patterns to look for:
- Column missing across items → the model has a systematic miss (e.g., all "schema" zeros → the JSON formatter is broken).
- Row of zeros for a specific verb → the verb is mis-learned (e.g., all "go" items are wrong → check the corpus for that verb).
- Conciseness alone trending down → the model is verbose; the prompt template may have changed.
- One outlier item → check the normalization (this is what the /break exercise is about).
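The first two patterns can also be detected without a plot. The following is a rough sketch under stated assumptions: the helper names are hypothetical, and so is the `verb` field used for grouping, which does not appear in the per-item JSON shown earlier.

```python
from collections import defaultdict

AXES = ("correctness", "spanish_gloss", "schema", "conciseness")


def axis_means(items: list[dict]) -> dict[str, float]:
    """Mean sub-score per axis across all items."""
    return {
        axis: sum(it["breakdown"][axis] for it in items) / len(items)
        for axis in AXES
    }


def systematic_misses(items: list[dict]) -> list[str]:
    """Axes that score zero on every item (the 'column missing' pattern)."""
    return [axis for axis, m in axis_means(items).items() if m == 0]


def failing_groups(items: list[dict], key: str, axis: str = "correctness") -> list[str]:
    """Group labels whose items all score zero on `axis` (the 'row of zeros' pattern)."""
    groups = defaultdict(list)
    for it in items:
        groups[it[key]].append(it["breakdown"][axis])
    return sorted(label for label, scores in groups.items() if not any(scores))


if __name__ == "__main__":
    # Hypothetical items; "verb" is an assumed grouping field.
    items = [
        {"verb": "go", "breakdown": {"correctness": 0, "spanish_gloss": 1, "schema": 0, "conciseness": 1}},
        {"verb": "go", "breakdown": {"correctness": 0, "spanish_gloss": 0, "schema": 0, "conciseness": 1}},
        {"verb": "eat", "breakdown": {"correctness": 1, "spanish_gloss": 1, "schema": 0, "conciseness": 1}},
    ]
    print(systematic_misses(items))        # ['schema']
    print(failing_groups(items, "verb"))   # ['go']
```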
Normalization rules (the source of false negatives)
For the two text-matching axes (correctness, spanish_gloss), the comparison runs after a normalization step:
- Strip whitespace at start/end.
- Lowercase the string.
- Strip a trailing period if present.
- Substring match on expected_contains entries: every required substring must appear in the normalized output.
If any of these steps is missing or broken, a correct answer can score zero. The /break exercise removes the case-fold step and demonstrates the resulting false-negative cascade.
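The four steps can be sketched as follows. This is an illustration rather than the evaluator's real code: `normalize` and `matches` are hypothetical names, the `case_fold` switch exists only to reproduce the failure, and the sketch assumes the `expected_contains` entries are already stored in lowercase.

```python
def normalize(text: str, case_fold: bool = True) -> str:
    """Strip outer whitespace, lowercase, then drop one trailing period."""
    text = text.strip()
    if case_fold:
        text = text.lower()
    if text.endswith("."):
        text = text[:-1]
    return text


def matches(output: str, expected_contains: list[str], case_fold: bool = True) -> bool:
    """True when every required substring appears in the normalized output.

    Assumes the expected substrings are already lowercase.
    """
    normalized = normalize(output, case_fold=case_fold)
    return all(sub in normalized for sub in expected_contains)


if __name__ == "__main__":
    print(matches("  Ate. ", ["ate"]))                  # True
    print(matches("  Ate. ", ["ate"], case_fold=False))  # False: a false negative
```

With case folding disabled, a capitalized but correct answer scores zero on correctness, which is the kind of false negative the /break exercise is built around.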
Calibrating the rubric
Once you have a stable rubric, you should also have a "stable distribution" — the score the current production tutor achieves on the regression set. The eval gate (Phase 38 theory 06) compares new releases against this stable distribution:
- 0.95+ on the capstone exam — the production target.
- ≤ 2% drop on the regression set — the eval-gate threshold.
If a release jumps from 0.95 to 0.98, that's a real improvement — but verify it's not a rubric bug (Phase 38 should also CI the rubric itself). If a release drops from 0.95 to 0.85, that's a regression — block. The 2% threshold leaves room for noise without papering over real problems.
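A comparison against the stable baseline might look like the sketch below. Two things here are assumptions, not statements from the source: the 2% threshold is read as a relative drop against the baseline, and the same 2% band is reused to flag unusually large improvements for a rubric check. The function name `eval_gate` and its return labels are hypothetical.

```python
def eval_gate(baseline: float, candidate: float, max_drop: float = 0.02) -> str:
    """Classify a release by its relative score change against the baseline.

    Assumption: "2% drop" means a relative drop, (baseline - candidate) / baseline.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    change = (candidate - baseline) / baseline
    if change < -max_drop:
        return "block"    # regression beyond the threshold
    if change > max_drop:
        return "verify"   # large jump: confirm it is not a rubric bug
    return "pass"         # within the noise band


if __name__ == "__main__":
    print(eval_gate(0.95, 0.85))  # block
    print(eval_gate(0.95, 0.98))  # verify
    print(eval_gate(0.95, 0.94))  # pass
```

If the threshold is meant as an absolute drop of 0.02 score points instead, only the `change` computation needs to be adjusted.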
The capstone gate (the §A13 DoD bar)
For Phase 39 to close, the tutor must score ≥ 0.92 on the capstone exam (data/exams/phase-39-capstone.yaml + phase-39-capstone-extra.yaml if it exists). This is the hard gate.
Soft gates (warn but don't block):
- p50 latency ≤ 250 ms (from Phase 33 theory 05's budget).
- Zero schema failures on the regression set.
- Eval gate ≥ 0.93 on the broader regression set.
The hard gate is binary; the soft gates are signals for the Phase 40 hardening pass.
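One way to encode the split between the hard gate and the soft gates is sketched below. The `ReleaseMetrics` dataclass, its field names, and `check_gates` are all hypothetical; only the thresholds (0.92, 250 ms, zero schema failures, 0.93) come from this section.

```python
from dataclasses import dataclass


@dataclass
class ReleaseMetrics:
    """Hypothetical container for the numbers the gates need."""
    capstone_score: float
    p50_latency_ms: float
    schema_failures: int
    regression_score: float


def check_gates(m: ReleaseMetrics) -> tuple[bool, list[str]]:
    """Return (hard gate passed, soft-gate warnings).

    Only the hard gate decides pass/fail; soft gates only produce warnings.
    """
    passed = m.capstone_score >= 0.92
    warnings = []
    if m.p50_latency_ms > 250:
        warnings.append("p50 latency above 250 ms")
    if m.schema_failures > 0:
        warnings.append("schema failures on the regression set")
    if m.regression_score < 0.93:
        warnings.append("eval gate below 0.93 on the regression set")
    return passed, warnings


if __name__ == "__main__":
    metrics = ReleaseMetrics(capstone_score=0.94, p50_latency_ms=310,
                             schema_failures=0, regression_score=0.95)
    print(check_gates(metrics))  # (True, ['p50 latency above 250 ms'])
```

Keeping the warnings separate from the boolean mirrors the text: a slow but accurate release still closes Phase 39 and leaves a list of items for the Phase 40 hardening pass.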
What this chapter does NOT cover
- Inter-annotator agreement for the gold labels — irrelevant at §A13 scope; the table is canonical.
- Adversarial eval — Phase 37 covers prompt-injection eval separately.
- Held-out exam item generation: see docs/phase-41-learner-portal/theory/04-quizzes-and-exams.md for the portal's content pipeline.
- Token-level evaluation (BLEU, ROUGE): not used here; the §A13 scope is small enough for exact-match rubrics.
Reference
- Liang et al., "Holistic Evaluation of Language Models (HELM)" (Stanford CRFM, 2022). The multi-axis evaluation pattern this rubric follows.
- Eberle et al., "On the Use of Aggregate Scores in NLP Evaluation" (2022). Why per-axis breakdowns beat single aggregates.