
06 - The capstone evaluation rubric; how to read a score breakdown

The capstone rubric is the same one that module A4 of the Phase 41 portal uses to grade quiz and exam answers: four dimensions, each scored as binary or on a short scale, weighted to produce a score from 0 to 1. If you read a breakdown, you know which dimension carries the debt; you do not need to inspect the outputs one by one.

Why one rubric for both capstone and portal

The Phase 39 capstone exam (data/exams/phase-39-capstone.yaml and the optional phase-39-capstone-extra.yaml) has the same shape as the exams consumed by the Phase 41 portal's quiz/exam evaluator (see docs/phase-41-learner-portal/theory/04-quizzes-and-exams.md; do not modify). Sharing the rubric is deliberate:

One source of truth. A learner's portal exam score is computed the same way as the capstone gate's score. No surprises at the gate.
Composability. The portal can replay capstone items in spaced-repetition mode (Phase 41's SM-2 scheduler) and the same correctness signal applies.
Debuggability. When a score is unexpectedly low, the dimension breakdown tells you whether it's a correctness, a normalization, or a Spanish-gloss issue.

The four dimensions

The rubric scores every tutor output on four axes. Maximum 1.0 per axis, weighted to sum to 1.0 overall:

| Axis | Weight | What it measures | Type |
|---|---|---|---|
| correctness | 0.60 | Does the English correction match the §A13 ground-truth table? | Binary (0 or 1) |
| spanish_gloss | 0.25 | Does the Spanish translation include the load-bearing inflected form? | Binary (0 or 1) |
| schema | 0.10 | Does the response conform to the JSON schema (Phase 30): required keys, types? | Binary (0 or 1) |
| conciseness | 0.05 | Is the explanation under the word budget (≤ 30 words)? | Binary (0 or 1) |

Per-item score: \(s = 0.60 \cdot c + 0.25 \cdot g + 0.10 \cdot \sigma + 0.05 \cdot \kappa \in [0, 1]\).

Exam aggregate: arithmetic mean of per-item scores.
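The per-item formula and the exam aggregate can be sketched in a few lines of Python. This is a minimal illustration, not the portal's or the capstone runner's actual code: the names `WEIGHTS`, `item_score`, and `exam_score` are hypothetical, and only the weights and the arithmetic-mean rule come from this chapter.

```python
from statistics import fmean

# Weights copied from the table above (hypothetical constant name).
WEIGHTS = {
    "correctness": 0.60,
    "spanish_gloss": 0.25,
    "schema": 0.10,
    "conciseness": 0.05,
}


def item_score(breakdown: dict) -> float:
    """Weighted sum of the four binary axis scores for one item."""
    return sum(WEIGHTS[axis] * breakdown[axis] for axis in WEIGHTS)


def exam_score(breakdowns: list) -> float:
    """Arithmetic mean of the per-item scores."""
    return fmean([item_score(b) for b in breakdowns])


# Two illustrative items: one perfect, one missing the Spanish gloss.
perfect = {"correctness": 1, "spanish_gloss": 1, "schema": 1, "conciseness": 1}
no_gloss = {"correctness": 1, "spanish_gloss": 0, "schema": 1, "conciseness": 1}

print(round(item_score(no_gloss), 2))              # 0.75
print(round(exam_score([perfect, no_gloss]), 3))   # 0.875
```

Because every axis is binary, an item can only land on one of sixteen possible scores; a missing gloss alone always costs exactly the gloss weight.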

Why this weighting

Correctness dominates (0.60). The tutor exists to correct grammar. A grammatically wrong answer can't be compensated for by a beautiful Spanish gloss.
Spanish gloss is heavy (0.25). It is the second pillar of §A13's bilingual policy (A2). A correct-English-only answer is incomplete.
Schema is small but nonzero (0.10). Phase 30 structured output is enforced by the schema validator; if the response gets past the validator, schema is usually 1. The weight exists so unstructured-fallback answers cost something.
Conciseness is a tiebreaker (0.05). Explanations that ramble are penalized to encourage tight outputs without dominating the score.

The weights sum to 1.0 by construction. Changing them is a one-way door: scores computed under the old weights stop being comparable with new ones, so re-calibrate against historical exam scores if you do.

How to read a score breakdown

The portal and capstone runner both emit per-item JSON:

{
  "item": "e-39-01",
  "prompt": "Conjugate 'eat' — past simple, 3rd singular.",
  "expected": ["ate"],
  "got": "Ate.",
  "score": 0.75,
  "breakdown": {
    "correctness": 1.0,
    "spanish_gloss": 0.0,
    "schema": 1.0,
    "conciseness": 1.0
  },
  "weighted": {
    "correctness": 0.60,
    "spanish_gloss": 0.00,
    "schema": 0.10,
    "conciseness": 0.05
  }
}

Reading this:

- correctness = 1.0 → the English answer matched (after normalization).
- spanish_gloss = 0.0 → no Spanish was returned for this item (or it didn't contain the expected substring).
- schema = 1.0 and conciseness = 1.0 → format fine.
- Item score = 0.60 + 0.00 + 0.10 + 0.05 = 0.75. The 0.25 gap is entirely the missing Spanish.

That is the level of insight a breakdown should give. With only a single number you cannot tell what to fix; with the breakdown, the action item is concrete: "the tutor stops emitting Spanish on past-simple items; debug why".
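Reading a breakdown can also be done mechanically. The sketch below, under the assumption that records carry a `breakdown` object like the one shown above, reports how many weighted points each axis failed to earn and names the axis with the largest debt. `points_lost` is a hypothetical helper, not part of the runner.

```python
import json

WEIGHTS = {
    "correctness": 0.60,
    "spanish_gloss": 0.25,
    "schema": 0.10,
    "conciseness": 0.05,
}


def points_lost(breakdown: dict) -> dict:
    """Map each axis to the weighted points it failed to earn."""
    return {
        axis: round(WEIGHTS[axis] * (1.0 - breakdown[axis]), 2)
        for axis in WEIGHTS
    }


# Trimmed copy of the per-item record from this section.
record = json.loads("""
{
  "item": "e-39-01",
  "breakdown": {
    "correctness": 1.0,
    "spanish_gloss": 0.0,
    "schema": 1.0,
    "conciseness": 1.0
  }
}
""")

lost = points_lost(record["breakdown"])
worst = max(lost, key=lost.get)  # axis carrying the largest debt
print(worst, lost[worst])        # spanish_gloss 0.25
```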

Reading patterns across items (the dashboard view)

For an exam of N items, plot the four sub-scores as a stacked bar per item. Patterns to look for:

One axis at zero across all items → the model has a systematic miss (e.g., every schema score is 0 → the JSON formatter is broken).
Zero correctness on every item for a specific verb → the verb is mis-learned (e.g., all go items are wrong → check the corpus for that verb).
Conciseness alone trending down → the model is verbose; the prompt template may have changed.
One outlier item → check the normalization (this is what the /break exercise is about).
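The first two patterns above can be detected without a plot. The following is a rough sketch under stated assumptions: the `verb` tag on each record is hypothetical (the per-item JSON shown earlier does not include it), the item ids are made up, and the helper names are illustrative only.

```python
from collections import defaultdict

AXES = ("correctness", "spanish_gloss", "schema", "conciseness")


def axis_means(records):
    """Mean sub-score per axis across all items."""
    return {
        axis: sum(r["breakdown"][axis] for r in records) / len(records)
        for axis in AXES
    }


def systematic_misses(records):
    """Axes that score 0 on every single item."""
    return [axis for axis, mean in axis_means(records).items() if mean == 0.0]


def correctness_by_verb(records):
    """Mean correctness per verb; 'verb' is a hypothetical tag on each record."""
    groups = defaultdict(list)
    for r in records:
        groups[r["verb"]].append(r["breakdown"]["correctness"])
    return {verb: sum(vals) / len(vals) for verb, vals in groups.items()}


def row(c, g, s, k):
    """Build a breakdown dict from four axis scores, in AXES order."""
    return dict(zip(AXES, (c, g, s, k)))


# Made-up records: schema fails everywhere, and both 'go' items are wrong.
records = [
    {"item": "x-01", "verb": "go", "breakdown": row(0, 1, 0, 1)},
    {"item": "x-02", "verb": "go", "breakdown": row(0, 1, 0, 1)},
    {"item": "x-03", "verb": "eat", "breakdown": row(1, 1, 0, 1)},
]

print(systematic_misses(records))    # ['schema']
print(correctness_by_verb(records))  # {'go': 0.0, 'eat': 1.0}
```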

Normalization rules (the source of false negatives)

For the two string-matching axes (correctness, spanish_gloss), the comparison is done after a normalization step:

Strip whitespace at start/end.
Lowercase the string.
Strip a trailing period if present.
Substring match on expected_contains entries — every required substring must appear in the normalized output.

If any of these steps is missing or broken, a correct answer can score zero. The /break exercise removes the case-fold step and demonstrates the resulting false-negative cascade.
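The four steps above can be sketched as follows. This is an illustration under assumptions, not the evaluator's real code: `normalize` and `matches` are hypothetical names, the `case_fold` flag exists only to reproduce the failure mode described here, and the expected substrings are assumed to be stored already lowercased.

```python
def normalize(text: str, case_fold: bool = True) -> str:
    """Apply the normalization steps in the order listed above."""
    text = text.strip()           # 1. strip whitespace at start/end
    if case_fold:
        text = text.lower()       # 2. lowercase
    if text.endswith("."):
        text = text[:-1]          # 3. strip one trailing period
    return text


def matches(output: str, expected_contains: list, case_fold: bool = True) -> bool:
    """4. Every required substring must appear in the normalized output."""
    normalized = normalize(output, case_fold)
    return all(sub in normalized for sub in expected_contains)


print(matches("Ate.", ["ate"]))                   # True
print(matches("Ate.", ["ate"], case_fold=False))  # False: a false negative
```

With the lowercase step disabled, the correct answer "Ate." no longer contains the substring "ate", so correctness drops to 0 and the item loses 0.60 points for a purely cosmetic difference.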

Calibrating the rubric

Once you have a stable rubric, you should also have a "stable distribution" — the score the current production tutor achieves on the regression set. The eval gate (Phase 38 theory 06) compares new releases against this stable distribution:

0.95+ on the capstone exam — the production target.
≤ 2% drop on the regression set — the eval-gate threshold.

If a release jumps from 0.95 to 0.98, that's a real improvement — but verify it's not a rubric bug (Phase 38 should also CI the rubric itself). If a release drops from 0.95 to 0.85, that's a regression — block. The 2% threshold leaves room for noise without papering over real problems.
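A minimal sketch of the regression comparison follows. It assumes the 2% threshold means an absolute drop of 0.02 in the aggregate score; this chapter does not say whether the threshold is absolute or relative, so treat that reading, the function name `eval_gate`, and the small float tolerance as assumptions.

```python
def eval_gate(baseline: float, candidate: float, max_drop: float = 0.02) -> str:
    """Return 'block' if the candidate falls more than max_drop below baseline.

    Assumption: max_drop is an absolute difference in aggregate score.
    A tiny tolerance keeps float rounding from flipping borderline cases.
    """
    drop = baseline - candidate
    return "block" if drop > max_drop + 1e-9 else "pass"


print(eval_gate(0.95, 0.85))  # block: a 0.10 drop is a regression
print(eval_gate(0.95, 0.94))  # pass: within the noise allowance
print(eval_gate(0.95, 0.98))  # pass: improvement, but still check the rubric
```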

The capstone gate (the §A13 DoD bar)

For Phase 39 to close, the tutor must score ≥ 0.92 on the capstone exam (data/exams/phase-39-capstone.yaml + phase-39-capstone-extra.yaml if it exists). This is the hard gate.

Soft gates (warn but don't block):

- p50 latency ≤ 250 ms (from Phase 33 theory 05's budget).
- Zero schema failures on the regression set.
- Eval gate ≥ 0.93 on the broader regression set.

The hard gate is binary; the soft gates are signals for the Phase 40 hardening pass.
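The split between the blocking gate and the warning-only gates could be expressed like this. The thresholds come from this section; the `RunMetrics` container, its field names, and `check_gates` are hypothetical and only illustrate the shape of the check.

```python
from dataclasses import dataclass


@dataclass
class RunMetrics:
    """Hypothetical container for one evaluation run's numbers."""
    capstone_score: float     # aggregate score on the capstone exam
    p50_latency_ms: float     # median latency in milliseconds
    schema_failures: int      # schema failures on the regression set
    regression_score: float   # aggregate score on the broader regression set


def check_gates(m: RunMetrics):
    """Return (hard_gate_passed, soft_gate_warnings)."""
    warnings = []
    if m.p50_latency_ms > 250:
        warnings.append("p50 latency above 250 ms")
    if m.schema_failures > 0:
        warnings.append("schema failures on the regression set")
    if m.regression_score < 0.93:
        warnings.append("regression-set score below 0.93")
    # Only the capstone score blocks; soft gates are reported, never enforced.
    return m.capstone_score >= 0.92, warnings


passed, warnings = check_gates(RunMetrics(0.93, 300.0, 0, 0.95))
print(passed, warnings)  # True ['p50 latency above 250 ms']
```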

What this chapter does NOT cover

Inter-annotator agreement for the gold labels — irrelevant at §A13 scope; the table is canonical.
Adversarial eval — Phase 37 covers prompt-injection eval separately.
Held-out exam item generation — see docs/phase-41-learner-portal/theory/04-quizzes-and-exams.md for the portal's content pipeline.
Token-level evaluation (BLEU, ROUGE) — not used here; the §A13 scope is small enough for exact-match rubrics.

Reference

Liang et al., "Holistic Evaluation of Language Models (HELM)" (Stanford CRFM, 2022). The multi-axis evaluation pattern this rubric follows.
Eberle et al., "On the Use of Aggregate Scores in NLP Evaluation" (2022). Why per-axis breakdowns beat single aggregates.

Next: ../break/00-break-rubric-case-sensitive.md.