06 — The capstone evaluation rubric; how to read a score breakdown

The capstone rubric is the same one that module A4 of the Phase 41 portal uses to grade quiz and exam answers: four dimensions, each scored as binary or on a short scale, weighted to produce a 0-1 score. If you read a breakdown, you know which dimension carries the debt; you do not need to inspect the outputs one by one.


Why one rubric for both capstone and portal

The Phase 39 capstone exam (data/exams/phase-39-capstone.yaml and the optional phase-39-capstone-extra.yaml) is the same shape used by the Phase 41 portal's quiz/exam evaluator (see docs/phase-41-learner-portal/theory/04-quizzes-and-exams.md — do not modify). Sharing the rubric is deliberate:

  • One source of truth. A learner's portal exam score is computed the same way as the capstone gate's score. No surprises at the gate.
  • Composability. The portal can replay capstone items in spaced-repetition mode (Phase 41's SM-2 scheduler) and the same correctness signal applies.
  • Debuggability. When a score is unexpectedly low, the dimension breakdown tells you whether it's a correctness, a normalization, or a Spanish-gloss issue.

The four dimensions

The rubric scores every tutor output on four axes. Maximum 1.0 per axis, weighted to sum to 1.0 overall:

| Axis | Weight | What it measures | Type |
|---|---|---|---|
| correctness | 0.60 | Does the English correction match the §A13 ground-truth table? | Binary (0 or 1) |
| spanish_gloss | 0.25 | Does the Spanish translation include the load-bearing inflected form? | Binary (0 or 1) |
| schema | 0.10 | Does the response conform to the JSON schema (Phase 30): required keys, types? | Binary (0 or 1) |
| conciseness | 0.05 | Is the explanation under the word budget (≤ 30 words)? | Binary (0 or 1) |

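Per-item score: \(s = 0.60 \cdot c + 0.25 \cdot g + 0.10 \cdot \sigma + 0.05 \cdot \kappa \in [0, 1]\), where \(c\), \(g\), \(\sigma\), and \(\kappa\) are the correctness, spanish_gloss, schema, and conciseness sub-scores, each 0 or 1.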

Exam aggregate: arithmetic mean of per-item scores.
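The per-item formula and the exam aggregate can be sketched in a few lines of Python. This is a minimal illustration, not the runner's actual code: the names `WEIGHTS`, `item_score`, and `exam_score` are hypothetical, and rounding to four decimals is an assumption made only to keep float noise out of the result.

```python
from statistics import mean

# Weights copied from the rubric table; the dict name is illustrative.
WEIGHTS = {
    "correctness": 0.60,
    "spanish_gloss": 0.25,
    "schema": 0.10,
    "conciseness": 0.05,
}


def item_score(breakdown: dict[str, float]) -> float:
    """Weighted sum of the four sub-scores for one item."""
    missing = WEIGHTS.keys() - breakdown.keys()
    if missing:
        raise ValueError(f"breakdown is missing axes: {sorted(missing)}")
    total = sum(WEIGHTS[axis] * breakdown[axis] for axis in WEIGHTS)
    return round(total, 4)  # rounding is an assumption, to hide float noise


def exam_score(breakdowns: list[dict[str, float]]) -> float:
    """Arithmetic mean of the per-item scores."""
    return round(mean(item_score(b) for b in breakdowns), 4)


perfect = {"correctness": 1, "spanish_gloss": 1, "schema": 1, "conciseness": 1}
no_gloss = dict(perfect, spanish_gloss=0)
print(item_score(no_gloss))             # 0.75
print(exam_score([perfect, no_gloss]))  # 0.875
```

Because every axis is binary, a per-item score can only take one of sixteen values, which makes an unexpected number easy to trace back to the axes that failed.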

Why this weighting

  • Correctness dominates (0.60). The tutor exists to correct grammar. A grammatically wrong answer can't be compensated for by a beautiful Spanish gloss.
  • Spanish gloss is heavy (0.25). It is the second pillar of §A13's bilingual policy (A2). A correct-English-only answer is incomplete.
  • Schema is small but nonzero (0.10). Phase 30 structured output is enforced by the schema validator; if the response gets past the validator, schema is usually 1. The weight exists so unstructured-fallback answers cost something.
  • Conciseness is a tiebreaker (0.05). Explanations that ramble are penalized to encourage tight outputs without dominating the score.

The weights sum to 1.0 by construction. Changing them is a one-way door — re-calibrate against historical exam scores if you do.

How to read a score breakdown

The portal and capstone runner both emit per-item JSON:

{
  "item": "e-39-01",
  "prompt": "Conjugate 'eat' — past simple, 3rd singular.",
  "expected": ["ate"],
  "got": "Ate.",
  "score": 0.75,
  "breakdown": {
    "correctness": 1.0,
    "spanish_gloss": 0.0,
    "schema": 1.0,
    "conciseness": 1.0
  },
  "weighted": {
    "correctness": 0.60,
    "spanish_gloss": 0.00,
    "schema": 0.10,
    "conciseness": 0.05
  }
}

Reading this:

  • correctness = 1.0 → the English answer matched (after normalization).
  • spanish_gloss = 0.0 → no Spanish was returned for this item (or it didn't contain the expected substring).
  • schema = 1.0 and conciseness = 1.0 → the format is fine.
  • Score = 0.60 + 0.00 + 0.10 + 0.05 = 0.75. The 0.25 gap is entirely the missing Spanish.

That is the level of insight a breakdown should give. If you only see a single number, you cannot fix the system; with the breakdown, the action item is "the tutor stops emitting Spanish on past-simple items, debug why".
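A breakdown is only useful if it is internally consistent: each weighted entry should equal the weight times the sub-score, and the score should equal the sum of the weighted entries. The sketch below checks that for one record with the keys shown above. The function name `check_item` and the tolerance are assumptions for illustration, not part of the portal or the capstone runner.

```python
import math

# Weights from the rubric table; names here are illustrative.
WEIGHTS = {
    "correctness": 0.60,
    "spanish_gloss": 0.25,
    "schema": 0.10,
    "conciseness": 0.05,
}


def check_item(record: dict) -> list[str]:
    """Return a list of inconsistencies found in one per-item record."""
    problems = []
    for axis, weight in WEIGHTS.items():
        expected = weight * record["breakdown"][axis]
        actual = record["weighted"][axis]
        if not math.isclose(actual, expected, abs_tol=1e-9):
            problems.append(f"weighted[{axis}] is {actual}, expected {expected:.2f}")
    total = sum(record["weighted"].values())
    if not math.isclose(record["score"], total, abs_tol=1e-9):
        problems.append(f"score is {record['score']}, weighted sum is {total:.2f}")
    return problems


record = {
    "score": 0.75,
    "breakdown": {"correctness": 1.0, "spanish_gloss": 0.0,
                  "schema": 1.0, "conciseness": 1.0},
    "weighted": {"correctness": 0.60, "spanish_gloss": 0.00,
                 "schema": 0.10, "conciseness": 0.05},
}
print(check_item(record))                    # []
print(check_item(dict(record, score=0.95)))  # one problem: score vs. weighted sum
```

Running a check like this over every emitted record is a cheap way to catch a rubric bug before it is mistaken for a change in tutor quality.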

Reading patterns across items (the dashboard view)

For an exam of N items, plot the four sub-scores as a stacked bar per item. Patterns to look for:

  • The same segment missing from every bar → the model has a systematic miss (e.g., all "schema" zeros → the JSON formatter is broken).
  • Every item for a specific verb scoring zero on correctness → the verb is mis-learned (e.g., all go items are wrong → check the corpus for that verb).
  • Conciseness alone trending down → the model is verbose; the prompt template may have changed.
  • One outlier item → check the normalization (this is what the /break exercise is about).
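The first two patterns can also be detected without a plot. The sketch below assumes each item record carries a `breakdown` dict as in the JSON above plus a `verb` tag; the `verb` field and the function names are hypothetical additions for illustration, since the emitted JSON shown earlier has no such field.

```python
from collections import defaultdict

AXES = ("correctness", "spanish_gloss", "schema", "conciseness")


def axis_pass_rates(items: list[dict]) -> dict[str, float]:
    """Fraction of items scoring 1 on each axis."""
    n = len(items)
    return {axis: sum(it["breakdown"][axis] for it in items) / n for axis in AXES}


def all_zero_axes(items: list[dict]) -> list[str]:
    """Axes on which no item scored: the 'segment missing everywhere' pattern."""
    return [axis for axis, rate in axis_pass_rates(items).items() if rate == 0]


def failing_groups(items: list[dict], key: str = "verb",
                   axis: str = "correctness") -> list[str]:
    """Group values (e.g. verbs) for which every item failed on `axis`."""
    groups = defaultdict(list)
    for it in items:
        groups[it[key]].append(it["breakdown"][axis])
    return sorted(g for g, scores in groups.items() if not any(scores))


items = [
    {"verb": "go", "breakdown": {"correctness": 0, "spanish_gloss": 1,
                                 "schema": 0, "conciseness": 1}},
    {"verb": "go", "breakdown": {"correctness": 0, "spanish_gloss": 1,
                                 "schema": 0, "conciseness": 1}},
    {"verb": "eat", "breakdown": {"correctness": 1, "spanish_gloss": 1,
                                  "schema": 0, "conciseness": 1}},
]
print(all_zero_axes(items))   # ['schema']
print(failing_groups(items))  # ['go']
```

Tracking `axis_pass_rates` across releases would surface the third pattern (conciseness trending down) in the same way.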

Normalization rules (the source of false negatives)

For the two string-matched axes (correctness, spanish_gloss), the comparison runs after a normalization step:

  1. Strip whitespace at start/end.
  2. Lowercase the string.
  3. Strip a trailing period if present.
  4. Substring match on expected_contains entries — every required substring must appear in the normalized output.

If any of these steps is missing or broken, a correct answer can score zero. The /break exercise removes the case-fold step and demonstrates the resulting false-negative cascade.
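The four steps can be sketched as follows. This is an illustrative version under stated assumptions, not the rubric's actual implementation: the names `normalize` and `matches` are hypothetical, the expected substrings are assumed to be stored already lowercased, and the `case_fold` switch exists only to reproduce the kind of false negative the /break exercise demonstrates.

```python
def normalize(text: str, case_fold: bool = True) -> str:
    """Apply steps 1-3: trim, lowercase, drop one trailing period."""
    out = text.strip()
    if case_fold:
        out = out.lower()
    if out.endswith("."):
        out = out[:-1]
    return out


def matches(output: str, expected_contains: list[str],
            case_fold: bool = True) -> bool:
    """Step 4: every required substring must appear in the normalized output."""
    normalized = normalize(output, case_fold)
    # Assumption: entries in expected_contains are already lowercase.
    return all(sub in normalized for sub in expected_contains)


print(matches("  Ate. ", ["ate"]))                   # True
print(matches("  Ate. ", ["ate"], case_fold=False))  # False: a false negative
```

With case folding disabled, a correct answer that happens to start with a capital letter scores zero on correctness, which costs 0.60 of the item score.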

Calibrating the rubric

Once you have a stable rubric, you should also have a "stable distribution" — the score the current production tutor achieves on the regression set. The eval gate (Phase 38 theory 06) compares new releases against this stable distribution:

  • 0.95+ on the capstone exam — the production target.
  • ≤ 2% drop on the regression set — the eval-gate threshold.

If a release jumps from 0.95 to 0.98, that's a real improvement — but verify it's not a rubric bug (Phase 38 should also CI the rubric itself). If a release drops from 0.95 to 0.85, that's a regression — block. The 2% threshold leaves room for noise without papering over real problems.
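A minimal sketch of that comparison, assuming the 2% threshold is read as an absolute drop of 0.02 in the mean score (the text does not say whether it is absolute or relative). The function name `eval_gate` and the returned status strings are hypothetical.

```python
def eval_gate(baseline: float, candidate: float, max_drop: float = 0.02) -> str:
    """Compare a candidate release against the stable baseline score."""
    delta = round(candidate - baseline, 4)  # rounding avoids float-noise flips
    if delta < -max_drop:
        return "block"  # regression beyond the allowed drop
    if delta > 0:
        return "pass-verify-rubric"  # improvement: rule out a rubric bug first
    return "pass"


print(eval_gate(0.95, 0.98))  # pass-verify-rubric
print(eval_gate(0.95, 0.85))  # block
print(eval_gate(0.95, 0.93))  # pass
```

If the threshold is meant as a relative drop instead, the comparison would use `delta / baseline` rather than `delta`.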

The capstone gate (the §A13 DoD bar)

For Phase 39 to close, the tutor must score ≥ 0.92 on the capstone exam (data/exams/phase-39-capstone.yaml + phase-39-capstone-extra.yaml if it exists). This is the hard gate.

Soft gates (warn but don't block):

  • p50 latency ≤ 250 ms (from Phase 33 theory 05's budget).
  • Zero schema failures on the regression set.
  • Eval gate ≥ 0.93 on the broader regression set.

The hard gate is binary; the soft gates are signals for the Phase 40 hardening pass.
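The split between the hard gate and the soft gates could be expressed like this. It is a sketch under assumptions: the `RunMetrics` fields and the `capstone_gate` function are hypothetical names, and only the thresholds come from the text above.

```python
from dataclasses import dataclass


@dataclass
class RunMetrics:
    """Hypothetical container for one evaluation run's summary numbers."""
    capstone_score: float    # mean score on the capstone exam
    p50_latency_ms: float    # median latency
    schema_failures: int     # schema failures on the regression set
    regression_score: float  # mean score on the broader regression set


def capstone_gate(m: RunMetrics) -> tuple[bool, list[str]]:
    """Return (hard gate passed, soft-gate warnings)."""
    warnings = []
    if m.p50_latency_ms > 250:
        warnings.append(f"p50 latency {m.p50_latency_ms} ms exceeds 250 ms")
    if m.schema_failures > 0:
        warnings.append(f"{m.schema_failures} schema failure(s) on the regression set")
    if m.regression_score < 0.93:
        warnings.append(f"regression score {m.regression_score} is below 0.93")
    return m.capstone_score >= 0.92, warnings


passed, warnings = capstone_gate(RunMetrics(0.93, 300.0, 0, 0.95))
print(passed, warnings)
```

Only the boolean decides whether the phase closes; the warnings list is the input to the hardening pass.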

What this chapter does NOT cover

  • Inter-annotator agreement for the gold labels — irrelevant at §A13 scope; the table is canonical.
  • Adversarial eval — Phase 37 covers prompt-injection eval separately.
  • Held-out exam item generation — see docs/phase-41-learner-portal/theory/04-quizzes-and-exams.md for the portal's content pipeline.
  • Token-level evaluation (BLEU, ROUGE) — not used here; the §A13 scope is small enough for exact-match rubrics.

Reference

  • Liang et al., "Holistic Evaluation of Language Models (HELM)" (Stanford CRFM, 2022). The multi-axis evaluation pattern this rubric follows.
  • Eberle et al., "On the Use of Aggregate Scores in NLP Evaluation" (2022). Why per-axis breakdowns beat single aggregates.

Next: ../break/00-break-rubric-case-sensitive.md.