Break 00 — Capstone evaluator with a case-sensitive normalization bug; false negatives cascade

If you remove lower() from the normalizer, correct answers such as "Ate" or "WENT" score zero on correctness. The aggregate score falls by about half (0.95 to 0.47 in the run below), the tutor appears to have regressed, and the eval gate blocks the deployment because of a bug in the evaluator, not in the model.


What you'll do

Disable the .lower() step in the rubric's normalization function. Re-run the capstone exam against the unchanged tutor. Watch correct answers score zero. Diagnose by reading the score breakdown.

Step 1 — Locate the normalizer

src/minieval/normalize.py         # the rubric normalization helpers
src/minieval/score.py             # the per-axis scoring functions

(File names are approximate; these are the modules the exam runner uses.)

Step 2 — Introduce the bug

In src/minieval/normalize.py, the canonical pipeline is:

def normalize(s: str) -> str:
    s = s.strip()
    s = s.lower()
    s = s.rstrip(".")
    return s

Remove the lower():

def normalize(s: str) -> str:
    s = s.strip()
    # s = s.lower()      # <-- removed
    s = s.rstrip(".")
    return s

The function still runs. Strings still pass through. Trailing periods are still stripped. Only the case-folding is gone.
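To see the effect in isolation, the two versions can be compared side by side. This is a minimal sketch: normalize_fixed, normalize_broken, and contains are hypothetical names used only for illustration, and the substring check is an assumption about how score.py compares answers.

```python
def normalize_fixed(s: str) -> str:
    """Canonical pipeline from above: trim, case-fold, drop trailing periods."""
    return s.strip().lower().rstrip(".")


def normalize_broken(s: str) -> str:
    """Same pipeline with the lower() step removed."""
    return s.strip().rstrip(".")


def contains(normalize, got: str, expected: str) -> bool:
    """Hypothetical substring match on the normalized strings."""
    return normalize(expected) in normalize(got)


for got, expected in [("Ate", "ate"), ("I am Going to eat", "going to eat")]:
    print(
        repr(got),
        "fixed:", contains(normalize_fixed, got, expected),
        "broken:", contains(normalize_broken, got, expected),
    )
```

With the fixed pipeline both pairs match; with the broken one neither does, even though nothing about the answers themselves has changed.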

Step 3 — Run the exam

$ just eval-exam --exam phase-39-capstone
[eval] running 5 items...
[eval] e-39-01 'Conjugate eat past simple 3sg' expected="ate" got="Ate" score=0.05 [BREAKDOWN]
[eval]   correctness=0.0 spanish_gloss=0.0 schema=1.0 conciseness=1.0
[eval] e-39-02 'Pick simple-future' choice=1 expected=1 score=1.00
[eval] e-39-03 'Pick present-simple 3sg' expected=[0,2] got=[0,2] score=1.00
[eval] e-39-04 'Spanish for they have written' expected="escrito" got="han Escrito" score=0.15 [BREAKDOWN]
[eval]   correctness=1.0 spanish_gloss=0.0 schema=1.0 conciseness=1.0
[eval] e-39-05 'going to eat 1sg' expected="going to eat" got="I am Going to eat" score=0.15
[eval] aggregate=0.47
[eval] FAIL: aggregate 0.47 < gate 0.92

The tutor is fine; the evaluator is wrong. Items e-39-01, e-39-04, and e-39-05 all contain the right answer in the wrong case ("Ate" for "ate", "Escrito" for "escrito", "Going" for "going"), so the now case-sensitive substring match fails.
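The aggregate in the log is consistent with an unweighted mean of the five item scores. That averaging rule is an assumption (the runner's actual formula is not shown here), but the arithmetic can be checked in a few lines:

```python
# Per-item scores copied from the log above.
scores = {
    "e-39-01": 0.05,
    "e-39-02": 1.00,
    "e-39-03": 1.00,
    "e-39-04": 0.15,
    "e-39-05": 0.15,
}
GATE = 0.92  # gate threshold reported in the FAIL line

# Assumed aggregation rule: plain mean over items.
aggregate = sum(scores.values()) / len(scores)
print(f"aggregate={aggregate:.2f} gate={GATE} pass={aggregate >= GATE}")
```

Three near-zero items out of five are enough to drag the mean far below the gate, which is why a single normalizer bug looks like a large regression.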

Step 4 — Diagnose from the breakdown

This is the key teaching moment: the breakdown shows which axis is failing.

  • e-39-01: correctness=0.0, spanish_gloss=0.0, schema=1.0.
  • e-39-04: correctness=1.0, spanish_gloss=0.0, schema=1.0.

The pattern: schema is always 1.0 (the tutor's output is well-formed), while correctness and gloss drop on different items. A model regression would more plausibly pull correctness down across the board. Here the drops occur only on items where the output's case differs from the expected substring, which points at the matching logic, not the model.

Quick check: compare the raw got strings to the expected_contains substrings. Every failed match has a case mismatch.

That tells you the bug is in the normalizer, not the tutor.
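The quick check can be scripted. The sketch below uses the expected and got strings copied from the log; case_only_mismatch is a hypothetical helper, not part of minieval.

```python
# (item id, expected substring, raw tutor output), copied from the log above.
failed = [
    ("e-39-01", "ate", "Ate"),
    ("e-39-04", "escrito", "han Escrito"),
    ("e-39-05", "going to eat", "I am Going to eat"),
]


def case_only_mismatch(expected: str, got: str) -> bool:
    """True when the substring match fails as-is but succeeds after case-folding."""
    return expected not in got and expected.casefold() in got.casefold()


for item_id, expected, got in failed:
    print(item_id, "case-only mismatch:", case_only_mismatch(expected, got))
```

If every failing item comes back True, the answers differ from the ground truth only in case, and the matching logic is the first place to look.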

Step 5 — Record the break

learners/borja/phase-39/notes/breaks.md:

- bug-id: 39-01
  concept: evaluator normalization; false-negative cascade
  symptom: capstone aggregate drops from 0.95 to 0.47 with the same tutor.
           Pattern: correctness=0 on items where the model output's case
           differs from expected_contains. Schema axis always 1.0.
  hidden_cause: normalize.py removed the .lower() step; substring match is
                now case-sensitive; the tutor's natural capitalization fails
                to match the lowercased ground-truth tokens.
  hint_1: "Print normalize(got) and normalize(expected). Are they comparable?"
  hint_2: "Compare 'Ate' substring against 'ate'. What's missing?"
  hint_3: "Look at the normalize() pipeline. What's been removed?"
  fix_diff: uncomment the s = s.lower() line in normalize.py.

Step 6 — Apply the fix

Uncomment the lower() call. Re-run the exam:

$ just eval-exam --exam phase-39-capstone
[eval] running 5 items... [all 5 pass]
[eval] aggregate=0.95
[eval] PASS

The tutor was always correct. The fix is one line. The lesson is that the evaluator is also software — it has bugs, it deserves tests, and a "regression" in the exam may not be a regression in the model.
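One way to give the evaluator the tests it deserves is a small regression test that pins the case-folding behavior. This is a sketch in pytest style; the test names are hypothetical, and normalize is redefined locally from the canonical pipeline in Step 2 because the real import path is only approximate.

```python
def normalize(s: str) -> str:
    # Canonical pipeline from Step 2. In the real project this would be
    # imported from the normalizer module rather than redefined here.
    s = s.strip()
    s = s.lower()
    s = s.rstrip(".")
    return s


def test_normalize_is_case_insensitive():
    # Fails immediately if the lower() step is removed again.
    assert normalize("Ate") == normalize("ate")
    assert normalize("WENT") == "went"


def test_normalize_strips_whitespace_and_trailing_period():
    assert normalize("  Ate. ") == "ate"


if __name__ == "__main__":
    test_normalize_is_case_insensitive()
    test_normalize_strips_whitespace_and_trailing_period()
    print("normalizer tests passed")
```

With a test like this in place, removing lower() would fail a unit test long before it surfaced as a collapsed exam aggregate.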

Why this is the right /break for capstone

The Phase 39 capstone integrates everything. A representative failure mode here is exactly this: a small bug in a piece of evaluation infrastructure makes the whole system look broken when it isn't. Future-Borja, on a future project, must remember to diagnose from the breakdown before redoing the model.

The fix is trivial. The diagnostic discipline is not.

Hard rules respected

  • Single, instructive bug (one normalizer step removed).
  • Reversible in 1 line.
  • Observable: aggregate score collapses; breakdown reveals the pattern.
  • No security implication.
  • No test modified to mask the issue.

Next: when green, re-read ../theory/06-capstone-evaluation-rubric.md — specifically the "Normalization rules" section.