English · Español
Break 00 — Capstone evaluator with a case-sensitive normalization bug; false negatives cascade¶
🇪🇸 Si quitas el
lower()del normalizador, respuestas correctas como"Ate"o"WENT"puntúan cero por correctness. El score agregado cae ~30%, parece que el tutor regresó, y el eval-gate bloquea el despliegue por una bug del evaluador, no del modelo.
What you'll do¶
Disable the .lower() step in the rubric's normalization function. Re-run the capstone exam against the unchanged tutor. Watch correct answers score zero. Diagnose by reading the score breakdown.
Step 1 — Locate the normalizer¶
src/minieval/normalize.py # the rubric normalization helpers
src/minieval/score.py # the per-axis scoring functions
(Names approximate; the exam runner uses these.)
Step 2 — Introduce the bug¶
In src/minieval/normalize.py, the canonical pipeline is:
Remove the lower():
def normalize(s: str) -> str:
s = s.strip()
# s = s.lower() # <-- removed
s = s.rstrip(".")
return s
The function still runs. Strings still pass through. Trailing periods are still stripped. Only the case-folding is gone.
Step 3 — Run the exam¶
$ just eval-exam --exam phase-39-capstone
[eval] running 5 items...
[eval] e-39-01 'Conjugate eat past simple 3sg' expected="ate" got="Ate" score=0.05 [BREAKDOWN]
[eval] correctness=0.0 spanish_gloss=0.0 schema=1.0 conciseness=1.0
[eval] e-39-02 'Pick simple-future' choice=1 expected=1 score=1.00
[eval] e-39-03 'Pick present-simple 3sg' expected=[0,2] got=[0,2] score=1.00
[eval] e-39-04 'Spanish for they have written' expected="escrito" got="han Escrito" score=0.15 [BREAKDOWN]
[eval] correctness=1.0 spanish_gloss=0.0 schema=1.0 conciseness=1.0
[eval] e-39-05 'going to eat 1sg' expected="going to eat" got="I am Going to eat" score=0.15
[eval] aggregate=0.47
[eval] FAIL: aggregate 0.47 < gate 0.92
The tutor is fine. The evaluator is wrong. Items e-39-01, e-39-04, e-39-05 all have the right answer with the wrong case — "Ate" instead of "ate", "han Escrito" instead of "han escrito" — and the case-sensitive normalizer fails the substring match.
Step 4 — Diagnose from the breakdown¶
This is the key teaching moment: the breakdown shows which axis is failing.
- e-39-01:
correctness=0.0, spanish_gloss=0.0, schema=1.0. - e-39-04:
correctness=1.0, spanish_gloss=0.0, schema=1.0.
The pattern: schema is always 1.0 (the tutor's output is well-formed), correctness and gloss are dropping in different items. If it were a model regression, you'd expect correctness to drop consistently. The fact that it drops only on items where the substring case differs points at the matching logic, not the model.
Quick check: compare the raw got strings to the expected_contains substrings. Every failed match has a case mismatch.
That tells you: bug is in the normalizer, not the tutor.
Step 5 — Record the break¶
learners/borja/phase-39/notes/breaks.md:
- bug-id: 39-01
concept: evaluator normalization; false-negative cascade
symptom: capstone aggregate drops from 0.95 to 0.47 with the same tutor.
Pattern: correctness=0 on items where the model output's case
differs from expected_contains. Schema axis always 1.0.
hidden_cause: normalize.py removed the .lower() step; substring match is
now case-sensitive; the tutor's natural capitalization fails
to match the lowercased ground-truth tokens.
hint_1: "Print normalize(got) and normalize(expected). Are they comparable?"
hint_2: "Compare 'Ate' substring against 'ate'. What's missing?"
hint_3: "Look at the normalize() pipeline. What's been removed?"
fix_diff: uncomment the s = s.lower() line in normalize.py.
Step 6 — Apply the fix¶
Uncomment the lower() call. Re-run the exam:
$ just eval-exam --exam phase-39-capstone
[eval] running 5 items... [all 5 pass]
[eval] aggregate=0.95
[eval] PASS
The tutor was always correct. The fix is one line. The lesson is that the evaluator is also software — it has bugs, it deserves tests, and a "regression" in the exam may not be a regression in the model.
Why this is the right /break for capstone¶
The Phase 39 capstone integrates everything. A representative failure mode here is exactly this: a small bug in a piece of evaluation infrastructure makes the whole system look broken when it isn't. Future-Borja, on a future project, must remember to diagnose from the breakdown before redoing the model.
The fix is trivial. The diagnostic discipline is not.
Hard rules respected¶
- Single, instructive bug (one normalizer step removed).
- Reversible in 1 line.
- Observable: aggregate score collapses; breakdown reveals the pattern.
- No security implication.
- No test modified to mask the issue.
Next: when green, re-read ../theory/06-capstone-evaluation-rubric.md — specifically the "Normalization rules" section.