Lab 03 — Generate REPORT.md and compare two checkpoints side-by-side
Goal: produce a human-readable Markdown report for one checkpoint, then build a comparator that diffs two reports and highlights significant changes. End with a one-paragraph written interpretation per checkpoint.
Estimated time: 90-120 minutes.
Prereq: Labs 00-02 done; results.json, per_slice.csv, reliability.png, and adversarial_by_category.png exist under experiments/20-eval-report/<checkpoint_name>/ for both the Phase-18 final checkpoint and the Phase-19 best-val checkpoint.
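A quick way to confirm the prerequisite before starting is to list whatever is missing. This is a minimal sketch: the helper name `missing_artifacts` is hypothetical, and the checkpoint directory names are whatever you used in Labs 00-02.

```python
from pathlib import Path
from typing import Iterable, List

# The four artifacts named in the prerequisite above.
REQUIRED_ARTIFACTS = (
    "results.json",
    "per_slice.csv",
    "reliability.png",
    "adversarial_by_category.png",
)


def missing_artifacts(root: Path, checkpoint_names: Iterable[str]) -> List[Path]:
    """List required files that are absent under root/<checkpoint_name>/."""
    return [
        root / name / artifact
        for name in checkpoint_names
        for artifact in REQUIRED_ARTIFACTS
        if not (root / name / artifact).is_file()
    ]
```

Call it with `Path("experiments/20-eval-report")` and your two checkpoint directory names; an empty list means the prerequisite is met.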
What you produce
New modules and tests:
- src/eval/report.py: render results.json → REPORT.md.
- src/eval/compare.py: diff two results.json files; render COMPARE.md.
- tests/eval/test_report.py: golden-file tests on a fixture results.json.
Output artifacts:
- experiments/20-eval-report/<checkpoint_name>/REPORT.md (one per checkpoint).
- experiments/20-eval-report/COMPARE.md (one comparing the two).
- learners/borja/phase-20/reflections.md (Borja's written interpretation).
The REPORT.md template
Sections, in order:
# Eval Report — <checkpoint_name>
**Checkpoint hash:** <sha256>
**Training step:** <step>
**Parent manifest:** <hash>
**Probe set version:** <hash>
**Eval date:** <iso8601>
**Seed:** <int>
## 1. Headline
| Metric | Value | 95% CI |
|---|---|---|
| PPL (test, overall) | 7.42 | — |
| PPL (test, EN) | 6.81 | — |
| PPL (test, ES) | 8.93 | — |
| Aggregate accuracy | 0.78 | [0.66, 0.87] |
| ECE | 0.092 | — |
| Brier | 0.171 | — |
| Adversarial accuracy | 0.55 | [0.32, 0.76] |
## 2. Per-slice accuracy
### By language
<table>
### By regularity
<table>
### By tense
<table>
### By person
<table>
### By verb (counts only — small N per cell)
<table>
## 3. Reliability diagram

## 4. Adversarial slice
| Category | n | Accuracy | 95% CI |
|---|---|---|---|
| Over-regularization | 5 | 0.40 | [0.12, 0.77] |
| Wrong-person agreement | 4 | 0.75 | [0.30, 0.95] |
| ...

## 5. Confusion matrix
| | pred CORRECT | pred INCORRECT | pred AMBIGUOUS |
|---|---|---|---|
| true CORRECT | 28 | 4 | 2 |
| true INCORRECT | 6 | 18 | 0 |
| true AMBIGUOUS | 1 | 0 | 1 |
## 6. Generation sub-eval (pass@k)
| Prompt cell | pass@1 | pass@10 |
|---|---|---|
## 7. Interpretation
<one paragraph; Borja writes>
## 8. Caveats
- Probe set N=60 → per-cell counts are small → wide CIs.
- ES tokenizer was Phase-11 EN-biased; expect inflated ES PPL.
- Adversarial category cells with n<3 are flagged but reported.
TODOs
Block A — src/eval/report.py
from pathlib import Path


def render_report(results_path: Path, out_path: Path) -> None:
    """Read results.json + per_slice.csv; write REPORT.md."""
    ...
- Loads results.json and per_slice.csv; embeds PNG references.
- Renders each table as Markdown.
- Inserts a placeholder block for §7 Interpretation with the TODO marker <!-- INTERPRETATION: Borja writes 1-2 paragraphs here. -->.
- Writes to out_path (must be inside the per-checkpoint directory).
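The table-rendering step can be sketched as below. This is an illustration, not the reference solution: the helper names (`md_table`, `fmt_ci`, `render_headline`) and the assumed shape of the headline data (`{"metric": {"value": ..., "ci": [lo, hi] or None}}`) are hypothetical, since the lab does not fix the results.json schema here.

```python
from typing import Optional, Sequence


def md_table(headers: Sequence[str], rows: Sequence[Sequence[str]]) -> str:
    """Render a pipe table in the same style as the REPORT.md template."""
    lines = ["| " + " | ".join(headers) + " |", "|" + "---|" * len(headers)]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


def fmt_ci(ci: Optional[Sequence[float]]) -> str:
    """Format a [lo, hi] pair; metrics without a CI get the template's dash."""
    if ci is None:
        return "\u2014"  # the dash character used in the template's CI column
    return f"[{ci[0]:.2f}, {ci[1]:.2f}]"


def render_headline(headline: dict) -> str:
    # Assumed schema (hypothetical): {"name": {"value": float, "ci": [lo, hi] | None}}
    rows = [
        [name, format(m["value"], ".3g"), fmt_ci(m.get("ci"))]
        for name, m in headline.items()
    ]
    return "## 1. Headline\n" + md_table(["Metric", "Value", "95% CI"], rows)
```

The `.3g` format reproduces the template's example values (7.42, 0.78, 0.092) but drops trailing zeros, so 0.70 renders as 0.7; pick a fixed format per metric if that matters for the golden file.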
Block B — src/eval/compare.py
from pathlib import Path


def render_compare(results_a: Path, results_b: Path, out_path: Path) -> None:
    """Diff two results.json files; flag significant differences."""
    ...
- Side-by-side tables for every metric in the headline.
- Significance flag per row: compute a Wilson CI for each checkpoint and check whether the two intervals overlap. If they overlap, mark (within noise); if not, mark ▲ or ▼.
- Per-slice diff: highlight only rows where the direction of change is opposite to the aggregate change, or where the magnitude exceeds 10 percentage points.
- Adversarial-category diff: same format.
- Reliability diagrams: both PNGs embedded side-by-side via HTML in Markdown (<table><tr><td>...</td></tr></table>).
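The significance flag and the per-slice highlight rule can be sketched as follows. The function names are hypothetical, and the sketch assumes each accuracy row can be recovered as a count of correct items out of n (rows without a CI in the headline, such as PPL, ECE, and Brier, are outside its scope). It implements only the CI-overlap rule from the list above.

```python
import math
from typing import Tuple


def wilson_ci(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def significance_flag(correct_a: int, n_a: int, correct_b: int, n_b: int) -> str:
    """Flag the change from checkpoint A to checkpoint B by CI overlap."""
    lo_a, hi_a = wilson_ci(correct_a, n_a)
    lo_b, hi_b = wilson_ci(correct_b, n_b)
    if lo_b > hi_a:
        return "▲"  # B is higher and the intervals do not overlap
    if hi_b < lo_a:
        return "▼"  # B is lower and the intervals do not overlap
    return "(within noise)"


def highlight_slice(slice_delta: float, aggregate_delta: float,
                    threshold: float = 0.10) -> bool:
    """Per-slice rule: opposite direction to the aggregate, or more than 10pp."""
    opposite = slice_delta * aggregate_delta < 0
    return opposite or abs(slice_delta) > threshold
```

One consequence worth knowing when choosing test fixtures: at N=60 the intervals are wide enough that a 20pp drop does not always clear the non-overlap bar. Under this sketch, 47/60 → 35/60 still comes out as (within noise), while 57/60 → 45/60 yields ▼.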
Block C — tests/eval/test_report.py
- test_renders_minimal_report: given a small fixture results.json, render_report emits a REPORT.md containing the required section headers.
- test_compare_within_noise: given two results.json files differing only by 1pp in accuracy with N=60, the comparison marks the diff as (within noise).
- test_compare_significant_drop: given two results.json files where accuracy drops by 20pp, the comparison emits ▼.
- test_interpretation_placeholder_present: REPORT.md contains the TODO marker.
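For the header check in the first test, one option is a small helper that reports which template sections are missing. The header strings are copied from the REPORT.md template above; the helper name `missing_headers` is hypothetical.

```python
from typing import List

# Section headers taken from the REPORT.md template in this lab.
REQUIRED_HEADERS = [
    "## 1. Headline",
    "## 2. Per-slice accuracy",
    "## 3. Reliability diagram",
    "## 4. Adversarial slice",
    "## 5. Confusion matrix",
    "## 6. Generation sub-eval (pass@k)",
    "## 7. Interpretation",
    "## 8. Caveats",
]


def missing_headers(report_text: str) -> List[str]:
    """Return required headers that do not appear as a full line."""
    lines = {line.strip() for line in report_text.splitlines()}
    return [h for h in REQUIRED_HEADERS if h not in lines]
```

Inside the test, asserting `missing_headers(out_path.read_text()) == []` gives a failure message that names the absent sections, which is easier to debug than a single substring check.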
Block D — write learners/borja/phase-20/reflections.md
In your own words (300-500 words), answer:
- What does this model do well? Cite specific slices.
- What does it do poorly? Cite specific slices and adversarial categories.
- Why? Hypothesize the cause from corpus or training-dynamics signals (Phase 12 corpus stats, Phase 19 loss curves). Examples:
    - "The model is 60% on irregular past-simple vs 92% on regular past-simple. I suspect Phase 12's corpus is dominated by regular forms; the irregular forms appear ~3× less often, per data/processed/stats.json."
    - "ECE is 0.18 on the Phase-19 best-val checkpoint vs 0.09 on the Phase-18 final. The best-val was selected against val loss alone, which doesn't penalize over-confidence; Phase 18's longer training led to better-calibrated outputs even at slightly higher val loss."
- What would I change? Don't fix it here — Phase 20 measures, doesn't train. List the items as TODOs for future phases (Phase 28 LoRA, Phase 12 corpus regen, etc.).
Block E — also fill the §7 interpretation in each REPORT.md
The placeholder must be replaced. The interpretation is the deliverable; numbers are scaffolding for the sentence.
Constraints
- Markdown only. No HTML hacks except the two-PNG-side-by-side table (acceptable because mkdocs renders it).
- REPORT.md fits in one screen-worth of headers. Use collapsible details if necessary, but the headlines must be visible without scrolling.
- Borja writes the prose. Claude can draft, but the reflection in learners/borja/phase-20/reflections.md is unmistakably Borja's, written in the first person.
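The one permitted HTML fragment, the side-by-side PNG table, can be generated with a helper like the one below. This is a sketch: the function name is hypothetical, and the image paths assume COMPARE.md sits in experiments/20-eval-report/ next to the per-checkpoint directories, as listed under the output artifacts.

```python
from html import escape


def side_by_side_png(name_a: str, name_b: str, png: str = "reliability.png") -> str:
    """HTML table with one PNG per checkpoint, paths relative to COMPARE.md."""
    cells = "".join(
        f'<td><img src="{escape(name, quote=True)}/{png}" '
        f'alt="{escape(name, quote=True)} {png}"></td>'
        for name in (name_a, name_b)
    )
    return f"<table><tr>{cells}</tr></table>"
```

Keeping this in a single function makes it easy to confirm that no other raw HTML ends up in the generated Markdown.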
Stop conditions
Done when:
- pytest tests/eval/test_report.py -v passes.
- Both checkpoints have a populated REPORT.md (no remaining TODO markers).
- experiments/20-eval-report/COMPARE.md exists and highlights at least one significant difference (or explicitly states "no significant differences detected" with reasoning).
- learners/borja/phase-20/reflections.md is at least 300 words and identifies the three weakest slices, with hypothesized causes.
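Two of these conditions are mechanical and can be checked with a few lines of Python. The sketch below is an assumption about how you might automate it, not part of the lab's required code; `unmet_conditions` is a hypothetical name, and the word count is a plain whitespace split.

```python
from typing import Iterable, List

# Opening of the TODO marker that Block A inserts into section 7.
MARKER_PREFIX = "<!-- INTERPRETATION:"
MIN_REFLECTION_WORDS = 300


def word_count(text: str) -> int:
    """Whitespace-separated token count; a rough proxy for words."""
    return len(text.split())


def unmet_conditions(report_texts: Iterable[str], reflections_text: str) -> List[str]:
    """Return human-readable descriptions of the checks that fail."""
    problems = []
    for i, text in enumerate(report_texts):
        if MARKER_PREFIX in text:
            problems.append(f"report {i}: interpretation placeholder still present")
    n = word_count(reflections_text)
    if n < MIN_REFLECTION_WORDS:
        problems.append(f"reflections: {n} words, need at least {MIN_REFLECTION_WORDS}")
    return problems
```

Whether the reflection actually identifies the three weakest slices with plausible causes still needs a human read.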
Pitfalls
- Reporting averages where slices disagree. If EN is at 0.85 and ES at 0.55, the aggregate (~0.70) is the least informative number. Lead with the slice table, not the aggregate.
- Significance-flagging too eagerly. At N=60, a 5pp difference is within noise. Only flag differences of ≥10pp whose CIs do not overlap; otherwise readers get used to ignoring the arrows.
- Filling §7 with metric restatements. "PPL is 7.42 and accuracy is 0.78" — that's the table. The interpretation says what those numbers imply. Force yourself to write at least one sentence that ties the model's behavior to a downstream concern (Phase 32 tutor agent, deployment risk, corpus gap).
- Forgetting the caveats. ES tokenizer bias, small N per cell, probe set's hand-curation bias — these all bound the report's validity. List them.
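To see why small differences at N=60 count as noise, it helps to work the interval by hand. The worked example below assumes the headline CI is a Wilson score interval (the interval type Block B names) and uses the template's confusion matrix, whose diagonal gives 28 + 18 + 1 = 47 correct out of 60.

```latex
\[
\mathrm{CI}_{\text{Wilson}} \;=\;
\frac{\hat{p} + \dfrac{z^{2}}{2n} \;\pm\; z\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} + \dfrac{z^{2}}{4n^{2}}}}
     {1 + \dfrac{z^{2}}{n}}
\]
\[
\hat{p} = \tfrac{47}{60} \approx 0.783,\quad n = 60,\quad z = 1.96
\;\Longrightarrow\;
\text{center} \approx 0.766,\quad
\text{half-width} \approx 0.102,\quad
\mathrm{CI} \approx [0.66,\ 0.87]
\]
```

This reproduces the [0.66, 0.87] shown for aggregate accuracy in the template. With a half-width of roughly 10pp, a 5pp shift moves the point estimate only about halfway to the edge of its own interval.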
When to consult solutions/
After all four tests pass and both REPORTs + COMPARE are populated. The solution at solutions/03-report-ref.md (written at phase open) shows an example interpretation paragraph that's specific, hypothesizes causes, and points to a future fix without trying to fix it in Phase 20.
End of Phase 20 labs. Time to write PHASE_20_REPORT.md (the phase-level report, not the eval REPORT.md) and prep for Phase 21.
Next: Phase 21 — Inference Internals & Sampling.