Lab 03 — Generate REPORT.md and compare two checkpoints side-by-side

Goal: produce a human-readable Markdown report for one checkpoint, then build a comparator that diffs two reports and highlights significant changes. End with a one-paragraph written interpretation per checkpoint.

Estimated time: 90-120 minutes.

Prerequisites: Labs 00-02 are complete, and results.json, per_slice.csv, reliability.png, and adversarial_by_category.png exist under experiments/20-eval-report/<checkpoint_name>/ for both the Phase-18 final checkpoint and the Phase-19 best-val checkpoint.

What you produce

New modules and tests:

src/eval/report.py — render results.json → REPORT.md.
src/eval/compare.py — diff two results.json files; render COMPARE.md.
tests/eval/test_report.py — golden-file tests on a fixture results.json.

Output artifacts:

experiments/20-eval-report/<checkpoint_name>/REPORT.md (one per checkpoint).
experiments/20-eval-report/COMPARE.md (one comparing the two).
learners/borja/phase-20/reflections.md (Borja's written interpretation).

The REPORT.md template

Sections, in order:

# Eval Report — <checkpoint_name>

**Checkpoint hash:** <sha256>
**Training step:** <step>
**Parent manifest:** <hash>
**Probe set version:** <hash>
**Eval date:** <iso8601>
**Seed:** <int>

## 1. Headline

| Metric | Value | 95% CI |
|---|---|---|
| PPL (test, overall) | 7.42 | — |
| PPL (test, EN) | 6.81 | — |
| PPL (test, ES) | 8.93 | — |
| Aggregate accuracy | 0.78 | [0.66, 0.87] |
| ECE | 0.092 | — |
| Brier | 0.171 | — |
| Adversarial accuracy | 0.55 | [0.32, 0.76] |

## 2. Per-slice accuracy

### By language
<table>

### By regularity
<table>

### By tense
<table>

### By person
<table>

### By verb (counts only — small N per cell)
<table>

## 3. Reliability diagram

![reliability](reliability.png)

## 4. Adversarial slice

| Category | n | Accuracy | 95% CI |
|---|---|---|---|
| Over-regularization | 5 | 0.40 | [0.12, 0.77] |
| Wrong-person agreement | 4 | 0.75 | [0.30, 0.95] |
| ...

![adversarial](adversarial_by_category.png)

## 5. Confusion matrix

|              | pred CORRECT | pred INCORRECT | pred AMBIGUOUS |
|---|---|---|---|
| true CORRECT | 28 | 4 | 2 |
| true INCORRECT | 6 | 18 | 0 |
| true AMBIGUOUS | 1 | 0 | 1 |

## 6. Generation sub-eval (pass@k)

| Prompt cell | pass@1 | pass@10 |
|---|---|---|

## 7. Interpretation

<one paragraph; Borja writes>

## 8. Caveats

- Probe set N=60 → per-cell counts are small → wide CIs.
- ES tokenizer was Phase-11 EN-biased; expect inflated ES PPL.
- Adversarial category cells with n<3 are flagged but reported.
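
The 95% CI values shown in the template are consistent with a Wilson score interval at z = 1.96 computed from raw counts. The following is a minimal sketch under that assumption; the helper name `wilson_ci` is hypothetical and not part of the lab's codebase.

```python
import math


def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for k successes out of n trials."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Diagonal of the example confusion matrix in section 5:
# 28 + 18 + 1 = 47 correct out of 60 probes.
correct, total = 28 + 18 + 1, 60
lo, hi = wilson_ci(correct, total)
headline = f"| Aggregate accuracy | {correct / total:.2f} | [{lo:.2f}, {hi:.2f}] |"

# Small adversarial cells give much wider intervals.
over_reg = wilson_ci(2, 5)      # accuracy 0.40 with n=5
wrong_person = wilson_ci(3, 4)  # accuracy 0.75 with n=4
```

With n=5 the interval spans most of the unit range, which is the point of the first caveat: per-cell numbers at this probe-set size are indicative rather than conclusive.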

TODOs

Block A — `src/eval/report.py`

from pathlib import Path

def render_report(results_path: Path, out_path: Path) -> None:
    """Read results.json + per_slice.csv; write REPORT.md."""
    ...

Loads results.json and per_slice.csv, and embeds references to the PNGs.
Renders each table as Markdown.
Inserts a placeholder block for §7 Interpretation with a TODO marker.
Writes to out_path (must be inside the per-checkpoint directory).
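
One possible shape for the table rendering and the placeholder is sketched below. The results.json schema (`results["headline"]` as a list of metric records), the helper names, and the marker text are all assumptions made for illustration; the lab's actual schema and TODO marker may differ.

```python
# Hypothetical marker text; the lab's actual TODO marker is not shown here.
TODO_MARKER = "TODO: write interpretation"
NO_CI = "\u2014"  # the dash the template shows when no interval is reported


def md_table(headers: list[str], rows: list[list[str]]) -> str:
    """Render a Markdown table in the same style as the template."""
    lines = [
        "| " + " | ".join(headers) + " |",
        "|" + "---|" * len(headers),
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


def fmt_ci(ci) -> str:
    """Format a [lo, hi] pair, or the template's dash when ci is missing."""
    return f"[{ci[0]:.2f}, {ci[1]:.2f}]" if ci else NO_CI


def render_headline(results: dict) -> str:
    # Assumed schema: each entry is
    # {"metric": str, "value": float, "ci": [lo, hi] or None}.
    rows = [
        [m["metric"], f"{m['value']:g}", fmt_ci(m.get("ci"))]
        for m in results["headline"]
    ]
    return "## 1. Headline\n\n" + md_table(["Metric", "Value", "95% CI"], rows)


def interpretation_placeholder() -> str:
    """Section 7 stub; an HTML comment keeps the marker out of rendered output."""
    return f"## 7. Interpretation\n\n<!-- {TODO_MARKER} -->"


results = {
    "headline": [
        {"metric": "PPL (test, overall)", "value": 7.42, "ci": None},
        {"metric": "Aggregate accuracy", "value": 0.78, "ci": [0.66, 0.87]},
    ]
}
section = render_headline(results)
placeholder = interpretation_placeholder()
```

Keeping the marker in a single constant means the placeholder test and the stop-condition check can search for the same string.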

Block B — `src/eval/compare.py`

from pathlib import Path

def render_compare(results_a: Path, results_b: Path, out_path: Path) -> None:
    """Diff two results.json files; flag significant differences."""
    ...

Side-by-side tables for every metric in the headline.
Significance flag per row: compute a Wilson CI for each checkpoint. If the two CIs overlap, mark the row (within noise); if not, mark ▲ or ▼ according to the direction of change.
Per-slice diff: highlight only rows where the direction of change is opposite to the aggregate change, or where the magnitude exceeds 10 percentage points.
Adversarial-category diff: same format.
Reliability diagrams: both PNGs embedded side-by-side via HTML in Markdown (<table><tr><td>...</td></tr></table>).
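
The flagging and highlighting rules above can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the arrow describes the change from checkpoint A to checkpoint B, and the image paths are taken to be relative to COMPARE.md.

```python
import math


def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for k successes out of n trials."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def flag_row(k_a: int, n_a: int, k_b: int, n_b: int) -> str:
    """'(within noise)' if the two CIs overlap, else an arrow for A -> B."""
    lo_a, hi_a = wilson_ci(k_a, n_a)
    lo_b, hi_b = wilson_ci(k_b, n_b)
    if lo_a <= hi_b and lo_b <= hi_a:
        return "(within noise)"
    return "▲" if k_b / n_b > k_a / n_a else "▼"


def highlight_slice(delta_slice: float, delta_aggregate: float,
                    threshold: float = 0.10) -> bool:
    """True when a slice moves against the aggregate or by more than threshold."""
    moves_against = delta_slice * delta_aggregate < 0
    return moves_against or abs(delta_slice) > threshold


def side_by_side(img_a: str, img_b: str) -> str:
    """Two images in one HTML table row, as allowed for the reliability diagrams."""
    return (
        f'<table><tr><td><img src="{img_a}"></td>'
        f'<td><img src="{img_b}"></td></tr></table>'
    )
```

For example, `flag_row(57, 60, 45, 60)` compares 0.95 against 0.75 at N=60; the intervals do not overlap, so the row gets an arrow rather than the noise label.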

Block C — `tests/eval/test_report.py`

test_renders_minimal_report — given a small fixture results.json, render_report emits a REPORT.md containing the required section headers.
test_compare_within_noise — given two results.json differing only by 1pp in accuracy with N=60, the comparison marks the diff as (within noise).
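test_compare_significant_drop — given two results.json where accuracy drops by 20pp and the fixture counts are chosen so that the two Wilson CIs do not overlap (at N=60, a drop from 0.78 to 0.58 still overlaps), the comparison emits ▼.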
test_interpretation_placeholder_present — REPORT.md contains the TODO marker.
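
When picking fixture values for the two comparison tests, it helps to check the intervals first, because a 20pp gap at N=60 does not always produce non-overlapping Wilson CIs. The sketch below assumes the fixtures are expressed as correct-answer counts out of 60; `cis_overlap` is a hypothetical helper.

```python
import math


def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for k successes out of n trials."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def cis_overlap(k_a: int, k_b: int, n: int = 60) -> bool:
    """True when the Wilson CIs for k_a/n and k_b/n share any point."""
    lo_a, hi_a = wilson_ci(k_a, n)
    lo_b, hi_b = wilson_ci(k_b, n)
    return lo_a <= hi_b and lo_b <= hi_a


# One item of 60 is about 1.7pp, the closest count-based stand-in for "1pp".
one_item_apart = cis_overlap(47, 46)

# Both pairs below are exactly 12 items (20pp) apart.
drop_from_078 = cis_overlap(47, 35)  # 0.78 -> 0.58
drop_from_095 = cis_overlap(57, 45)  # 0.95 -> 0.75
```

Under these assumptions only the second 20pp pair has disjoint intervals, so a fixture near 0.95 is a safer choice for the significant-drop test than one near the template's 0.78.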

Block D — write `learners/borja/phase-20/reflections.md`

In your own words (300-500 words), answer:

What does this model do well? Cite specific slices.
What does it do poorly? Cite specific slices and adversarial categories.
Why? Hypothesize the cause from corpus or training-dynamics signals (Phase 12 corpus stats, Phase 19 loss curves). Examples:
"The model is 60% on irregular past-simple vs 92% on regular past-simple. I suspect Phase 12's corpus is dominated by regular forms; the irregular forms appear ~3× less per data/processed/stats.json."
"ECE is 0.18 on the Phase-19 best-val checkpoint vs 0.09 on the Phase-18 final. The best-val was selected against val loss alone, which doesn't penalize over-confidence — Phase 18's longer training led to better-calibrated outputs even at slightly higher val loss."
What would I change? Don't fix it here — Phase 20 measures, doesn't train. List the items as TODOs for future phases (Phase 28 LoRA, Phase 12 corpus regen, etc.).

Block E — also fill the §7 interpretation in each REPORT.md

The placeholder must be replaced. The interpretation is the deliverable; numbers are scaffolding for the sentence.

Constraints

Markdown only. No HTML hacks except the two-PNG-side-by-side table (acceptable because mkdocs renders it).
REPORT.md fits in one screen-worth of headers. Use collapsible details if necessary, but the headlines must be visible without scrolling.
Borja writes the prose. Claude can draft, but the reflection in learners/borja/phase-20/reflections.md is unmistakably Borja's — first-person.

Stop conditions

Done when:

pytest tests/eval/test_report.py -v passes.
Both checkpoints have a populated REPORT.md (no remaining TODO markers).
experiments/20-eval-report/COMPARE.md exists and highlights at least one significant difference (or explicitly states "no significant differences detected" with reasoning).
learners/borja/phase-20/reflections.md is at least 300 words and identifies the three weakest slices, with hypothesized causes.
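
The mechanical parts of these conditions can be checked with a short script. This is a sketch only: it assumes any literal "TODO" counts as an unfilled marker and that whitespace-separated tokens are an acceptable word count. It cannot judge whether the reflection names the three weakest slices.

```python
from pathlib import Path


def count_words(text: str) -> int:
    """Whitespace-delimited word count."""
    return len(text.split())


def has_todo(text: str) -> bool:
    # Assumption: any literal "TODO" means the placeholder was not replaced.
    return "TODO" in text


def check_stop_conditions(report_paths: list[Path], reflections_path: Path,
                          min_words: int = 300) -> list[str]:
    """Return a list of unmet conditions; an empty list means all checks pass."""
    problems = []
    for path in report_paths:
        if not path.exists():
            problems.append(f"missing: {path}")
        elif has_todo(path.read_text(encoding="utf-8")):
            problems.append(f"TODO marker left in {path}")
    if not reflections_path.exists():
        problems.append(f"missing: {reflections_path}")
    else:
        words = count_words(reflections_path.read_text(encoding="utf-8"))
        if words < min_words:
            problems.append(f"reflections has {words} words, need {min_words}")
    return problems
```

Calling it with the two REPORT.md paths and the reflections path prints nothing to fix when the returned list is empty.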

Pitfalls

Reporting averages where slices disagree. If EN is at 0.85 and ES at 0.55, the aggregate (~0.70 with equal slice sizes) is the least informative number. Lead with the slice table, not the aggregate.
Significance-flagging too eagerly. At N=60, a 5pp difference is within noise. Only flag ≥10pp differences with non-overlapping CIs. Otherwise readers get used to ignoring the arrows.
Filling §7 with metric restatements. "PPL is 7.42 and accuracy is 0.78" — that's the table. The interpretation says what those numbers imply. Force yourself to write at least one sentence that ties the model's behavior to a downstream concern (Phase 32 tutor agent, deployment risk, corpus gap).
Forgetting the caveats. ES tokenizer bias, small N per cell, probe set's hand-curation bias — these all bound the report's validity. List them.
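
The first pitfall can be made concrete with a small calculation. The counts below are hypothetical and chosen only to reproduce the 0.85 and 0.55 slice accuracies from the example; they are not taken from any real results.json.

```python
def aggregate(slices: dict[str, tuple[int, int]]) -> float:
    """Pooled accuracy over slices given as (correct, total) pairs."""
    correct = sum(c for c, _ in slices.values())
    total = sum(t for _, t in slices.values())
    return correct / total


def spread(slices: dict[str, tuple[int, int]]) -> float:
    """Gap between the best and worst slice accuracy."""
    accs = [c / t for c, t in slices.values()]
    return max(accs) - min(accs)


# Same per-slice accuracies (EN 0.85, ES 0.55), different slice sizes.
equal_sizes = {"EN": (17, 20), "ES": (11, 20)}
skewed_sizes = {"EN": (34, 40), "ES": (11, 20)}

agg_equal = round(aggregate(equal_sizes), 2)    # 28 / 40
agg_skewed = round(aggregate(skewed_sizes), 2)  # 45 / 60
gap = round(spread(equal_sizes), 2)
```

The aggregate moves with the slice mix while the 30pp gap between languages stays the same, which is why the slice table carries more information than the pooled number.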

When to consult `solutions/`

After all four tests pass and both REPORTs + COMPARE are populated. The solution at solutions/03-report-ref.md (written at phase open) shows an example interpretation paragraph that's specific, hypothesizes causes, and points to a future fix without trying to fix it in Phase 20.

End of Phase 20 labs. Time to write PHASE_20_REPORT.md (the phase-level report, not the eval REPORT.md) and prep for Phase 21.

Next: Phase 21 — Inference Internals & Sampling.