Lab 02 — Calibration metrics and the adversarial slice¶
Goal: extend the harness to compute ECE, Brier score, the reliability diagram, and adversarial slice scores broken down by trick category.
Estimated time: 90-120 minutes.
Prereq: Lab 01 done (harness emits results.json with per-probe predictions and confidences). data/eval/adversarial.jsonl has ≥ 20 hand-curated adversarial probes.
What you produce¶
Extensions:
- src/eval/calibration.py: ECE, Brier, reliability-diagram helpers.
- src/eval/adversarial.py: adversarial slice aggregator.
- tests/eval/test_calibration.py: synthetic tests on known-calibrated and known-miscalibrated predictions.
New artifacts (added to experiments/20-eval-report/<checkpoint_name>/):
- reliability.png: reliability diagram (predicted confidence on the x-axis, empirical accuracy on the y-axis, diagonal reference).
- adversarial_by_category.csv: accuracy on each trick category.
- adversarial_by_category.png: bar chart.
- results.json extended with ece, brier, adversarial.overall, adversarial.by_category.
TODOs¶
Block A — src/eval/calibration.py¶
import numpy as np


def ece(confidences: np.ndarray,  # shape (N,), in [0, 1]
        correct: np.ndarray,      # shape (N,), 0 or 1
        n_bins: int = 10) -> float:
    """Expected Calibration Error with equal-width bins."""
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    N = len(confidences)
    total = 0.0
    for m in range(n_bins):
        lo, hi = bin_edges[m], bin_edges[m + 1]
        # include right edge in the last bin
        if m == n_bins - 1:
            mask = (confidences >= lo) & (confidences <= hi)
        else:
            mask = (confidences >= lo) & (confidences < hi)
        if mask.sum() == 0:
            continue  # empty bins contribute nothing
        bin_acc = correct[mask].mean()
        bin_conf = confidences[mask].mean()
        # weight each bin's |accuracy - confidence| gap by its share of samples
        total += (mask.sum() / N) * abs(bin_acc - bin_conf)
    return float(total)


def brier(confidences: np.ndarray, correct: np.ndarray) -> float:
    """Binary Brier score: mean squared error between confidence and correctness."""
    return float(np.mean((confidences - correct) ** 2))


def brier_multiclass(probs: np.ndarray,   # shape (N, C)
                     labels: np.ndarray,  # shape (N,), integer in [0, C)
                     ) -> float:
    """Multi-class Brier: per-sample sum of squared deviations across classes,
    divided by C, then averaged over samples."""
    N, C = probs.shape
    one_hot = np.zeros_like(probs)
    one_hot[np.arange(N), labels] = 1.0
    return float(np.mean(np.sum((probs - one_hot) ** 2, axis=1) / C))


def reliability_diagram(confidences: np.ndarray,
                        correct: np.ndarray,
                        n_bins: int = 10,
                        out_path: str = "reliability.png") -> None:
    """Plot bin-mean-confidence vs bin-mean-accuracy. Diagonal reference."""
    import matplotlib.pyplot as plt
    # ... compute per-bin stats, plot
- ece returns 0 on a perfectly calibrated synthetic dataset.
- brier returns 0 on perfect predictions (conf=1.0 for correct=1, conf=0.0 for correct=0).
- reliability_diagram saves a PNG with: bar plot of per-bin counts (background), scatter of (conf, acc) per bin, and the y=x diagonal.
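The per-bin statistics that reliability_diagram needs can be computed once and reused for the plot and for reporting. The following is a minimal sketch, not the reference solution: the helper name bin_stats is hypothetical, and it assumes the same equal-width binning as ece above (right edge included in the last bin). Empty bins are kept and marked with NaN so they stay visible downstream.

```python
import numpy as np


def bin_stats(confidences, correct, n_bins=10):
    """Per-bin count, mean confidence, and accuracy; empty bins are kept as NaN.

    Hypothetical helper (not part of the lab skeleton).
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Digitize against interior edges only, so confidence 1.0 lands in the last bin.
    idx = np.digitize(confidences, edges[1:-1], right=False)
    counts = np.bincount(idx, minlength=n_bins)
    conf_sum = np.bincount(idx, weights=confidences, minlength=n_bins)
    acc_sum = np.bincount(idx, weights=correct, minlength=n_bins)
    # 0/0 in empty bins is expected; silence the warning and store NaN instead.
    with np.errstate(invalid="ignore", divide="ignore"):
        mean_conf = np.where(counts > 0, conf_sum / counts, np.nan)
        mean_acc = np.where(counts > 0, acc_sum / counts, np.nan)
    return counts, mean_conf, mean_acc


counts, mean_conf, mean_acc = bin_stats([0.05, 0.95, 0.95, 1.0], [0, 1, 0, 1])
min_bin_count = int(counts.min())  # candidate value for the JSON report
```

With these arrays, the bar layer can be drawn from counts and the scatter layer from (mean_conf, mean_acc); how empty bins are rendered is left to the plotting code.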
Block B — src/eval/adversarial.py¶
def aggregate_adversarial(probes, predictions) -> dict:
    """Returns:
    {
        "overall": {"n": int, "correct": int, "accuracy": float, "wilson": (lo, hi)},
        "by_category": {
            "over_regularization": {...},
            "wrong_person_agreement": {...},
            "wrong_tense_marker": {...},
            "auxiliary_mismatch": {...},
            "en_es_mismatch": {...},
            "plural_or_oos": {...},
        }
    }
    """
    ...
- Categories are derived from probe.reason_code (the trick tag).
- Cells with n < 3 are reported but flagged with low_sample: true.
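One way to fill in the aggregator is sketched below. It is an illustration under assumptions, not the reference solution: the Probe and Prediction dataclasses are hypothetical stand-ins for the Lab 01 types (only the expected, reason_code, and predicted attributes are taken from this lab), and the Wilson interval uses z = 1.96 for roughly 95% coverage.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Probe:  # hypothetical stand-in for the Lab 01 probe type
    expected: str
    reason_code: str


@dataclass
class Prediction:  # hypothetical stand-in for the Lab 01 result type
    predicted: str


def wilson_interval(correct: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion."""
    if n == 0:
        return (0.0, 1.0)
    p = correct / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def _cell(correct: int, n: int) -> dict:
    return {
        "n": n,
        "correct": correct,
        "accuracy": correct / n if n else float("nan"),
        "wilson": wilson_interval(correct, n),
    }


def aggregate_adversarial(probes, predictions) -> dict:
    tallies = defaultdict(lambda: [0, 0])  # reason_code -> [correct, n]
    for probe, pred in zip(probes, predictions):
        tally = tallies[probe.reason_code]
        tally[0] += int(pred.predicted == probe.expected)
        tally[1] += 1
    total_correct = sum(c for c, _ in tallies.values())
    total_n = sum(n for _, n in tallies.values())
    return {
        # Low-sample cells still count toward the overall numbers.
        "overall": _cell(total_correct, total_n),
        "by_category": {
            code: {**_cell(c, n), "low_sample": n < 3}
            for code, (c, n) in sorted(tallies.items())
        },
    }
```

Categories come only from the reason_code values present in the probes, so nothing new is invented at aggregation time. A category with no probes simply does not appear; whether to emit it as an explicit n=0 row is a separate choice.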
Block C — extend run_eval (from Lab 01)¶
After the probe-classification loop, also:
adv_probes = load_probes(adversarial_path)
adv_results = classify_all(model, tokenizer, adv_probes)
results["adversarial"] = aggregate_adversarial(adv_probes, adv_results)

confidences = np.array([r.confidence for r in core_results])
correct = np.array([1 if r.predicted == p.expected else 0
                    for r, p in zip(core_results, core_probes)])
results["ece"] = ece(confidences, correct, n_bins=10)
results["brier"] = brier(confidences, correct)
reliability_diagram(confidences, correct, n_bins=10,
                    out_path=str(out_dir / "reliability.png"))
Block D — synthetic tests in tests/eval/test_calibration.py¶
- test_ece_perfectly_calibrated
- test_ece_overconfident
- test_brier_perfect
- test_brier_uniform_random
- test_adversarial_category_aggregation: fixture probes with known reason_codes and a fixed prediction pattern; verify per-category accuracies.
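The four calibration tests can be built from datasets whose metric values are known by construction. The sketch below shows one possible reading of each test name; the specific datasets (confidence levels 0.25/0.75, a constant 0.9, a constant 0.5) are assumptions, not the lab's fixtures. The compact ece and brier definitions are stand-ins so the block runs on its own; in the lab they would be imported from src/eval/calibration.py.

```python
import numpy as np


# Compact stand-ins; in the lab, import these from src/eval/calibration.py.
def ece(confidences, correct, n_bins=10):
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(confidences, edges[1:-1])  # 1.0 falls in the last bin
    total = 0.0
    for m in range(n_bins):
        mask = idx == m
        if mask.any():
            total += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return float(total)


def brier(confidences, correct):
    return float(np.mean((confidences - correct) ** 2))


def test_ece_perfectly_calibrated():
    # 100 samples at confidence 0.25 with 25 correct, 100 at 0.75 with 75 correct.
    conf = np.concatenate([np.full(100, 0.25), np.full(100, 0.75)])
    correct = np.concatenate([np.repeat([1, 0], [25, 75]),
                              np.repeat([1, 0], [75, 25])])
    assert abs(ece(conf, correct)) < 1e-12


def test_ece_overconfident():
    # Every prediction claims 0.9 but only half are right: one bin, gap of 0.4.
    conf = np.full(100, 0.9)
    correct = np.repeat([1, 0], [50, 50])
    assert abs(ece(conf, correct) - 0.4) < 1e-12


def test_brier_perfect():
    correct = np.array([1, 0, 1, 1, 0])
    assert brier(correct.astype(float), correct) == 0.0


def test_brier_uniform_random():
    # Constant confidence 0.5 gives 0.5 ** 2 = 0.25 regardless of the labels.
    correct = np.array([1, 0, 0, 1, 1, 0])
    assert brier(np.full(6, 0.5), correct) == 0.25
```

Using confidence values that are exact in binary floating point (0.25, 0.5, 0.75) keeps the zero-ECE case exact; for other values a small tolerance is safer than an equality check.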
Block E — visualizations¶
reliability.png:
- y-axis: empirical accuracy in bin.
- x-axis: average confidence in bin.
- Diagonal y=x.
- Bar overlay (semi-transparent) showing bin counts on a secondary y-axis or annotated.
- Title: Reliability — checkpoint=<name>, ECE=<val>, N=<count>.
adversarial_by_category.png:
- Horizontal bar chart, one bar per category, length = accuracy, color-coded by sample count.
- Annotate the bar with n=<count> and Wilson CI.
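The bar chart and adversarial_by_category.csv can share the same per-category rows. As a minimal sketch, assuming the dict shape documented in Block B: the function names and the CSV column names below are hypothetical, and the Wilson values in the sample input are illustrative.

```python
import csv
import io

# Column names are an assumption; the lab does not prescribe them.
FIELDS = ["category", "n", "correct", "accuracy", "wilson_lo", "wilson_hi", "low_sample"]


def by_category_rows(adversarial: dict) -> list:
    """Flatten adversarial['by_category'] into one dict per category."""
    rows = []
    for category, cell in adversarial["by_category"].items():
        lo, hi = cell["wilson"]
        rows.append({
            "category": category,
            "n": cell["n"],
            "correct": cell["correct"],
            "accuracy": round(cell["accuracy"], 4),
            "wilson_lo": round(lo, 4),
            "wilson_hi": round(hi, 4),
            # Fall back to the n < 3 rule if the flag was not stored.
            "low_sample": cell.get("low_sample", cell["n"] < 3),
        })
    return rows


def write_by_category_csv(adversarial: dict, fh) -> None:
    """Write one header line plus one row per category to an open text file."""
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(by_category_rows(adversarial))


sample = {
    "by_category": {
        "over_regularization": {"n": 4, "correct": 3, "accuracy": 0.75,
                                "wilson": (0.3006, 0.9544)},
        "en_es_mismatch": {"n": 2, "correct": 1, "accuracy": 0.5,
                           "wilson": (0.0945, 0.9055)},
    }
}
buffer = io.StringIO()
write_by_category_csv(sample, buffer)
csv_text = buffer.getvalue()
```

When writing to a real file, the csv module documentation recommends opening it with newline="" so row terminators are not translated twice.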
Constraints¶
- Categories match reason_code tags exactly. Don't invent new categories at aggregation time; the categories are defined in theory/03 and probe reason_code values must match.
- Reliability diagram includes empty bins. Don't drop them; show them as zero-height markers so the absence is visible.
- Adversarial probes are excluded from the core ECE/Brier calculation. Otherwise the trick examples poison the calibration estimate. Adversarial ECE is computed separately and reported in its own line.
Stop conditions¶
Done when:
- All five test_calibration.py tests pass.
- experiments/20-eval-report/<name>/reliability.png exists and shows a recognizable curve (it may be near-diagonal, over-confident, or under-confident; that is the model's behavior).
- experiments/20-eval-report/<name>/adversarial_by_category.csv exists with one row per category.
- results.json has ece, brier, adversarial.overall.accuracy, and adversarial.by_category.* populated.
Pitfalls¶
- ECE on n_bins > N/3: bins average fewer than 3 samples, and ECE estimates become noise. With N=60 core probes and n_bins=10, bins average ~6 samples, which is borderline, and equal-width bins are rarely filled evenly. A bin with 0 samples contributes 0 to ECE, and a bin with 1 sample contributes a term that is almost pure noise. Report n_bins and min_bin_count in the JSON.
- Confidence is over the candidate set, not over the full vocab. A 4-candidate probe with confidence=0.5 (chance is 0.25) is a much stronger signal than a 2-candidate probe with confidence=0.5 (chance is 0.5, so the model is expressing no preference). If you mix probes with different len(candidates), calibration is muddled. Either normalize (rescale to "above-random") or report calibration separately per candidate-set size. Document the choice.
- adversarial_by_category cells with n=1 or n=2 are nearly meaningless individually but contribute to the overall adversarial accuracy. Don't suppress them from overall; do flag them as low-sample.
- Brier vs ECE disagreement. They measure related but distinct things. ECE = 0 doesn't imply Brier = 0: a model can be perfectly calibrated at 0.5 confidence with 0.5 accuracy, giving ECE=0 but Brier=0.25. Report both.
- Reliability-diagram axes flipped. Convention: confidence on x, accuracy on y. Get this right or every reader will misread the plot.
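For the candidate-set-size pitfall, both options can be sketched in a few lines. This is one interpretation of "above-random", assumed here to mean a linear rescale that maps chance level 1/k to 0 and certainty to 1; the function names and the (confidence, correct, n_candidates) record layout are hypothetical.

```python
from collections import defaultdict


def above_random(confidence: float, n_candidates: int) -> float:
    """Rescale so chance level (1/k) maps to 0 and certainty maps to 1."""
    if n_candidates < 2:
        raise ValueError("need at least two candidates")
    chance = 1.0 / n_candidates
    return (confidence - chance) / (1.0 - chance)


def group_by_candidate_count(records):
    """Split (confidence, correct, n_candidates) records by candidate-set size."""
    groups = defaultdict(list)
    for confidence, correct, k in records:
        groups[k].append((confidence, correct))
    return dict(groups)


# The same raw confidence means different things at different set sizes.
four_way = above_random(0.5, 4)  # one third of the way from chance to certainty
two_way = above_random(0.5, 2)   # exactly at chance

records = [(0.5, 1, 4), (0.9, 1, 2), (0.6, 0, 2)]
per_size = group_by_candidate_count(records)
```

Note that a rescaled confidence is no longer a probability, so comparing it against raw accuracy would need the accuracy rescaled the same way. Reporting ECE and Brier separately for each group in per_size avoids that issue at the cost of smaller samples per group.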
When to consult solutions/¶
After all tests pass and the reliability + adversarial PNGs are produced. The solution at solutions/02-calibration-ref.md (written at phase open) discusses how to read the reliability diagram and what an "over-confident on adversarials" model means in practice.
Next lab: lab/03-report-and-checkpoint-compare.md.