Lab 02 — Calibration metrics and the adversarial slice

Goal: extend the harness to compute ECE, Brier score, the reliability diagram, and adversarial slice scores broken down by trick category.

Estimated time: 90-120 minutes.

Prerequisites: Lab 01 is complete (the harness emits results.json with per-probe predictions and confidences), and data/eval/adversarial.jsonl contains at least 20 hand-curated adversarial probes.

What you produce

Extensions:

src/eval/calibration.py — ECE, Brier, reliability-diagram helpers.
src/eval/adversarial.py — adversarial slice aggregator.
tests/eval/test_calibration.py — synthetic tests on known-calibrated and known-miscalibrated predictions.

New artifacts (added to experiments/20-eval-report/<checkpoint_name>/):

reliability.png — reliability diagram (predicted-confidence x-axis, empirical-accuracy y-axis, diagonal reference).
adversarial_by_category.csv — accuracy on each trick category.
adversarial_by_category.png — bar chart.
results.json extended with ece, brier, adversarial.overall, adversarial.by_category.

TODOs

Block A — `src/eval/calibration.py`

import numpy as np

def ece(confidences: np.ndarray,    # shape (N,), in [0,1]
        correct: np.ndarray,         # shape (N,), 0 or 1
        n_bins: int = 10) -> float:
    """Expected Calibration Error with equal-width bins."""
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    N = len(confidences)
    total = 0.0
    for m in range(n_bins):
        lo, hi = bin_edges[m], bin_edges[m+1]
        # include right edge in the last bin
        if m == n_bins - 1:
            mask = (confidences >= lo) & (confidences <= hi)
        else:
            mask = (confidences >= lo) & (confidences < hi)
        if mask.sum() == 0:
            continue
        bin_acc = correct[mask].mean()
        bin_conf = confidences[mask].mean()
        total += (mask.sum() / N) * abs(bin_acc - bin_conf)
    return float(total)


def brier(confidences: np.ndarray, correct: np.ndarray) -> float:
    """Binary Brier score: mean squared error between confidence and correctness."""
    return float(np.mean((confidences - correct) ** 2))


def brier_multiclass(probs: np.ndarray,   # shape (N, C)
                     labels: np.ndarray,  # shape (N,), integer in [0, C)
                     ) -> float:
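    """Multi-class Brier, normalized by class count.

    Per sample: sum of squared deviations across classes, divided by C;
    then the mean over samples. Brier's original definition omits the 1/C.
    """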
    N, C = probs.shape
    one_hot = np.zeros_like(probs)
    one_hot[np.arange(N), labels] = 1.0
    return float(np.mean(np.sum((probs - one_hot) ** 2, axis=1) / C))


def reliability_diagram(confidences: np.ndarray,
                        correct: np.ndarray,
                        n_bins: int = 10,
                        out_path: str = "reliability.png") -> None:
    """Plot bin-mean-confidence vs bin-mean-accuracy. Diagonal reference."""
    import matplotlib.pyplot as plt
    # ... compute per-bin stats, plot

ece returns 0 on a perfectly-calibrated synthetic dataset.
brier returns 0 on perfect predictions (conf=1.0 for correct=1, conf=0.0 for correct=0).
reliability_diagram saves a PNG with: bar plot of per-bin counts (background), scatter of (conf, acc) per bin, and the y=x diagonal.
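
The per-bin statistics behind both ece and reliability_diagram can be computed once and reused. The following is a minimal sketch, not the reference solution: `bin_stats` is a hypothetical helper name, and it assumes the same equal-width binning as ece above (right edge included in the last bin). Empty bins are kept, with NaN means, so the plot can still mark them.

```python
import numpy as np

def bin_stats(confidences, correct, n_bins=10):
    """Per-bin count, mean confidence, and accuracy over equal-width bins.

    Hypothetical helper: empty bins are kept and get NaN for both means.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)

    # side="right" puts x in the bin with lo <= x < hi; the clip moves
    # confidence == 1.0 into the last bin instead of one past it.
    idx = np.clip(np.searchsorted(edges, confidences, side="right") - 1,
                  0, n_bins - 1)

    counts = np.bincount(idx, minlength=n_bins)
    conf_sum = np.bincount(idx, weights=confidences, minlength=n_bins)
    acc_sum = np.bincount(idx, weights=correct, minlength=n_bins)

    # 0/0 in empty bins yields NaN; silence the warning on purpose.
    with np.errstate(invalid="ignore", divide="ignore"):
        mean_conf = conf_sum / counts
        accuracy = acc_sum / counts

    return {"edges": edges, "counts": counts,
            "mean_conf": mean_conf, "accuracy": accuracy}
```

With this in hand, `int(stats["counts"].min())` is one way to obtain the min_bin_count value that the pitfalls section asks you to report.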

Block B — `src/eval/adversarial.py`

def aggregate_adversarial(probes, predictions) -> dict:
    """Returns:
       {
         "overall": {"n": int, "correct": int, "accuracy": float, "wilson": (lo, hi)},
         "by_category": {
            "over_regularization": {...},
            "wrong_person_agreement": {...},
            "wrong_tense_marker": {...},
            "auxiliary_mismatch": {...},
            "en_es_mismatch": {...},
            "plural_or_oos": {...},
         }
       }
    """
    ...

Categories are derived from probe.reason_code (the trick tag).
Cells with n < 3 are reported but flagged with low_sample: true.
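
One possible shape for the aggregator is sketched below, under stated assumptions rather than as the reference solution. It assumes probes expose `.reason_code` and `.expected` and predictions expose `.predicted` (as Block C uses them). The `KNOWN_CATEGORIES` tuple, the `wilson_interval` and `_cell` helpers, and the `low_sample_below` parameter are hypothetical additions. Unknown reason codes raise instead of silently creating a new category.

```python
import math

# Category names copied from the docstring above; keep in sync with theory/03.
KNOWN_CATEGORIES = (
    "over_regularization", "wrong_person_agreement", "wrong_tense_marker",
    "auxiliary_mismatch", "en_es_mismatch", "plural_or_oos",
)

def wilson_interval(correct, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if n == 0:
        return (0.0, 1.0)  # no data: the interval is uninformative
    p = correct / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

def _cell(correct, n):
    """One report cell; accuracy is None when the cell has no probes."""
    return {"n": n, "correct": correct,
            "accuracy": correct / n if n else None,
            "wilson": wilson_interval(correct, n)}

def aggregate_adversarial(probes, predictions, low_sample_below=3):
    # reason_code -> [correct, n]; every known category gets a row.
    tallies = {code: [0, 0] for code in KNOWN_CATEGORIES}
    for probe, pred in zip(probes, predictions):
        if probe.reason_code not in tallies:
            raise ValueError(f"unknown reason_code: {probe.reason_code!r}")
        tallies[probe.reason_code][0] += int(pred.predicted == probe.expected)
        tallies[probe.reason_code][1] += 1

    by_category = {
        code: {**_cell(c, n), "low_sample": n < low_sample_below}
        for code, (c, n) in tallies.items()
    }
    # Low-sample cells still count toward the overall figures.
    overall = _cell(sum(c for c, _ in tallies.values()),
                    sum(n for _, n in tallies.values()))
    return {"overall": overall, "by_category": by_category}
```

The Wilson interval is used here because it stays inside [0, 1] and behaves sensibly for the small per-category counts this lab produces.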

Block C — extend `run_eval` (from Lab 01)

After the probe-classification loop, also:

adv_probes = load_probes(adversarial_path)
adv_results = classify_all(model, tokenizer, adv_probes)
results["adversarial"] = aggregate_adversarial(adv_probes, adv_results)

confidences = np.array([r.confidence for r in core_results])
correct = np.array([1 if r.predicted == p.expected else 0
                    for r, p in zip(core_results, core_probes)])
results["ece"] = ece(confidences, correct, n_bins=10)
results["brier"] = brier(confidences, correct)
reliability_diagram(confidences, correct, n_bins=10,
                    out_path=str(out_dir / "reliability.png"))

Block D — synthetic tests in `tests/eval/test_calibration.py`

test_ece_perfectly_calibrated:

# 100 samples where confidence=0.7 and accuracy=0.7 exactly
conf = np.full(100, 0.7)
correct = np.array([1]*70 + [0]*30)
assert ece(conf, correct) < 1e-9

test_ece_overconfident:

# 100 samples: confidence=0.95, accuracy=0.50
conf = np.full(100, 0.95)
correct = np.array([1]*50 + [0]*50)
assert abs(ece(conf, correct) - 0.45) < 1e-9

test_brier_perfect:

conf = np.array([1.0, 1.0, 0.0, 0.0])
correct = np.array([1, 1, 0, 0])
assert brier(conf, correct) == 0.0

test_brier_uniform_random:

# Conf=0.5 always: (0.5 - y)^2 = 0.25 for y in {0, 1}, so Brier is exactly 0.25 for any labels
conf = np.full(100, 0.5)
correct = np.random.RandomState(0).randint(0, 2, size=100).astype(float)
assert abs(brier(conf, correct) - 0.25) < 1e-12

test_adversarial_category_aggregation — fixture probes with known reason_codes and a fixed prediction pattern; verify per-category accuracies.

Block E — visualizations

reliability.png:

y-axis: empirical accuracy in bin.
x-axis: average confidence in bin.
Diagonal y=x.
Bar overlay (semi-transparent) showing bin counts on a secondary y-axis or annotated.
Title: Reliability — checkpoint=<name>, ECE=<val>, N=<count>.

adversarial_by_category.png:

Horizontal bar chart, one bar per category, length = accuracy, color-coded by sample count.
Annotate the bar with n=<count> and Wilson CI.
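
The reliability.png layout described above could be drawn roughly as follows. This is a sketch under assumptions, not the reference plot: `plot_reliability` is a hypothetical name, it takes already-computed per-bin arrays (NaN means for empty bins), and it picks the non-interactive Agg backend so it runs without a display. Empty bins are drawn as zero-height markers at the bin center, in line with the constraint that they stay visible.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; an assumption for CI-style runs
import matplotlib.pyplot as plt

def plot_reliability(mean_conf, accuracy, counts, edges, title, out_path):
    """Scatter of (bin confidence, bin accuracy) with diagonal and count bars."""
    mean_conf = np.asarray(mean_conf, dtype=float)
    accuracy = np.asarray(accuracy, dtype=float)
    counts = np.asarray(counts)
    edges = np.asarray(edges, dtype=float)
    filled = counts > 0

    fig, ax = plt.subplots(figsize=(5, 5))
    # y = x reference: perfect calibration.
    ax.plot([0, 1], [0, 1], linestyle="--", color="gray",
            label="perfect calibration")
    # Confidence on x, accuracy on y (the conventional orientation).
    ax.scatter(mean_conf[filled], accuracy[filled], zorder=3,
               label="bin (conf, acc)")
    # Empty bins: zero-height marker at the bin center.
    centers = (edges[:-1] + edges[1:]) / 2
    ax.scatter(centers[~filled], np.zeros(int((~filled).sum())),
               marker="x", color="red", zorder=3, label="empty bin")
    ax.set_xlabel("mean confidence in bin")
    ax.set_ylabel("empirical accuracy in bin")
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)

    # Semi-transparent bin counts on a secondary y-axis.
    ax2 = ax.twinx()
    ax2.bar(edges[:-1], counts, width=np.diff(edges), align="edge",
            alpha=0.2, color="tab:blue")
    ax2.set_ylabel("bin count")

    ax.set_title(title)
    ax.legend(loc="upper left")
    fig.savefig(out_path, dpi=150, bbox_inches="tight")
    plt.close(fig)
```

The caller builds the title string, so the checkpoint name, ECE value, and N from the spec can be formatted where those values are known.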

Constraints

Categories match reason_code tags exactly. Don't invent new categories at aggregation time; the categories are defined in theory/03 and probe reason_code values must match.
Reliability diagram includes empty bins. Don't drop them; show them as zero-height markers so the absence is visible.
Adversarial probes are excluded from the core ECE/Brier calculation; otherwise the trick examples poison the calibration estimate. Compute adversarial ECE separately and report it on its own line.

Stop conditions

Done when:

All five test_calibration.py tests pass.
experiments/20-eval-report/<name>/reliability.png exists and shows a recognizable curve (could be diagonal-ish, could be over- or under-confident — that's the model's behavior).
experiments/20-eval-report/<name>/adversarial_by_category.csv exists with one row per category.
results.json has ece, brier, adversarial.overall.accuracy, and adversarial.by_category.* populated.

Pitfalls

ECE with n_bins > N/3: bins then average fewer than 3 samples each, and the ECE estimate becomes noise. With N=60 core probes and n_bins=10, bins average about 6 samples, which is borderline, and confidences are rarely spread evenly, so some bins will hold far fewer. An empty bin contributes 0 to ECE; a bin with a single sample contributes |acc - conf| / N with acc equal to 0 or 1, which is pure noise. Report n_bins and min_bin_count in the JSON.
Confidence is computed over the candidate set, not over the full vocabulary. Chance level is 1/len(candidates), so confidence=0.5 on a 4-candidate probe (chance 0.25) is a much stronger signal than confidence=0.5 on a 2-candidate probe (chance 0.5, a coin flip). If you mix probes with different len(candidates), calibration is muddled. Either normalize (rescale to "above-random") or report calibration separately per candidate-set size. Document the choice.
adversarial_by_category cells with n=1 or n=2 are nearly meaningless individually but contribute to the overall adversarial accuracy. Don't suppress them from overall; do flag them as low-sample.
Brier vs ECE disagreement. They measure related but distinct things. ECE = 0 doesn't imply Brier = 0 (a model can be perfectly calibrated at 0.5 confidence with 0.5 accuracy — ECE=0 but Brier=0.25). Report both.
Reliability-diagram axes flipped. Convention: confidence on x, accuracy on y. Get this right or every reader will misread the plot.
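
For the candidate-set-size pitfall above, both options can be sketched in a few lines. The names `above_random` and `split_by_candidate_count` are hypothetical, and the linear rescale (0 at chance, 1 at certainty) is one reasonable reading of "above-random", not a prescribed formula. The rescaled value is no longer a probability, so treat it as a comparison aid and document how you use it.

```python
from collections import defaultdict

def above_random(confidence, n_candidates):
    """Rescale confidence so chance (1/k) maps to 0 and certainty to 1."""
    if n_candidates < 2:
        raise ValueError("need at least 2 candidates")
    chance = 1.0 / n_candidates
    return (confidence - chance) / (1.0 - chance)

def split_by_candidate_count(confidences, correct, n_candidates):
    """Group (confidence, correct) pairs by candidate-set size.

    Returns {k: (confidences_for_k, correct_for_k)} so that calibration
    metrics can be computed and reported separately per size.
    """
    groups = defaultdict(lambda: ([], []))
    for conf, hit, k in zip(confidences, correct, n_candidates):
        groups[k][0].append(conf)
        groups[k][1].append(hit)
    return dict(groups)
```

For example, `above_random(0.5, 2)` is 0.0 (exactly chance), while `above_random(0.5, 4)` is one third of the way from chance to certainty.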

When to consult `solutions/`

Consult it only after all tests pass and both the reliability and adversarial PNGs are produced. The solution at solutions/02-calibration-ref.md (written at phase open) discusses how to read the reliability diagram and what an "over-confident on adversarials" model means in practice.

Next lab: lab/03-report-and-checkpoint-compare.md.