Lab 03: Calibration analysis on a toy classifier

Read theory/03-cross-entropy-and-mle.md. Do not consult solutions/.

Objective

Build a tiny classifier over the 5 tenses, then measure its calibration — does its expressed confidence match its empirical accuracy? This bridges to Phase 20 (evaluation harness) where calibration becomes a first-class metric for the mini-GPT.

Setup

Continue in src/phase05/probability.py and a new src/phase05/toy_classifier.py.

Background

A model that predicts "past tense" with 80% confidence and is right on 80% of those predictions is calibrated. If it is right 95% of the time at 80% confidence, it is under-confident; if it is right only 50% of the time, it is over-confident. Modern neural networks are often over-confident, and this lab gets the intuition into your hands.

The expected calibration error (ECE) for \(M\) bins:

\[\text{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N} \big| \text{acc}(B_m) - \text{conf}(B_m) \big|,\]

where \(B_m\) is the set of predictions in confidence bin \(m\), \(\text{acc}(B_m)\) is the empirical accuracy in that bin, and \(\text{conf}(B_m)\) is the average confidence.
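
As a worked instance with made-up numbers (purely illustrative, not results from this lab): suppose \(N = 100\) test predictions land in only two non-empty bins out of \(M = 10\). \(B_7\) holds 40 predictions with average confidence 0.65 and accuracy 0.50; \(B_{10}\) holds 60 predictions with average confidence 0.95 and accuracy 0.90. Then

\[\text{ECE} = \frac{40}{100}\,\big|0.50 - 0.65\big| + \frac{60}{100}\,\big|0.90 - 0.95\big| = 0.4 \times 0.15 + 0.6 \times 0.05 = 0.09.\]

In both bins confidence exceeds accuracy, so this hypothetical model is over-confident; note that ECE itself discards that sign and reports only the size of the gap.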

Tasks

Task 1: build the toy classifier

Generate a synthetic dataset of (input, tense) pairs. Each tense label is one of 5 classes; given its label, the input is an 8-dim feature vector drawn from that class's Gaussian (class-conditional Gaussians, with modest separation between the class means). 5000 train, 1000 test.
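
One possible way to generate such data is sketched below. The function name `make_tense_dataset`, the identity covariance, the uniform class prior, and the `separation=0.8` default are all assumptions of this sketch, not values prescribed by the lab; tune the separation until the task is neither trivial nor hopeless.

```python
import numpy as np

def make_tense_dataset(n, n_classes=5, dim=8, separation=0.8, seed=0, means=None):
    """Sample (X, y, means): pick a class uniformly, then draw x ~ N(means[class], I).

    All defaults are illustrative guesses, not lab-mandated values.
    """
    rng = np.random.default_rng(seed)
    if means is None:
        # Random class means; a larger `separation` makes the classes easier to tell apart.
        means = separation * rng.standard_normal((n_classes, dim))
    y = rng.integers(0, n_classes, size=n)       # labels in {0, ..., n_classes - 1}
    X = means[y] + rng.standard_normal((n, dim))  # unit-variance noise around the class mean
    return X, y, means

# Train and test must share the same class means, so pass them through.
X_train, y_train, means = make_tense_dataset(5000, seed=0)
X_test, y_test, _ = make_tense_dataset(1000, seed=1, means=means)
```

Reusing `means` for the test set keeps both splits on the same distribution while the different seeds keep the samples independent.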

Train a logistic-regression classifier (5-way softmax). Use NumPy + SGD; no PyTorch. (This is a Phase 7 preview — embrace it.)
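
A minimal sketch of 5-way softmax regression trained with mini-batch SGD in NumPy follows. The learning rate, epoch count, batch size, and zero initialisation are untuned assumptions, and `train_softmax_regression` is a hypothetical name; treat it as a starting point rather than the expected solution.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_softmax_regression(X, y, n_classes=5, lr=0.1, epochs=20, batch_size=64, seed=0):
    """Fit logits = X @ W + b by minimising mean cross-entropy with mini-batch SGD."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            grad_logits = softmax(X[idx] @ W + b)
            # Gradient of cross-entropy w.r.t. the logits is (q - one_hot(y)).
            grad_logits[np.arange(len(idx)), y[idx]] -= 1.0
            grad_logits /= len(idx)
            W -= lr * X[idx].T @ grad_logits
            b -= lr * grad_logits.sum(axis=0)
    return W, b
```

Returning `W` and `b` rather than probabilities keeps the raw logits `X @ W + b` available, which is what temperature scaling needs later.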

Task 2: measure raw accuracy and calibration

For the test set:

Accuracy. Argmax prediction vs true label.
Confidence per prediction. \(\max_i q_i(x)\).
Bin into 10 equal-width bins in confidence ∈ [0, 1].
Per-bin: accuracy vs average confidence.
Compute ECE.
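
The steps above can be sketched in one function. The name `calibration_bins` and its return layout are assumptions; the bins here are half-open `[k/M, (k+1)/M)` with confidence 1.0 folded into the last bin, which is one reasonable convention among several.

```python
import numpy as np

def calibration_bins(probs, labels, n_bins=10):
    """Return (counts, accuracy, mean confidence, ECE) over equal-width confidence bins.

    probs:  (N, C) array of predicted class probabilities.
    labels: (N,) array of integer class labels.
    Empty bins get NaN accuracy/confidence and add nothing to ECE.
    """
    conf = probs.max(axis=1)                                  # confidence = max_i q_i(x)
    correct = (probs.argmax(axis=1) == labels).astype(float)  # 1.0 where argmax is right
    # Bin index per prediction; confidence exactly 1.0 goes in the last bin.
    bin_idx = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    counts = np.bincount(bin_idx, minlength=n_bins)
    acc = np.full(n_bins, np.nan)
    avg_conf = np.full(n_bins, np.nan)
    ece = 0.0
    for m in range(n_bins):
        if counts[m] == 0:
            continue  # skip empty bins instead of computing 0/0
        in_bin = bin_idx == m
        acc[m] = correct[in_bin].mean()
        avg_conf[m] = conf[in_bin].mean()
        ece += counts[m] / len(conf) * abs(acc[m] - avg_conf[m])
    return counts, acc, avg_conf, ece
```

Keeping the per-bin arrays alongside the scalar ECE means the same call can feed both the printed table and a reliability diagram.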

Task 3: reliability diagram

Plot: x-axis = bin midpoint (confidence), y-axis = accuracy in that bin. The 45° line is perfect calibration. Bars above the diagonal = under-confident; bars below = over-confident.

Save as experiments/<date>-phase-05-calibration/reliability.png.
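
A possible matplotlib sketch of the diagram is below. It assumes matplotlib is installed and that you already have a length-\(M\) array of per-bin accuracies with NaN for empty bins; `plot_reliability` and its arguments are hypothetical names, and the output path is whatever you pass in.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: write files without opening a window
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path

def plot_reliability(bin_acc, out_path, title="Reliability diagram"):
    """Bar chart of per-bin accuracy against bin midpoint, plus the 45-degree line."""
    bin_acc = np.asarray(bin_acc, dtype=float)
    n_bins = len(bin_acc)
    width = 1.0 / n_bins
    mids = (np.arange(n_bins) + 0.5) * width  # bin midpoints on the confidence axis
    filled = ~np.isnan(bin_acc)               # draw nothing for empty bins
    fig, ax = plt.subplots(figsize=(4, 4))
    ax.bar(mids[filled], bin_acc[filled], width=width,
           edgecolor="black", alpha=0.7, label="accuracy")
    ax.plot([0, 1], [0, 1], linestyle="--", color="gray", label="perfect calibration")
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.set_xlabel("confidence (bin midpoint)")
    ax.set_ylabel("accuracy")
    ax.set_title(title)
    ax.legend(loc="upper left")
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)  # create the experiment folder if needed
    fig.savefig(out_path, dpi=150, bbox_inches="tight")
    plt.close(fig)
    return out_path
```

Calling it once before and once after temperature scaling, with different file names, gives the two diagrams the lab asks you to capture.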

Task 4: temperature scaling for calibration

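Fit a single scalar parameter \(T > 0\) that divides the logits before softmax: \(q_i = \text{softmax}(z / T)_i\). Keep the trained weights frozen and optimise only \(T\), on a held-out validation split, to minimise the validation cross-entropy. Because dividing by a positive \(T\) does not change the argmax, accuracy is unchanged; only the confidences move. Then re-measure ECE on the test set.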
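
Since \(T\) is a single scalar, one simple way to fit it is a grid search over candidate temperatures, as sketched below. The grid search is a stand-in chosen for this sketch (any 1-D optimiser would do), and the names `fit_temperature` and `cross_entropy` and the grid range 0.1 to 10 are assumptions.

```python
import numpy as np

def log_softmax(z):
    """Row-wise log-softmax, shifted by the row max for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def cross_entropy(logits, labels, T=1.0):
    """Mean negative log-likelihood of the true labels under softmax(logits / T)."""
    logp = log_softmax(logits / T)
    return -logp[np.arange(len(labels)), labels].mean()

def fit_temperature(val_logits, val_labels, grid=None):
    """Return the T on the grid that minimises validation cross-entropy."""
    if grid is None:
        grid = np.logspace(-1, 1, 201)  # candidate T from 0.1 to 10, log-spaced
    losses = [cross_entropy(val_logits, val_labels, T) for T in grid]
    return float(grid[int(np.argmin(losses))])
```

Apply the fitted value to the test logits, `softmax(test_logits / T_star)`, before recomputing ECE; if the best value lands on either end of the grid, widen the grid.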

Expected: ECE drops substantially. This is the central practical result of Guo et al. 2017 ("On Calibration of Modern Neural Networks"), recapitulated by Borja on a toy.

Task 5: diagnose

Where does your classifier sit before scaling? Over-confident, under-confident, or roughly calibrated? Why does this happen? (Hint: cross-entropy minimisation on finite data tends to push confidences too high.)
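
ECE is unsigned, so it cannot by itself tell you which way the model errs. One quick complementary check, sketched here under the assumption that you have the test probabilities and labels as NumPy arrays, is the signed gap between mean confidence and accuracy (`confidence_gap` is a hypothetical helper name).

```python
import numpy as np

def confidence_gap(probs, labels):
    """Mean confidence minus accuracy: > 0 suggests over-confidence, < 0 under-confidence."""
    conf = probs.max(axis=1)
    correct = probs.argmax(axis=1) == labels
    return float(conf.mean() - correct.mean())
```

This is an average over the whole test set, so it can hide bins that err in opposite directions; the reliability diagram remains the finer-grained view.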

Measurements to capture

ECE before and after temperature scaling.
Reliability diagrams: before and after.
Optimal \(T^*\): typically \(T^* > 1\) for over-confident models and \(T^* < 1\) for under-confident ones.
Manifest in experiments/<date>-phase-05-calibration/manifest.json.
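
The lab does not specify a manifest schema here, so the sketch below is only one possible shape: every field name, the `write_manifest` helper, and its parameters are hypothetical, and the values are whatever your own run produced.

```python
import json
from pathlib import Path

def write_manifest(out_dir, ece_before, ece_after, temperature, n_bins=10, seed=0):
    """Write a small manifest.json into out_dir and return its path.

    The field names are illustrative assumptions, not a required schema.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    manifest = {
        "phase": "05",
        "lab": "03-calibration",
        "seed": seed,
        "n_bins": n_bins,
        "ece": {"before": ece_before, "after": ece_after},
        "temperature": temperature,
        "artifacts": ["reliability.png"],
    }
    path = out_dir / "manifest.json"
    path.write_text(json.dumps(manifest, indent=2) + "\n")
    return path
```

Recording the seed and bin count next to the metrics makes it possible to reproduce the exact ECE numbers later.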

Acceptance

Toy classifier trained; test accuracy reasonable (≥ 70%).
ECE computed; reliability diagram saved.
Temperature scaling implemented and applied.
Post-scaling ECE strictly lower than pre-scaling ECE (or document the rare case where it isn't).
Written diagnosis: 2-3 paragraphs in your lab notes.

Pitfalls to expect

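Empty bins. With 1000 test points and 10 bins, some bins may have 0 predictions. In fact, the top probability of a 5-way softmax is at least 1/5, so confidences never fall below 0.2 and the lowest bins are always empty. An empty bin has weight \(|B_m| / N = 0\) and contributes nothing to ECE, but its accuracy is 0/0, so skip it explicitly rather than letting a NaN into the sum.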
Choice of \(M\) (number of bins) affects ECE — too few hides errors, too many is noisy. \(M = 10\) or \(M = 15\) are standard.
Temperature scaling minimises CE, not ECE directly. They are correlated but not identical. The fact that minimising CE on a held-out split helps ECE is non-obvious and is the punchline of the lab.
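Keep the splits separate when fitting \(T\): use only the validation split. Fitting on the test set is leakage, and fitting on the training data is uninformative, because a converged model already minimises training cross-entropy and \(T^*\) there sits near 1.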
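
Task 1 specifies only train and test sizes, so the validation split has to come from somewhere. One option, sketched below, is to carve it out of the training data before fitting the classifier; the 20% fraction and the name `split_train_val` are assumptions of this sketch.

```python
import numpy as np

def split_train_val(X, y, val_fraction=0.2, seed=0):
    """Shuffle once, then return disjoint (X_train, y_train), (X_val, y_val)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_val = int(round(val_fraction * len(X)))
    val_idx, train_idx = order[:n_val], order[n_val:]
    return (X[train_idx], y[train_idx]), (X[val_idx], y[val_idx])
```

Train the weights on the first pair, fit \(T\) on the second, and touch the test set only for the final accuracy and ECE numbers.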

Next: Phase 06 — Python for AI Engineering (after /quiz 05 and /phase-report 05).