Lab 03 — Calibration analysis on a toy classifier¶
Read theory/03-cross-entropy-and-mle.md. Do not consult solutions/.
Objective¶
Build a tiny classifier over the 5 tenses, then measure its calibration — does its expressed confidence match its empirical accuracy? This bridges to Phase 20 (evaluation harness) where calibration becomes a first-class metric for the mini-GPT.
Setup¶
Continue in src/phase05/probability.py and a new src/phase05/toy_classifier.py.
Background¶
A model that always predicts "past tense with 80% confidence" and is actually right 80% of the time is calibrated. A model that predicts 80% confidence but is right 95% of the time is under-confident; one that's right 50% of the time is over-confident. Neural networks are notoriously over-confident — this lab gets the intuition into your hands.
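The expected calibration error (ECE) for \(M\) bins:
\[
\text{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \, \bigl| \text{acc}(B_m) - \text{conf}(B_m) \bigr|
\]
where \(B_m\) is the set of predictions in confidence bin \(m\), \(n\) is the total number of predictions, \(\text{acc}(B_m)\) is the empirical accuracy in that bin, and \(\text{conf}(B_m)\) is the average confidence in that bin.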
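A small worked example may help fix the arithmetic. ECE weights each bin's gap between accuracy and average confidence by the fraction of predictions that fall in the bin. The numbers below are made up for illustration: 10 predictions, of which 5 land in the bin \((0.6, 0.7]\) with average confidence 0.65 and 3 correct, and 5 land in the bin \((0.9, 1.0]\) with average confidence 0.95 and 4 correct. All other bins are empty and contribute nothing.
\[
\text{ECE} = \tfrac{5}{10}\,\bigl|0.60 - 0.65\bigr| + \tfrac{5}{10}\,\bigl|0.80 - 0.95\bigr| = 0.025 + 0.075 = 0.10
\]
In both bins the accuracy is below the average confidence, so this hypothetical model is over-confident.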
Tasks¶
Task 1 — build the toy classifier¶
Generate a synthetic dataset of (input, tense) pairs. Tense labels come from 5 classes; each input is an 8-dim feature vector drawn from a class-conditional Gaussian whose mean depends on the label (modest separation between classes). 5000 train, 1000 test.
Train a logistic-regression classifier (5-way softmax). Use NumPy + SGD; no PyTorch. (This is a Phase 7 preview — embrace it.)
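One possible shape for Task 1 is sketched below, as a starting point rather than a reference solution. The function names (`make_dataset`, `softmax`, `train_softmax_regression`), the separation scale of 0.8, the unit noise, and the SGD hyperparameters are all assumptions chosen for illustration; none of them come from the lab text.

```python
import numpy as np

def make_dataset(n, means, rng, noise=1.0):
    """Sample n (x, y) pairs: y uniform over classes, x ~ N(means[y], noise^2 I)."""
    n_classes, dim = means.shape
    y = rng.integers(0, n_classes, size=n)
    x = means[y] + noise * rng.standard_normal((n, dim))
    return x, y

def softmax(z):
    """Row-wise softmax; the row max is subtracted for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_softmax_regression(x, y, n_classes, rng, lr=0.1, epochs=20, batch=64):
    """Minibatch SGD on the mean cross-entropy of a linear softmax model."""
    n, dim = x.shape
    w = np.zeros((dim, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[y]
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch):
            idx = order[start:start + batch]
            q = softmax(x[idx] @ w + b)
            # Gradient of the mean cross-entropy with respect to the logits.
            grad_z = (q - onehot[idx]) / len(idx)
            w -= lr * x[idx].T @ grad_z
            b -= lr * grad_z.sum(axis=0)
    return w, b

rng = np.random.default_rng(0)
means = 0.8 * rng.standard_normal((5, 8))  # assumed "modest" class separation
x_train, y_train = make_dataset(5000, means, rng)
x_test, y_test = make_dataset(1000, means, rng)
w, b = train_softmax_regression(x_train, y_train, 5, rng)
test_acc = float((np.argmax(x_test @ w + b, axis=1) == y_test).mean())
print(f"test accuracy: {test_acc:.3f}")
```

Shrinking the scale on `means` makes the classes overlap more, which lowers accuracy and gives the calibration analysis more to work with.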
Task 2 — measure raw accuracy and calibration¶
For the test set:
- Accuracy. Argmax prediction vs true label.
- Confidence per prediction. \(\max_i q_i(x)\).
- Bin into 10 equal-width bins in confidence ∈ [0, 1].
- Per-bin: accuracy vs average confidence.
- Compute ECE.
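The steps above can be sketched as a single function. This is a minimal version under stated assumptions: the name `expected_calibration_error` is hypothetical, bins are taken as right-closed intervals \((lo, hi]\), and empty bins are skipped so they contribute 0. The six probability rows in the usage lines are made-up values, not lab results.

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    """ECE with equal-width, right-closed confidence bins on [0, 1]."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only: result is an index in 0..n_bins-1 for each prediction.
    bin_idx = np.digitize(confidence, edges[1:-1], right=True)
    n = len(confidence)
    ece = 0.0
    for m in range(n_bins):
        mask = bin_idx == m
        count = int(mask.sum())
        if count == 0:
            continue  # empty bin: weight |B_m| / n is zero
        acc = correct[mask].mean()
        conf = confidence[mask].mean()
        ece += (count / n) * abs(acc - conf)
    return float(ece)

# Usage with made-up softmax outputs (rows sum to 1) and labels.
probs = np.array([
    [0.90, 0.04, 0.03, 0.02, 0.01],
    [0.10, 0.60, 0.10, 0.10, 0.10],
    [0.30, 0.30, 0.20, 0.10, 0.10],
    [0.05, 0.05, 0.80, 0.05, 0.05],
    [0.25, 0.25, 0.25, 0.15, 0.10],
    [0.02, 0.02, 0.02, 0.02, 0.92],
])
labels = np.array([0, 1, 2, 2, 3, 4])
confidence = probs.max(axis=1)              # max_i q_i(x)
correct = probs.argmax(axis=1) == labels    # argmax prediction vs true label
ece = expected_calibration_error(confidence, correct)
```

With a 5-way softmax the maximum probability is never below 0.2, so the two lowest bins are always empty.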
Task 3 — reliability diagram¶
Plot: x-axis = bin midpoint (confidence), y-axis = accuracy in that bin. The 45° line is perfect calibration. Bars above the diagonal = under-confident; bars below = over-confident.
Save as experiments/<date>-phase-05-calibration/reliability.png.
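A reliability diagram along these lines could be drawn as follows. The sketch assumes matplotlib is available (the lab text does not name a plotting library), uses a hypothetical `plot_reliability` helper, and draws empty bins as zero-height bars. The demo data is synthetic and calibrated by construction; it writes to a temporary directory, so substitute the lab's experiments path when saving real results.

```python
import os
import tempfile

import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend: write files without a display
import matplotlib.pyplot as plt

def plot_reliability(confidence, correct, out_path, n_bins=10):
    """Bar chart of per-bin accuracy against bin midpoint, plus the diagonal."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    midpoints = (edges[:-1] + edges[1:]) / 2
    bin_idx = np.digitize(confidence, edges[1:-1], right=True)
    acc = np.zeros(n_bins)  # empty bins stay at 0 and show as no bar
    for m in range(n_bins):
        mask = bin_idx == m
        if mask.any():
            acc[m] = correct[mask].mean()
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.bar(midpoints, acc, width=1.0 / n_bins, edgecolor="black",
           alpha=0.7, label="accuracy per bin")
    ax.plot([0, 1], [0, 1], linestyle="--", color="gray",
            label="perfect calibration")
    ax.set_xlabel("confidence")
    ax.set_ylabel("accuracy")
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.legend(loc="upper left")
    fig.savefig(out_path, dpi=150, bbox_inches="tight")
    plt.close(fig)
    return midpoints, acc

# Synthetic demo: each prediction is correct with probability equal to its confidence.
rng = np.random.default_rng(0)
demo_conf = rng.uniform(0.2, 1.0, size=500)
demo_correct = rng.random(500) < demo_conf
out_path = os.path.join(tempfile.mkdtemp(), "reliability.png")
midpoints, acc = plot_reliability(demo_conf, demo_correct, out_path)
```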
Task 4 — temperature scaling for calibration¶
Re-fit a single scalar parameter \(T\) that divides the logits before softmax: \(q_i = \text{softmax}(z / T)_i\). Optimise \(T\) on a held-out validation split to minimise the validation cross-entropy. Then re-measure ECE on the test set.
Expected: ECE drops substantially. This is the central practical finding of Guo et al. 2017 ("On Calibration of Modern Neural Networks"), recapitulated by Borja on a toy.
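Fitting \(T\) can be sketched with a plain grid search over the validation cross-entropy, which stays within NumPy. The grid search is a choice made here for simplicity, not the lab's prescribed optimiser, and the names `log_softmax`, `nll`, and `fit_temperature` are hypothetical. The demo builds logits that are over-confident by construction (true logits multiplied by 3), so the fitted temperature should land near 3.

```python
import numpy as np

def log_softmax(z):
    """Row-wise log-softmax, stabilised by subtracting the row max."""
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def nll(logits, labels, temperature):
    """Mean cross-entropy of softmax(logits / temperature) against labels."""
    logp = log_softmax(logits / temperature)
    return float(-logp[np.arange(len(labels)), labels].mean())

def fit_temperature(val_logits, val_labels, grid=None):
    """Return the grid value of T with the lowest validation cross-entropy."""
    if grid is None:
        grid = np.logspace(-1.0, 1.0, 201)  # assumed search range: 0.1 to 10
    losses = np.array([nll(val_logits, val_labels, t) for t in grid])
    return float(grid[np.argmin(losses)])

# Synthetic demo: labels are sampled from softmax(true_logits), but the
# "model" reports 3x those logits, i.e. it is over-confident.
rng = np.random.default_rng(0)
n, k = 2000, 5
true_logits = rng.standard_normal((n, k))
p = np.exp(log_softmax(true_logits))
u = rng.random(n)
# Inverse-CDF sampling; the minimum guards against cumsum rounding below 1.
labels = np.minimum((p.cumsum(axis=1) < u[:, None]).sum(axis=1), k - 1)
val_logits = 3.0 * true_logits
t_star = fit_temperature(val_logits, labels)
print(f"fitted temperature: {t_star:.2f}")
```

In the lab itself, `val_logits` would come from the held-out validation split, and the fitted \(T\) would then be applied to the test logits before recomputing ECE. Dividing by \(T\) does not change the argmax, so accuracy is unaffected.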
Task 5 — diagnose¶
Where does your classifier sit before scaling? Over-confident, under-confident, or roughly calibrated? Why does this happen? (Hint: cross-entropy minimisation on finite data tends to push confidences too high.)
Measurements to capture¶
- ECE before and after temperature scaling.
- Reliability diagrams: before and after.
- Optimal \(T^*\) — typically \(T^* > 1\) for over-confident models.
- Manifest in experiments/<date>-phase-05-calibration/manifest.json.
Acceptance¶
- Toy classifier trained; test accuracy reasonable (≥ 70%).
- ECE computed; reliability diagram saved.
- Temperature scaling implemented and applied.
- Post-scaling ECE strictly lower than pre-scaling ECE (or document the rare case where it isn't).
- Written diagnosis: 2-3 paragraphs in your lab notes.
Pitfalls to expect¶
- Empty bins. With 1000 test points and 10 bins, some bins may have 0 predictions; their contribution to ECE is 0 by convention.
- Choice of \(M\) (number of bins) affects ECE — too few hides errors, too many is noisy. \(M = 10\) or \(M = 15\) are standard.
- Temperature scaling minimises CE, not ECE directly. They are correlated but not identical. The fact that minimising CE on a held-out split helps ECE is non-obvious and is the punchline of the lab.
- Don't mix train and validation for temperature fitting — that's leakage.
Next: Phase 06 — Python for AI Engineering (after /quiz 05 and /phase-report 05).