02 - Metrics math: derivations behind each formula
Four short derivations. Knowing where each metric comes from tells you when it does not apply.
1. Perplexity from cross-entropy
Given a model \(p_\theta(y | x)\) and a sequence of held-out tokens \(y_1, \ldots, y_N\) with contexts \(x_1, \ldots, x_N\), the per-token cross-entropy is:
\[\bar{\mathcal{L}} = -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(y_i | x_i)\]
Equivalently, the negative average log-likelihood. Perplexity is the exponential:
\[\text{PPL} = \exp(\bar{\mathcal{L}})\]
Derivation of "effective branching factor"
For a uniform distribution over \(V\) tokens: \(\log p = -\log V\), so \(\bar{\mathcal{L}} = \log V\) and \(\text{PPL} = V\).
For the model: \(\text{PPL}\) is the size of the uniform distribution that would give the same cross-entropy, i.e. the number of equally likely tokens the model is effectively choosing among. A model with PPL 7 is "as uncertain as a uniform random choice over 7 tokens."
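As a quick check of the branching-factor reading, here is a minimal sketch that computes perplexity from per-token natural-log probabilities. The function name `perplexity` and the list-of-floats input are assumptions for illustration, not part of Phase 20's code.

```python
import math


def perplexity(token_log_probs: list[float]) -> float:
    """Perplexity from natural-log probabilities of the held-out tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    # Per-token cross-entropy in nats: negative average log-likelihood.
    cross_entropy = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(cross_entropy)


# A uniform model over V = 7 tokens assigns log p = -log 7 to every token,
# so its perplexity should come out at (approximately) 7.
uniform_log_probs = [-math.log(7)] * 50
print(perplexity(uniform_log_probs))
```

If the log-probabilities come in base 2, the exponential has to be base 2 as well; the sketch assumes nats throughout.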
Comparing perplexities
Two models on the same test set are directly comparable. Two models on different test sets are not, even if both sets have the same vocabulary, because PPL depends on the data distribution. In particular, EN-only PPL is not comparable to ES-only PPL even if the model is bilingual — the languages have different intrinsic entropy on the §A13 scope.
Phase 20's reports always identify the test set hash and language split so PPLs are unambiguously comparable.
2. Pass@k unbiased estimator
Given a model and a prompt, sample \(n\) outputs at temperature \(T\). Let \(c\) be the number of correct samples. The naive estimator \(c/n\) is unbiased for \(\text{pass@}1\), but its plug-in extension \(1 - (1 - c/n)^k\) is biased low for \(\text{pass@}k\) when \(k > 1\).
The Chen et al. 2021 unbiased estimator:
\[\widehat{\text{pass@}k} = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\]
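Why this is unbiased: imagine drawing \(k\) samples without replacement from your \(n\). The probability that all \(k\) are wrong is \(\binom{n-c}{k} / \binom{n}{k}\) (choose \(k\) from the \(n-c\) wrong). One minus that is the probability that at least one of the \(k\) is right. Every size-\(k\) subset is itself a set of \(k\) independent samples from the model, so the estimator averages unbiased single-subset estimates of \(\text{pass@}k\) and stays unbiased.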
For \(n \to \infty\) at fixed correctness rate \(p = c/n\):
\[\widehat{\text{pass@}k} \to 1 - (1-p)^k\]
which matches the binomial intuition.
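The unbiasedness claim can be checked exactly rather than by simulation. The sketch below assumes each sample is independently correct with probability \(p\), so \(c \sim \text{Binomial}(n, p)\), and sums the estimator over every possible \(c\). The helper names (`pass_at_k_estimate`, `expected_estimate`, `expected_plug_in`) are hypothetical and exist only for this illustration.

```python
import math


def pass_at_k_estimate(n: int, c: int, k: int) -> float:
    """Combinatorial pass@k estimate from n samples with c correct."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def binomial_pmf(n: int, c: int, p: float) -> float:
    """Probability of exactly c successes in n independent trials."""
    return math.comb(n, c) * p**c * (1.0 - p) ** (n - c)


def expected_estimate(n: int, k: int, p: float) -> float:
    """Exact expectation of the combinatorial estimator over c ~ Binomial(n, p)."""
    return sum(binomial_pmf(n, c, p) * pass_at_k_estimate(n, c, k) for c in range(n + 1))


def expected_plug_in(n: int, k: int, p: float) -> float:
    """Exact expectation of the plug-in estimator 1 - (1 - c/n)^k."""
    return sum(binomial_pmf(n, c, p) * (1.0 - (1.0 - c / n) ** k) for c in range(n + 1))


true_value = 1.0 - (1.0 - 0.1) ** 10
print(true_value, expected_estimate(20, 10, 0.1), expected_plug_in(20, 10, 0.1))
```

The first two printed values should agree to floating-point precision, while the plug-in expectation should come out lower, which is the bias the combinatorial form removes.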
Implementation
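import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if n - c < k:  # fewer than k wrong samples: every k-subset contains a hit
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)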
Numerically stable for the sizes we use (n=20, k=10, c=0..n). For much larger n, use the log-Gamma form.
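For reference, one way the log-Gamma form could look is sketched below. It uses `math.lgamma` to work with log-binomials instead of large integers; the names `log_comb` and `pass_at_k_lgamma` are assumptions for illustration and are not Phase 20's implementation.

```python
import math


def log_comb(a: int, b: int) -> float:
    """Natural log of C(a, b) via log-Gamma; assumes 0 <= b <= a."""
    return math.lgamma(a + 1) - math.lgamma(b + 1) - math.lgamma(a - b + 1)


def pass_at_k_lgamma(n: int, c: int, k: int) -> float:
    """pass@k estimate computed in log space to avoid huge binomials."""
    if n - c < k:
        return 1.0
    # log of C(n - c, k) / C(n, k), the probability that all k picks are wrong
    log_all_wrong = log_comb(n - c, k) - log_comb(n, k)
    # -expm1(x) equals 1 - exp(x) and stays accurate when x is close to 0
    return -math.expm1(log_all_wrong)


print(pass_at_k_lgamma(20, 3, 10))
```

At the sizes used here it should agree with the `math.comb` version to many decimal places; the log-space form only starts to matter when the binomials become very large.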
When pass@k is appropriate
- The task allows multiple attempts (alternative phrasings, paraphrases).
- The judge / verifier is automated and cheap.
- The model is sampled at moderate-to-high temperature for diversity.
When not appropriate: classification tasks (where you only get one answer), production deployments that don't tolerate "best of k" gymnastics. For Phase 20, pass@k is a sanity signal on the free-form completion sub-eval, not a headline.
3. ECE: Expected Calibration Error
For classification with predicted class \(\hat y_i\) and predicted confidence \(\hat p_i\):
- Bin samples by confidence into \(M\) equal-width bins (default \(M=10\): bin 1 is \([0, 0.1)\), bin 10 is \([0.9, 1.0]\)).
- For each bin \(B_m\):
    - \(\text{acc}(B_m) = \frac{1}{|B_m|}\sum_{i \in B_m} \mathbb{1}[\hat y_i = y_i]\)
    - \(\text{conf}(B_m) = \frac{1}{|B_m|}\sum_{i \in B_m} \hat p_i\)
- \(\text{ECE} = \sum_m \frac{|B_m|}{N} |\text{acc}(B_m) - \text{conf}(B_m)|\)
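The binning steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `expected_calibration_error` and its argument names are hypothetical, `confidences` holds \(\hat p_i\) in \([0, 1]\), and `correct` holds 1 where \(\hat y_i = y_i\) and 0 otherwise.

```python
def expected_calibration_error(
    confidences: list[float], correct: list[int], num_bins: int = 10
) -> float:
    """ECE with equal-width confidence bins; empty bins contribute nothing."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")

    conf_sum = [0.0] * num_bins
    hits = [0] * num_bins
    count = [0] * num_bins
    for p, ok in zip(confidences, correct):
        # Equal-width bin index; p == 1.0 is folded into the last bin.
        m = min(int(p * num_bins), num_bins - 1)
        conf_sum[m] += p
        hits[m] += ok
        count[m] += 1

    ece = 0.0
    for m in range(num_bins):
        if count[m] == 0:
            continue
        acc = hits[m] / count[m]
        conf = conf_sum[m] / count[m]
        ece += (count[m] / n) * abs(acc - conf)
    return ece


# Ten predictions at confidence 0.95 with eight correct: |0.8 - 0.95| = 0.15.
print(expected_calibration_error([0.95] * 10, [1] * 8 + [0] * 2))
```

Skipping empty bins is a convention choice; it matches the weighted sum, since an empty bin has weight zero.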
Geometric interpretation
Plot \((\text{conf}(B_m), \text{acc}(B_m))\) as points on the unit square. The diagonal line \(y = x\) is "perfectly calibrated." ECE is the average vertical distance of the points from the diagonal, weighted by bin population.
Worked example
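Three bins. Bin 1: 10 samples, average conf 0.55, accuracy 0.4. Bin 2: 30 samples, average conf 0.75, accuracy 0.7. Bin 3: 60 samples, average conf 0.95, accuracy 0.83.
\[\text{ECE} = \tfrac{10}{100}|0.4 - 0.55| + \tfrac{30}{100}|0.7 - 0.75| + \tfrac{60}{100}|0.83 - 0.95| = 0.015 + 0.015 + 0.072 = 0.102\]
So the model is, on average, off by about 10 percentage points.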
Sensitivity to \(M\)
Few bins (small \(M\)) → coarse, possibly missing miscalibration in narrow regions. Many bins → fine but noisy if \(N\) is small. The "\(M=10\) + \(N\)=60-100" combination averages ~6-10 samples/bin, which is about the minimum for stable estimates. Phase 20's reports note the bin count.
For very small \(N\), use adaptive ECE, which uses equal-population bins instead of equal-width ones. It is not implemented in Phase 20; it is a stretch goal.
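Since the adaptive variant is not part of Phase 20, the following is only a sketch of what equal-population binning might look like: sort by confidence, cut the sorted list into \(M\) contiguous chunks of near-equal size, and apply the same weighted sum. The name `adaptive_ece` and the chunking rule are assumptions, not an existing implementation.

```python
def adaptive_ece(confidences: list[float], correct: list[int], num_bins: int = 10) -> float:
    """ECE with equal-population bins built from the confidence ranking."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")

    # Never ask for more bins than samples, so every bin is non-empty.
    num_bins = min(num_bins, n)
    order = sorted(range(n), key=lambda i: confidences[i])

    ece = 0.0
    for m in range(num_bins):
        # Contiguous slice of the ranking; sizes differ by at most one.
        start = m * n // num_bins
        stop = (m + 1) * n // num_bins
        members = order[start:stop]
        acc = sum(correct[i] for i in members) / len(members)
        conf = sum(confidences[i] for i in members) / len(members)
        ece += (len(members) / n) * abs(acc - conf)
    return ece


print(adaptive_ece([0.6, 0.7, 0.8, 0.9], [0, 1, 1, 1], num_bins=2))
```

One caveat of this sketch: when many samples share the same confidence, which bin each tied sample lands in depends on the sort order, so the result can shift slightly between runs that order ties differently.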
4. Brier score
For binary classification with predicted probability \(p_i\) of the positive class and label \(y_i \in \{0, 1\}\):
\[\text{BS} = \frac{1}{N} \sum_{i=1}^{N} (p_i - y_i)^2\]
Lower = better. Range \([0, 1]\). Perfect prediction: 0. Pure random (predicting 0.5 always): 0.25.
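A minimal sketch of the binary Brier score, mainly to make the two reference points concrete. The function name `brier_score` is an assumption for illustration.

```python
def brier_score(probs: list[float], labels: list[int]) -> float:
    """Mean squared difference between predicted probability and 0/1 label."""
    if not probs or len(probs) != len(labels):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


labels = [0, 1, 0, 1]
print(brier_score([0.0, 1.0, 0.0, 1.0], labels))  # perfect predictions: 0.0
print(brier_score([0.5, 0.5, 0.5, 0.5], labels))  # always 0.5: 0.25
```

Note that always predicting 0.5 scores 0.25 regardless of the label mix, because every squared error is exactly \(0.5^2\).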
Decomposition
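Brier decomposes into three components (Murphy 1973):
\[\text{BS} = \text{Reliability} - \text{Resolution} + \text{Uncertainty}\]
- Reliability (calibration term): like ECE but squared. Lower = more calibrated.
- Resolution: how far the observed positive rate within each group of similar predictions departs from the overall base rate. Higher = more discriminating.
- Uncertainty: the variance of the labels, \(\bar y (1 - \bar y)\) for base rate \(\bar y\). It is largest for balanced classes and independent of the model.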
A model can have low Brier by being calibrated, by being discriminating, or by working on a low-uncertainty task. The decomposition tells you which.
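To see the three terms add up, here is a sketch of the decomposition that groups samples by identical predicted probability, the case in which Brier = reliability − resolution + uncertainty holds exactly. The function name `murphy_decomposition` is hypothetical, and grouping by exact forecast value (rather than by confidence bin) is an assumption made to keep the identity exact.

```python
from collections import defaultdict


def murphy_decomposition(probs: list[float], labels: list[int]) -> tuple[float, float, float]:
    """Return (reliability, resolution, uncertainty) for binary forecasts."""
    n = len(probs)
    if n == 0 or n != len(labels):
        raise ValueError("inputs must be non-empty and of equal length")
    base_rate = sum(labels) / n

    # Group labels by the exact forecast value that was issued for them.
    groups: dict[float, list[int]] = defaultdict(list)
    for p, y in zip(probs, labels):
        groups[p].append(y)

    reliability = 0.0
    resolution = 0.0
    for p, ys in groups.items():
        observed = sum(ys) / len(ys)  # positive rate within this group
        reliability += len(ys) * (p - observed) ** 2
        resolution += len(ys) * (observed - base_rate) ** 2
    uncertainty = base_rate * (1.0 - base_rate)
    return reliability / n, resolution / n, uncertainty


probs = [0.2, 0.2, 0.8, 0.8, 0.8]
labels = [0, 1, 1, 1, 0]
rel, res, unc = murphy_decomposition(probs, labels)
brier = sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)
print(rel, res, unc, rel - res + unc, brier)
```

With continuous confidences, where almost every forecast value is unique, the groups would first have to be formed by binning, and the identity then holds only approximately.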
Phase 20 reports Brier alongside ECE; the decomposition is a stretch goal in lab/02-calibration-and-adversarial.md.
Multi-class Brier
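For three classes (correct, incorrect, ambiguous):
\[\text{BS} = \frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} (p_{i,c} - y_{i,c})^2\]
where \(C=3\), \(p_{i,c}\) is the predicted probability of class \(c\) for sample \(i\), and \(y_{i,c}\) is 1 if sample \(i\) has true class \(c\) and 0 otherwise. This combines the \(C\) one-vs-rest binary Brier terms; the result is sometimes divided by 2 (the "Brier" vs "half-Brier" convention). Phase 20 uses the version without the /2.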
5. Confusion matrix metrics
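The matrix \(M\) has true classes as rows and predicted classes as columns, both in the order C, I, A:
\[M = \begin{pmatrix} \text{TP}_C & \cdot & \cdot \\ \cdot & \text{TP}_I & \cdot \\ \cdot & \cdot & \text{TP}_A \end{pmatrix}\]
(only diagonals shown for brevity; C = correct, I = incorrect, A = ambiguous)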
From it:
- Per-class precision: of all predictions of class \(c\), how many were really \(c\)? \(P_c = \text{TP}_c / \sum_i M[i, c]\).
- Per-class recall: of all true examples of class \(c\), how many did we predict? \(R_c = \text{TP}_c / \sum_j M[c, j]\).
- Per-class F1: \(F1_c = 2 P_c R_c / (P_c + R_c)\).
Phase 20 reports per-class precision and recall in a table; F1 is mentioned but not the headline because the harmonic-mean compression often hides the actual P/R tradeoff.
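The per-class formulas can be sketched directly from a nested-list confusion matrix with rows as true classes and columns as predicted classes. The function name `per_class_metrics` and the counts in the example matrix are invented for illustration; they are not Phase 20 results.

```python
def per_class_metrics(
    matrix: list[list[int]], class_names: list[str]
) -> dict[str, tuple[float, float, float]]:
    """Map each class name to (precision, recall, F1); matrix[true][predicted]."""
    metrics = {}
    for c, name in enumerate(class_names):
        true_positives = matrix[c][c]
        predicted_as_c = sum(row[c] for row in matrix)  # column sum
        actually_c = sum(matrix[c])                     # row sum
        precision = true_positives / predicted_as_c if predicted_as_c else 0.0
        recall = true_positives / actually_c if actually_c else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[name] = (precision, recall, f1)
    return metrics


# Hypothetical counts; rows and columns are ordered C, I, A.
example = [
    [40, 5, 5],  # true = correct
    [8, 30, 2],  # true = incorrect
    [2, 3, 5],   # true = ambiguous
]
for name, (p, r, f1) in per_class_metrics(example, ["C", "I", "A"]).items():
    print(name, round(p, 3), round(r, 3), round(f1, 3))
```

Returning 0.0 when a class has no predictions or no true examples is a convention chosen for this sketch; some tools report such cases as undefined instead.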
The most diagnostic cell for our task is (true=incorrect, predicted=correct) — the model failing to catch an error. For a tutor agent, this is the danger: silently confirming wrong grammar to a learner.
6. Significance, briefly
When comparing two checkpoints, are the differences in accuracy meaningful? With \(N = 60\) probes, the binomial standard error at \(p = 0.8\) is:
\[\text{SE} = \sqrt{\frac{p(1-p)}{N}} = \sqrt{\frac{0.8 \times 0.2}{60}} \approx 0.052\]
So differences smaller than ~5 percentage points are within noise. Phase 20's reports include a confidence-interval column (Wilson interval, not normal-approx because it's better at boundary values).
Wilson interval for proportion \(p\) with \(N\) samples at 95% confidence:
\[\frac{p + \frac{z^2}{2N} \pm z \sqrt{\frac{p(1-p)}{N} + \frac{z^2}{4N^2}}}{1 + \frac{z^2}{N}}\]
where \(z = 1.96\) for 95%.
For \(N=60, p=0.8\): \([0.68, 0.88]\). Comparing this checkpoint to one at \(p=0.85\) (with overlapping CI) is not a meaningful win.
Phase 20's report flags differences as "significant" or "within noise" using the overlapping-Wilson-CI heuristic.
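A sketch of the Wilson interval and the overlap heuristic follows. The function names `wilson_interval` and `intervals_overlap` are assumptions for illustration; the report's actual code may differ.

```python
import math


def wilson_interval(p: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for an observed proportion p over n samples."""
    if n <= 0:
        raise ValueError("n must be positive")
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


baseline = wilson_interval(0.80, 60)
candidate = wilson_interval(0.85, 60)
print([round(x, 2) for x in baseline])  # [0.68, 0.88]
print("within noise" if intervals_overlap(baseline, candidate) else "significant")
```

Overlapping intervals are a conservative heuristic: two intervals can overlap even when a direct test on the difference would call it significant, so "within noise" here means "not clearly separated".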
One-paragraph recap
PPL is \(e^{\text{CE}}\); pass@k uses Chen's unbiased binomial estimator; ECE bins predictions by confidence and measures the weighted vertical deviation from the perfect-calibration diagonal; Brier squares the prediction-vs-label error and decomposes into calibration + resolution + uncertainty terms. Per-class metrics from the confusion matrix give precision/recall/F1 with Wilson intervals for significance. Every formula is a few lines and a few decisions of convention (the bin count, the temperature, the class weighting); the choices are documented per-report so comparisons stay valid.
Next: theory/03-probe-construction.md.