Lab 01: Wire the eval harness (perplexity + per-slice accuracy)
Goal: build the eval harness end-to-end. Load a checkpoint, run it over the held-out test set for PPL, run it over the probe set for classification, and emit a JSON results blob plus a per-slice accuracy table.
Estimated time: 120-180 minutes.
Prereq: Lab 00 done (probe schema + validated probe set). Phase 18 produced a checkpoint at experiments/18-train/checkpoints/final.pt, and Phase 19 produced a best-val checkpoint at experiments/19-debug/checkpoints/best.pt.
What you produce
New modules and tests:
- src/eval/harness.py: load a checkpoint, run PPL on a tokenized test set, run classification on probes, emit JSON.
- src/eval/classify.py: the multiple-choice scoring function (likelihood-based).
- tests/eval/test_harness.py: unit tests on a tiny fixture model.
Run output:
- experiments/20-eval-report/<checkpoint_name>/results.json: all the numbers in one place.
- experiments/20-eval-report/<checkpoint_name>/per_slice.csv: per-slice accuracy table.
- experiments/20-eval-report/<checkpoint_name>/per_slice.png: bar chart of the same.
The classification scoring function
The model is a language model, not a classifier. To score a multiple-choice probe with candidates = [c_1, ..., c_K]:
- Render the prompt with each candidate filled into the ___ blank → K full sentences.
- For each sentence, compute the model's log-likelihood of the candidate tokens conditioned on the prompt prefix (sum of log p(token_i | context) over the candidate's tokens).
- Pick argmax_k log_p(c_k). That's the prediction.
- Confidence = softmax(log_p)_{argmax} over the candidate set (not the full vocab).
Edge cases:
- Candidates with different token counts. Use length-normalized log-likelihood (divide by token count) to avoid biasing toward shorter forms. Document this choice in the report.
- Tokens not in vocab. If a candidate tokenizes to <unk>, its log-prob is not a meaningful score for that form; flag the probe as model_oov and exclude it from accuracy aggregation (count it separately).
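A minimal sketch of the model_oov bookkeeping described in the second edge case. It assumes the tokenizer maps unknown pieces to a single id (UNK_ID is a hypothetical value) and that each scored probe is a small record carrying its candidates' token ids; the field names are illustrative, not the Lab 00 schema.

```python
UNK_ID = 1  # hypothetical id of <unk>; read it from the Phase 11 tokenizer in practice

def is_model_oov(candidate_token_ids, unk_id=UNK_ID):
    """True if any candidate of the probe contains at least one <unk> id."""
    return any(unk_id in ids for ids in candidate_token_ids)

def accuracy_excluding_oov(probes):
    """Return (accuracy, n_scored, n_oov); OOV probes are counted but not scored."""
    scored = [p for p in probes if not is_model_oov(p["candidate_token_ids"])]
    n_oov = len(probes) - len(scored)
    if not scored:
        return (float("nan"), 0, n_oov)
    correct = sum(1 for p in scored if p["correct"])
    return (correct / len(scored), len(scored), n_oov)

# Hypothetical records: the second probe has a candidate containing <unk>.
probes = [
    {"candidate_token_ids": [[5, 6], [7]], "correct": True},
    {"candidate_token_ids": [[8], [1, 9]], "correct": False},
    {"candidate_token_ids": [[3], [4]], "correct": False},
]
acc, n_scored, n_oov = accuracy_excluding_oov(probes)
print(acc, n_scored, n_oov)
```

Reporting n_oov next to the accuracy keeps the denominator honest: a reader can see how many probes were dropped rather than silently scored.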
    import numpy as np
    import torch


    @torch.no_grad()
    def score_candidate(model, tokenizer, prompt: str, candidate: str) -> float:
        """Length-normalized log-likelihood of `candidate` given `prompt`."""
        if "___" not in prompt:
            raise ValueError("Prompt has no ___ blank")
        # In a causal LM only the text before the blank conditions the candidate,
        # so anything after `___` is dropped before tokenizing.
        before = prompt.split("___", 1)[0]
        prefix = tokenizer.encode(before.rstrip())
        full = tokenizer.encode(before + candidate)
        # candidate token ids are the suffix of `full` beyond `prefix`
        # (assumes `prefix` tokenizes to a true prefix of `full`)
        cand_ids = full[len(prefix):]
        if not cand_ids:
            raise ValueError("Empty candidate after tokenization")
        if not prefix:
            raise ValueError("Empty prefix: no position predicts the first candidate token")
        # forward pass to get log-probs for the candidate positions
        ids = torch.tensor([full], dtype=torch.long)
        logits = model(ids).logits[0]  # (T, V)
        log_probs = torch.log_softmax(logits, dim=-1)
        total = 0.0
        for i, tid in enumerate(cand_ids):
            # position predicting `cand_ids[i]` is `len(prefix) + i - 1`
            pos = len(prefix) + i - 1
            total += log_probs[pos, tid].item()
        return total / len(cand_ids)


    def classify_probe(model, tokenizer, probe) -> tuple[str, float]:
        """Returns (predicted_form, confidence)."""
        scores = [score_candidate(model, tokenizer, probe.prompt, c)
                  for c in probe.candidates]
        s = np.array(scores)
        probs = np.exp(s - s.max())
        probs /= probs.sum()  # softmax over candidate set
        k = int(np.argmax(probs))
        return probe.candidates[k], float(probs[k])
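To see why the length normalization matters, here is a small worked example with made-up per-token log-probabilities (the numbers and candidate strings are hypothetical, not model output). It compares the raw sum against the per-token mean, then applies the candidate-set softmax used for confidence.

```python
import math

# Hypothetical per-token log-probs for two candidates (illustrative numbers only).
token_log_probs = {
    "ran": [-2.0],                # one token
    "was running": [-1.2, -1.2],  # two tokens, each individually more likely
}

def raw_score(lps):
    """Sum of token log-probs: every extra token makes the score more negative."""
    return sum(lps)

def normalized_score(lps):
    """Mean token log-prob: the length-normalized score used in this lab."""
    return sum(lps) / len(lps)

def softmax(scores):
    """Numerically stable softmax over the candidate set only."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

names = list(token_log_probs)
raw = [raw_score(token_log_probs[n]) for n in names]
norm = [normalized_score(token_log_probs[n]) for n in names]

raw_pick = names[raw.index(max(raw))]     # winner under the raw sum
norm_pick = names[norm.index(max(norm))]  # winner under the per-token mean
confidence = max(softmax(norm))

print(raw_pick, norm_pick, round(confidence, 3))
```

With these numbers the raw sum prefers the one-token form even though each token of the longer form is more likely, which is the bias the normalization removes. Note that the confidence is computed on the normalized scores, so it is a relative score over the candidate set rather than a calibrated probability.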
TODOs
Block A: src/eval/harness.py skeleton
    from pathlib import Path
    import json


    def run_eval(checkpoint_path: Path,
                 probes_path: Path,
                 test_path: Path,
                 out_dir: Path,
                 seed: int = 42) -> dict:
        """Load checkpoint, compute PPL on `test_path`, classify probes, write
        results.json + per_slice.csv + per_slice.png. Returns the results dict."""
        ...
- seed_everything(seed) at entry.
- Load model + tokenizer from checkpoint_path.
- Compute ppl_test, ppl_train, ppl_val (last two for sanity: train should be lowest, test highest; if not, something is leaking).
- Compute PPL split by language (EN-only PPL, ES-only PPL).
- Iterate probes, call classify_probe, collect (probe.id, predicted, confidence, correct) tuples.
- Aggregate accuracy by slice: overall, by_language, by_regularity, by_tense, by_person, by_verb.
- Write results.json with all numbers + the per-probe predictions list.
- Write per_slice.csv (one row per slice, columns: slice_name, slice_value, n, correct, accuracy, wilson_lo, wilson_hi).
- Plot per_slice.png (bar chart, 4 panels: language, regularity, tense, person).
- Persist manifest.json with model hash, probe-set hash, code commit hash, eval timestamp, seed.
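The slice-aggregation step in the list above could be sketched as follows. The record layout (dicts with language, regularity, and correct keys) and the helper name aggregate_by are assumptions made for illustration; the real probe schema comes from Lab 00.

```python
from collections import defaultdict

def aggregate_by(records, key):
    """Group prediction records by `key`; return {value: (n, correct, accuracy)}."""
    counts = defaultdict(lambda: [0, 0])  # slice value -> [n, correct]
    for rec in records:
        cell = counts[rec[key]]
        cell[0] += 1
        cell[1] += int(rec["correct"])
    # Sort slice values so the output order is stable across runs.
    return {v: (n, c, c / n) for v, (n, c) in sorted(counts.items())}

# Hypothetical per-probe records (field names are assumptions, not the Lab 00 schema).
records = [
    {"language": "en", "regularity": "regular", "correct": True},
    {"language": "en", "regularity": "irregular", "correct": False},
    {"language": "es", "regularity": "regular", "correct": True},
    {"language": "es", "regularity": "irregular", "correct": False},
    {"language": "en", "regularity": "regular", "correct": True},
]

by_language = aggregate_by(records, "language")
by_regularity = aggregate_by(records, "regularity")
print(by_language)
print(by_regularity)
```

Each (n, correct, accuracy) triple maps directly onto one row of per_slice.csv; the Wilson bounds can be added per row from n and correct.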
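Block B: perplexity computation
    import numpy as np
    import torch


    @torch.no_grad()
    def compute_ppl(model, tokenizer, sentences: list[str]) -> float:
        """Token-weighted perplexity: exp(total NLL / total predicted tokens)."""
        total_nll = 0.0
        total_tokens = 0
        for s in sentences:
            ids = torch.tensor([tokenizer.encode(s)], dtype=torch.long)
            if ids.shape[1] < 2:
                continue  # nothing to predict from a 0- or 1-token sentence
            # teacher-forced loss over (ids[:-1] → ids[1:])
            logits = model(ids[:, :-1]).logits[0]
            targets = ids[0, 1:]
            nll = torch.nn.functional.cross_entropy(
                logits, targets, reduction="sum"
            ).item()
            total_nll += nll
            total_tokens += targets.numel()
        if total_tokens == 0:
            raise ValueError("No scorable tokens in `sentences`")
        return float(np.exp(total_nll / total_tokens))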
- Vectorize: batch the sentences with padding + a mask if the model supports it (Phase-17 model should). Otherwise per-sentence is fine for our N.
- Add Wilson-interval helper (used for accuracy bounds, not PPL).
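When batching with padding, pad positions must contribute neither to the NLL sum nor to the token count. The sketch here shows that bookkeeping in plain Python on hypothetical per-token NLL values (masked_ppl is an illustrative name); in the real harness the same mask would be applied to the per-token cross-entropy tensor.

```python
import math

def masked_ppl(token_nll, mask):
    """Perplexity from per-token NLLs, counting only positions where mask == 1."""
    total_nll = 0.0
    total_tokens = 0
    for row_nll, row_mask in zip(token_nll, mask):
        for nll, keep in zip(row_nll, row_mask):
            if keep:
                total_nll += nll
                total_tokens += 1
    if total_tokens == 0:
        raise ValueError("mask selects no tokens")
    return math.exp(total_nll / total_tokens)

# Two sentences padded to length 4 (values are made up for illustration).
# The second sentence has 2 real target tokens and 2 pad positions.
token_nll = [
    [1.0, 2.0, 1.0, 2.0],
    [1.5, 1.5, 9.0, 9.0],  # the 9.0 entries sit on pad positions
]
mask = [
    [1, 1, 1, 1],
    [1, 1, 0, 0],
]

ppl_masked = masked_ppl(token_nll, mask)
ppl_unmasked = masked_ppl(token_nll, [[1] * 4, [1] * 4])  # the bug: pads counted
print(round(ppl_masked, 3), round(ppl_unmasked, 3))
```

A useful correctness check when vectorizing is that the batched, masked PPL matches the per-sentence loop on the same inputs up to floating-point tolerance.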
Block C: Wilson interval
    import math


    def wilson_interval(c: int, n: int, z: float = 1.96) -> tuple[float, float]:
        """Wilson score interval for c successes out of n trials."""
        if n == 0:
            return (0.0, 0.0)
        p = c / n
        denom = 1 + z*z / n
        centre = (p + z*z / (2*n)) / denom
        margin = (z * math.sqrt(p*(1-p)/n + z*z/(4*n*n))) / denom
        return (max(0.0, centre - margin), min(1.0, centre + margin))
Used per slice. With N=10-20 per cell, intervals will be wide; document in the report.
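To get a feel for how wide the intervals are at these cell sizes, the sketch here evaluates the Block C formula at a few sample sizes with a fixed 75% observed accuracy. The counts are hypothetical, and the function is repeated only so the snippet runs on its own.

```python
import math

def wilson_interval(c, n, z=1.96):
    """Same formula as Block C, repeated so this snippet is self-contained."""
    if n == 0:
        return (0.0, 0.0)
    p = c / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    margin = (z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))) / denom
    return (max(0.0, centre - margin), min(1.0, centre + margin))

# Hypothetical cells, all at 75% observed accuracy.
widths = {}
for c, n in [(3, 4), (15, 20), (60, 80)]:
    lo, hi = wilson_interval(c, n)
    widths[n] = hi - lo
    print(f"n={n:3d}  acc={c / n:.2f}  CI=[{lo:.2f}, {hi:.2f}]  width={hi - lo:.2f}")
```

The width shrinks roughly with the square root of n, so a slice needs several times more probes before two slices with similar accuracy can be told apart.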
Block D: tests/eval/test_harness.py
- test_score_candidate_picks_correct: on a tiny fixture model that hard-codes "works" as the top continuation of "She ___", verify classify_probe returns "works".
- test_length_normalization: given two candidates of different token counts, verify the longer one isn't unfairly penalized by raw (non-normalized) log-prob.
- test_ppl_finite: on a sentence where the model assigns nonzero probability to every token, PPL is finite and > 1.
- test_wilson_at_boundary: wilson_interval(0, 10) returns approximately (0.0, 0.28); wilson_interval(10, 10) returns approximately (0.72, 1.0). These are the boundary cases where the normal approximation breaks down and collapses to a zero-width interval.
- test_slice_aggregation: given a known set of probe predictions, the aggregator produces the expected per-slice counts and accuracies.
Block E: run it
    just eval CHECKPOINT=experiments/18-train/checkpoints/final.pt
    just eval CHECKPOINT=experiments/19-debug/checkpoints/best.pt
Both runs produce their own subdirectories under experiments/20-eval-report/. Lab 03 compares them.
Constraints
- Greedy / temperature-0 only. No sampling in Lab 01. Lab 02 adds confidence-based metrics; Lab 03 (or Phase 21) does sampling.
- CPU-only is fine. N ≈ 80 probes; PPL test set ≈ 200 sentences. Should run in < 60 seconds on Borja's i5-8250U.
- Deterministic. Same checkpoint + same probe set + same seed = byte-identical results.json.
Stop conditions
Done when:
- pytest tests/eval/test_harness.py -v passes.
- just eval CHECKPOINT=... produces results.json, per_slice.csv, per_slice.png under experiments/20-eval-report/<name>/.
- The per-slice bar chart shows four panels (EN/ES split, regular/irregular split, per-tense, per-person), and the relative differences pass the eyeball test (regular > irregular expected; EN > ES if corpus is EN-heavy).
- Re-running the same command gives byte-identical results.json.
Pitfalls
- Off-by-one in candidate-token position. The logits at position i predict the token at position i+1 of the input. Score the candidate by log_probs[len(prefix)-1, cand_ids[0]] + log_probs[len(prefix), cand_ids[1]] + ..., where log_probs is the log-softmax of the logits. Get this wrong and you'll score the prompt tokens, not the candidate.
- Tokenizer adds a BOS token. If tokenizer.encode prepends <bos>, your prefix length is +1 and the slicing is off. Match what Phase 11 produced.
- Padding affects PPL. Pad tokens must be masked out of the NLL sum. Easier: loop per sentence without batching for the first pass; vectorize once correctness is verified.
- Spanish probes break tokenization. If your BPE (Phase 11) was EN-only, ES probes will tokenize poorly and the model_oov count will spike. Decide in the report whether to retrain the tokenizer (no: that invalidates Phase 18 results) or report the limitation.
- Per-slice cell sizes too small for Wilson CI. With ≥ 20 verbs and only ~60 probes, the per-verb slice has 2-4 probes per verb. The CI will span most of [0.0, 1.0], which is useless. Phase 20 reports per-verb as a table of raw counts, not as percentages with CIs.
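The off-by-one rule can be checked without a model. The sketch here uses hypothetical token ids and a helper named candidate_positions (an illustrative name) to list which row of the log-prob matrix scores each candidate token, with and without a BOS token.

```python
def candidate_positions(prefix_len: int, n_cand: int) -> list[int]:
    """Rows of the (T, V) log-prob matrix that predict each candidate token.

    Row i predicts the token at input position i + 1, so the first candidate
    token (input position `prefix_len`) is predicted by row `prefix_len - 1`.
    """
    if prefix_len < 1:
        raise ValueError("need at least one prefix token to predict the candidate")
    return [prefix_len + i - 1 for i in range(n_cand)]

# Hypothetical ids: prefix "She" -> [11], candidate " works" -> [42, 43].
prefix = [11]
full = [11, 42, 43]
rows = candidate_positions(len(prefix), len(full) - len(prefix))

# Same probe when the tokenizer prepends a BOS id (0 here, an assumption).
prefix_bos = [0, 11]
full_bos = [0, 11, 42, 43]
rows_bos = candidate_positions(len(prefix_bos), len(full_bos) - len(prefix_bos))

# Sanity check: the token each row predicts, full[r + 1], must be a candidate token.
predicted = [full[r + 1] for r in rows]
predicted_bos = [full_bos[r + 1] for r in rows_bos]
print(rows, rows_bos, predicted, predicted_bos)
```

As long as the prefix and the full sequence are encoded the same way, the BOS token shifts every row by one and the candidate tokens stay aligned; the slicing goes wrong when only one of the two lengths accounts for it.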
When to consult solutions/
After all five tests pass and at least one just eval run produces a non-empty per_slice.png. The solution at solutions/01-harness-ref.md (written at phase open) covers the length-normalization gotcha and the batched-PPL path.
Next lab: lab/02-calibration-and-adversarial.md.