Lab 01 — Wire the eval harness: perplexity + per-slice accuracy

Goal: build the eval harness end-to-end. Load a checkpoint, run it over the held-out test set for PPL, run it over the probe set for classification, and emit a JSON results blob plus a per-slice accuracy table.

Estimated time: 120-180 minutes.

Prereq: Lab 00 done (probe schema + validated probe set). Phase 18 produced a checkpoint at experiments/18-train/checkpoints/final.pt and Phase 19 produced a best-val checkpoint at experiments/19-debug/checkpoints/best.pt.


What you produce

New modules and tests:

  • src/eval/harness.py — load a checkpoint, run PPL on a tokenized test set, run classification on probes, emit JSON.
  • src/eval/classify.py — the multiple-choice scoring function (likelihood-based).
  • tests/eval/test_harness.py — unit tests on a tiny fixture model.

Run output:

  • experiments/20-eval-report/<checkpoint_name>/results.json — all the numbers in one place.
  • experiments/20-eval-report/<checkpoint_name>/per_slice.csv — per-slice accuracy table.
  • experiments/20-eval-report/<checkpoint_name>/per_slice.png — bar chart of the same.

The classification scoring function

The model is a language model, not a classifier. To score a multiple-choice probe with candidates = [c_1, ..., c_K]:

  1. Render the prompt with each candidate filled into the ___ blank → K full sentences.
  2. For each sentence, compute the model's log-likelihood of the candidate tokens conditioned on the prompt prefix (sum of log p(token_i | context) over the candidate's tokens).
  3. Pick argmax_k log_p(c_k). That's the prediction.
  4. Confidence = softmax(log_p)_{argmax} over the candidate set (not the full vocab).

Edge cases:

  • Candidates with different token counts. Use length-normalized log-likelihood (divide by token count) to avoid biasing toward shorter forms. Document this choice in the report.
  • Tokens not in vocab. If a candidate tokenizes to <unk>, its log-prob measures the <unk> token rather than the word itself and is not comparable across candidates; flag the probe as model_oov and exclude it from accuracy aggregation (count it separately).

import numpy as np
import torch

@torch.no_grad()
def score_candidate(model, tokenizer, prompt: str, candidate: str) -> float:
    """Length-normalized log-likelihood of `candidate` given `prompt`."""
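    # Prefix = text before the blank. Deleting "___" from the whole prompt would
    # wrongly fold any text after the blank into the prefix.
    prefix = tokenizer.encode(prompt.split("___", 1)[0].rstrip())
    full = tokenizer.encode(prompt.replace("___", candidate))
    # Scored ids are the suffix of `full` beyond `prefix`: the candidate,
    # plus any prompt text that follows the blank.
    cand_ids = full[len(prefix):]
    if not prefix:
        raise ValueError("Empty prefix: no position predicts the first token")
    if not cand_ids:
        raise ValueError("Empty candidate after tokenization")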
    # forward pass to get log-probs for the candidate positions
    ids = torch.tensor([full], dtype=torch.long)
    logits = model(ids).logits[0]  # (T, V)
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    for i, tid in enumerate(cand_ids):
        # position predicting `cand_ids[i]` is `len(prefix) + i - 1`
        pos = len(prefix) + i - 1
        total += log_probs[pos, tid].item()
    return total / len(cand_ids)


def classify_probe(model, tokenizer, probe) -> tuple[str, float]:
    """Returns (predicted_form, confidence)."""
    scores = [score_candidate(model, tokenizer, probe.prompt, c)
              for c in probe.candidates]
    s = np.array(scores)
    probs = np.exp(s - s.max()); probs /= probs.sum()  # softmax over candidate set
    k = int(np.argmax(probs))
    return probe.candidates[k], float(probs[k])
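
The code above does not yet cover the model_oov edge case. A minimal sketch of one way to handle it follows. It assumes the tokenizer's unknown-token id is available as an integer (the `unk_id` argument is hypothetical, as are `ProbeResult` and both helper names), and it works on plain lists so it can be tested without a model.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    probe_id: str
    correct: bool
    model_oov: bool  # True if any candidate tokenized to <unk>


def has_unk(token_ids: list[int], unk_id: int) -> bool:
    """True if the tokenized candidate contains the unknown-token id."""
    return unk_id in token_ids


def accuracy_excluding_oov(results: list[ProbeResult]) -> dict:
    """Accuracy over scorable probes; OOV probes are counted separately."""
    scored = [r for r in results if not r.model_oov]
    n = len(scored)
    correct = sum(r.correct for r in scored)
    return {
        "n_scored": n,
        "n_model_oov": len(results) - n,
        "correct": correct,
        # None rather than 0.0 when nothing could be scored
        "accuracy": correct / n if n else None,
    }
```

Reporting n_model_oov next to the accuracy keeps the denominator visible, so a spike in skipped probes cannot hide behind a good-looking percentage.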

TODOs

Block A — src/eval/harness.py skeleton

from pathlib import Path
import json

def run_eval(checkpoint_path: Path,
             probes_path: Path,
             test_path: Path,
             out_dir: Path,
             seed: int = 42) -> dict:
    """Load checkpoint, compute PPL on `test_path`, classify probes, write
    results.json + per_slice.csv + per_slice.png. Returns the results dict."""
    ...

  • seed_everything(seed) at entry.
  • Load model + tokenizer from checkpoint_path.
  • Compute ppl_test, ppl_train, ppl_val (last two for sanity — train should be lowest, test highest; if not, something is leaking).
  • Compute PPL split by language (EN-only PPL, ES-only PPL).
  • Iterate probes, call classify_probe, collect (probe.id, predicted, confidence, correct) tuples.
  • Aggregate accuracy by slice: overall, by_language, by_regularity, by_tense, by_person, by_verb.
  • Write results.json with all numbers + the per-probe predictions list.
  • Write per_slice.csv (one row per slice, columns: slice_name, slice_value, n, correct, accuracy, wilson_lo, wilson_hi).
  • Plot per_slice.png (bar chart, 4 panels: language, regularity, tense, person).
  • Persist manifest.json with model hash, probe-set hash, code commit hash, eval timestamp, seed.
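
The aggregation step in the list above can be sketched as follows. This is an illustration only: it assumes each probe prediction has been flattened into a dict whose keys match the slice names (the key names and `aggregate_by_slice` are hypothetical; use whatever the Lab 00 schema defines), and it leaves the Wilson columns to Block C.

```python
from collections import defaultdict

# Assumed field names; align these with the Lab 00 probe schema.
SLICE_FIELDS = ("language", "regularity", "tense", "person", "verb")


def aggregate_by_slice(rows: list[dict]) -> list[dict]:
    """rows: one dict per probe with the slice fields plus a boolean `correct`."""
    # (slice_name, slice_value) -> [n, correct]
    counts = defaultdict(lambda: [0, 0])
    for row in rows:
        keys = [("overall", "all")] + [(f, row[f]) for f in SLICE_FIELDS]
        for key in keys:
            counts[key][0] += 1
            counts[key][1] += int(row["correct"])
    # Sorted output keeps per_slice.csv row order stable across runs.
    return [
        {"slice_name": name, "slice_value": value,
         "n": n, "correct": c, "accuracy": c / n}
        for (name, value), (n, c) in sorted(counts.items())
    ]
```

Each returned dict maps directly onto one row of per_slice.csv, minus wilson_lo and wilson_hi.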

Block B — perplexity computation

import math

import torch
import torch.nn.functional as F


@torch.no_grad()
def compute_ppl(model, tokenizer, sentences: list[str]) -> float:
    """Token-weighted perplexity: exp(total NLL / total predicted tokens)."""
    total_nll = 0.0
    total_tokens = 0
    for s in sentences:
        ids = torch.tensor([tokenizer.encode(s)], dtype=torch.long)
        if ids.shape[1] < 2:
            continue  # a 0- or 1-token sentence has nothing to predict
        # teacher-forced loss over (ids[:-1] → ids[1:])
        logits = model(ids[:, :-1]).logits[0]
        targets = ids[0, 1:]
        total_nll += F.cross_entropy(logits, targets, reduction="sum").item()
        total_tokens += targets.numel()
    if total_tokens == 0:
        raise ValueError("No tokens to score")
    return math.exp(total_nll / total_tokens)

  • Vectorize: batch the sentences with padding + a mask if the model supports it (Phase-17 model should). Otherwise per-sentence is fine for our N.
  • Add Wilson-interval helper (used for accuracy bounds, not PPL).
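
If you do batch, the pooling logic is the part that tends to go wrong. The sketch below shows it on plain Python lists, assuming the per-token NLL values have already been computed; `pooled_ppl` and the toy numbers are illustrative, not part of the reference solution.

```python
import math


def pooled_ppl(token_nll: list[list[float]], mask: list[list[int]]) -> float:
    """exp(sum of masked NLL / number of real tokens).

    token_nll[b][t] is the negative log-likelihood at position t of row b;
    mask[b][t] is 1 for a real target token and 0 for padding.
    """
    total_nll = 0.0
    total_tokens = 0
    for nll_row, mask_row in zip(token_nll, mask):
        for nll, m in zip(nll_row, mask_row):
            if m:
                total_nll += nll
                total_tokens += 1
    if total_tokens == 0:
        raise ValueError("mask selects no tokens")
    return math.exp(total_nll / total_tokens)


# Two sentences padded to length 3. The last position of the second row is
# padding; its NLL (9.0 here) is arbitrary and must not be counted.
nll = [[1.0, 2.0, 3.0], [2.0, 2.0, 9.0]]
mask = [[1, 1, 1], [1, 1, 0]]
masked = pooled_ppl(nll, mask)                        # exp(10 / 5)
unmasked = pooled_ppl(nll, [[1, 1, 1], [1, 1, 1]])    # exp(19 / 6), inflated by the pad
```

The sum and the token count are pooled across the whole set before exponentiating, which matches the per-sentence loop in compute_ppl; averaging per-sentence perplexities would give a different number.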

Block C — Wilson interval

import math

def wilson_interval(c: int, n: int, z: float = 1.96) -> tuple[float, float]:
    if n == 0:
        return (0.0, 0.0)
    p = c / n
    denom = 1 + z*z / n
    centre = (p + z*z / (2*n)) / denom
    margin = (z * math.sqrt(p*(1-p)/n + z*z/(4*n*n))) / denom
    return (max(0.0, centre - margin), min(1.0, centre + margin))

Used per slice. With N=10-20 per cell, intervals will be wide; document in the report.
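
To get a feel for "wide", the sketch below compares Wilson with the plain normal-approximation (Wald) interval. `wilson_interval` repeats the Block C formula only so the snippet runs on its own; `wald_interval` is an illustrative helper, not something the harness needs.

```python
import math


def wilson_interval(c: int, n: int, z: float = 1.96) -> tuple[float, float]:
    # Same formula as Block C, repeated so this snippet is standalone.
    p = c / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    margin = (z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))) / denom
    return (max(0.0, centre - margin), min(1.0, centre + margin))


def wald_interval(c: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation interval: p +/- z * sqrt(p(1-p)/n)."""
    p = c / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return (max(0.0, p - margin), min(1.0, p + margin))


for c, n in [(7, 10), (14, 20), (10, 10)]:
    w_lo, w_hi = wilson_interval(c, n)
    a_lo, a_hi = wald_interval(c, n)
    print(f"{c}/{n}: wilson=[{w_lo:.2f}, {w_hi:.2f}] wald=[{a_lo:.2f}, {a_hi:.2f}]")
```

At 70% accuracy the Wilson interval is about 0.50 wide for n=10 and about 0.37 wide for n=20. At 10/10 the Wald interval collapses to a single point at 1.0, while Wilson still reports a lower bound near 0.72, which is why the harness uses Wilson.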

Block D — tests/eval/test_harness.py

  1. test_score_candidate_picks_correct — on a tiny fixture model that hard-codes "works" as the top continuation of "She ___", verify classify_probe returns "works".
  2. test_length_normalization — given two candidates of different token counts, verify the longer one is not penalized merely for its length, as it would be under raw (non-normalized) log-prob.
  3. test_ppl_finite — on a sentence where the model assigns nonzero probability to every token, PPL is finite and > 1.
  4. test_wilson_at_boundary — wilson_interval(0, 10) returns (0.0, hi) with hi ≈ 0.28; wilson_interval(10, 10) returns (lo, 1.0) with lo ≈ 0.72. Compare with approximate equality, not ==. Unlike the normal approximation, which collapses to a zero-width interval at p = 0 or p = 1, Wilson stays well-behaved at the boundaries.
  5. test_slice_aggregation — given a known set of probe predictions, the aggregator produces the expected per-slice counts and accuracies.

Block E — run it

just eval CHECKPOINT=experiments/18-train/checkpoints/final.pt
just eval CHECKPOINT=experiments/19-debug/checkpoints/best.pt

Both runs produce their own subdirectories under experiments/20-eval-report/. Lab 03 compares them.

Constraints

  • Greedy / temperature-0 only. No sampling in Lab 01. Lab 02 adds confidence-based metrics; Lab 03 (or Phase 21) does sampling.
  • CPU-only is fine. N ≈ 80 probes; PPL test set ≈ 200 sentences. Should run in < 60 seconds on Borja's i5-8250U.
  • Deterministic. Same checkpoint + same probe set + same seed = byte-identical results.json.
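
Byte-identical output depends on serialization as much as on seeding. One possible approach, sketched under the assumption that six decimal places is enough precision for the report (`dump_results` and the rounding choice are illustrative, not prescribed by the lab):

```python
import json


def dump_results(results: dict) -> str:
    """Serialize with a fixed key order and fixed float precision."""

    def normalize(value):
        if isinstance(value, float):
            return round(value, 6)  # assumed precision; hides last-bit float noise
        if isinstance(value, dict):
            return {k: normalize(v) for k, v in value.items()}
        if isinstance(value, (list, tuple)):
            return [normalize(v) for v in value]
        return value

    # sort_keys removes any dependence on dict insertion order.
    text = json.dumps(normalize(results), sort_keys=True, indent=2,
                      ensure_ascii=False)
    return text + "\n"
```

Anything that legitimately changes between runs, such as the eval timestamp, belongs in manifest.json rather than results.json.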

Stop conditions

Done when:

  1. pytest tests/eval/test_harness.py -v passes.
  2. just eval CHECKPOINT=... produces results.json, per_slice.csv, per_slice.png under experiments/20-eval-report/<name>/.
  3. The per-slice bar chart shows the EN/ES split, regular/irregular split, per-tense, per-person — four panels — and the relative differences pass the eyeball test (regular > irregular expected; EN > ES if corpus is EN-heavy).
  4. Re-running the same command gives byte-identical results.json.

Pitfalls

  • Off-by-one in candidate-token position. Row i of the logits predicts the token at position i+1 of the input. Score the candidate by log_probs[len(prefix)-1, cand_ids[0]] + log_probs[len(prefix), cand_ids[1]] + ..., using log-softmaxed values rather than raw logits. Get this wrong and you'll score the prompt tokens, not the candidate.
  • Tokenizer adds a BOS token. If tokenizer.encode prepends <bos>, your prefix length is +1 and the slicing is off. Match what Phase 11 produced.
  • Padding affects PPL. Pad tokens must be masked out of the NLL sum. Easier: loop per sentence without batching for the first pass; vectorize once correctness is verified.
  • Spanish probes break tokenization. If your BPE (Phase 11) was EN-only, ES probes will tokenize poorly and model_oov count will spike. Decide in the report whether to retrain the tokenizer (no — that invalidates Phase 18 results) or report the limitation.
  • Per-slice cell sizes too small for Wilson CI. With ≥ 20 verbs and only ~60 probes, the per-verb slice has 2-4 probes per verb. At that n the interval covers most of [0.0, 1.0] (3 of 3 correct gives roughly [0.44, 1.0]), which is too wide to be useful. Phase 20 reports per-verb as a table of raw counts, not as percentages with CIs.
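
The first two pitfalls above (off-by-one and BOS) can be caught with a small index helper that you unit-test without a model. This is a sketch: `candidate_rows`, the `BOS` id, and the toy token ids are all hypothetical.

```python
def candidate_rows(prefix_ids: list[int], full_ids: list[int]) -> list[tuple[int, int]]:
    """Return (row, token_id) pairs for the tokens beyond the prefix.

    Row r of the (T, V) log-prob matrix is the distribution over the token
    at input position r + 1, so the token at position pos is scored at row
    pos - 1.
    """
    if not prefix_ids:
        raise ValueError("need at least one prefix token (e.g. <bos>) to condition on")
    # Catches a BOS mismatch, and also BPE merges that cross the blank boundary.
    if full_ids[:len(prefix_ids)] != prefix_ids:
        raise ValueError("prefix is not a token-level prefix of the full sequence")
    start = len(prefix_ids)
    return [(pos - 1, full_ids[pos]) for pos in range(start, len(full_ids))]


BOS = 1                        # hypothetical id
prefix = [BOS, 17]             # "<bos> She"
full = [BOS, 17, 42, 43]       # "<bos> She" + two candidate tokens
pairs = candidate_rows(prefix, full)   # [(1, 42), (2, 43)]
```

Encoding the prefix without BOS while the full sentence has it (prefix = [17]) raises here instead of silently shifting every score by one position.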

When to consult solutions/

After all five tests pass and at least one just eval run produces a non-empty per_slice.png. The solution at solutions/01-harness-ref.md (written at phase open) covers the length-normalization gotcha and the batched-PPL path.


Next lab: lab/02-calibration-and-adversarial.md.