
Lab 02 — Drift detection (KL + PSI on the verb-token distribution)

Goal: implement scripts/mlops/drift.py and demonstrate it trips on a known shift of the verb-token distribution.

Estimated time: 3–4 hours.

Prereq: lab 00 done; the Phase 12 verb-corpus + its training token histogram are accessible via DVC.

What you produce

experiments/38-drift-detection/ containing:

inject.py — script that constructs the synthetic shift over the verb grid.
measure.py — computes KL and PSI on baseline and shifted distributions.
results.json — KL/PSI at varying perturbation levels, plus per-bucket grid-PSI.
sensitivity.png — KL & PSI vs perturbation magnitude (matplotlib).
manifest.json.
README.md.

The scenario

You have:

\(P\) = the Phase 12 training-token histogram for the verb corpus (around 2,000 distinct token types after BPE, with counts summing to a few million).
\(P_{\text{grid}}\) = the training-time (tense × person) cell-frequency table (5 × 3 = 15 cells).
\(Q_0\) = a clean live-traffic sample from the same distribution (1,000 tokens drawn from \(P\)).
\(Q_\rho\) = a shifted sample: 1,000 tokens, of which a fraction \(\rho \in \{0, 0.05, 0.10, 0.15, 0.25, 0.50\}\) have been replaced with tokens from a shifted tense regime (e.g., when training was present-simple-heavy, the perturbed sample is past-participle-heavy).

Compute KL and PSI (both token-PSI and grid-PSI) for \(P\) vs each \(Q_\rho\). Confirm the metrics rise monotonically with \(\rho\) and cross the documented thresholds (PSI 0.10, 0.25) at sensible \(\rho\) values.

TODOs

Block A — assemble \(P\) and \(P_{\text{grid}}\)

Run dvc pull data/processed/train.jsonl.dvc to fetch the Phase 12 corpus.
Tokenize with the Phase 11 BPE tokenizer. Build a length-\(|V|\) count vector P_counts and normalize to a probability vector P (with \(\alpha = 10^{-6}\) smoothing per theory/03).
Build P_grid: a 5 × 3 = 15-cell frequency table classifying each sentence in the corpus by its primary verb tense and person. (Use the labels in the corpus manifest — gen_corpus.py from Phase 12 emits these.)
Persist P as baseline_histogram.json and P_grid as baseline_grid.json in the experiment directory. They will be re-used by Phase 39.
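
The smoothing and persistence steps above can be sketched as follows. This is a minimal illustration, not the reference implementation: it assumes additive smoothing applied to the counts before normalizing (theory/03 may define the convention differently), and the toy P_counts vector and the JSON keys "alpha" and "probs" are hypothetical.

```python
import json
import numpy as np

ALPHA = 1e-6  # smoothing constant from the lab text


def smooth_normalize(counts, alpha=ALPHA):
    """Add alpha to every count, then divide by the smoothed total (assumed convention)."""
    smoothed = np.asarray(counts, dtype=np.float64) + alpha
    return smoothed / smoothed.sum()


# Toy stand-in for P_counts; the real vector has one slot per token id in V.
P_counts = np.array([5_000, 3_000, 1_500, 500, 0, 0])
P = smooth_normalize(P_counts)

# Round-trip through JSON the way baseline_histogram.json could be written and read.
payload = json.dumps({"alpha": ALPHA, "probs": P.tolist()})
restored = np.array(json.loads(payload)["probs"], dtype=np.float64)
```

Storing alpha next to the probabilities lets a later phase confirm that it is comparing against a baseline smoothed the same way.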

Block B — construct \(Q_\rho\)

Sample 1,000 tokens from \(P\) → that's \(Q_0\).
Construct an OOD distribution R over the same vocabulary, biased toward tokens associated with a shifted tense (e.g., past-participle suffixes -ed, -en, plus the irregular past participles gone, been, done, written, seen, come, eaten). If the training distribution was present-simple-heavy, this R simulates a sudden influx of past-participle queries.
For each \(\rho\), build \(Q_\rho\): a 1,000-token sample where each position is independently drawn from \(P\) with probability \(1-\rho\) and from \(R\) with probability \(\rho\).
Similarly, construct a Q_grid_rho for each perturbation level: a 15-cell histogram in which the past-participle cells receive \(\rho\) extra mass (renormalize afterward so the cells still sum to 1).
Persist each \(Q_\rho\) as q_rho_<rho>.json.
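
One way to sketch the mixture sampling described in this block is shown below. The function names sample_mixture and mix_grid, the toy distributions, and the choice of flat cell ids 12 to 14 as the past-participle cells are all assumptions made for illustration. The grid helper spreads the extra mass evenly over the shifted cells and renormalizes, which is only one plausible reading of "receive \(\rho\) extra mass".

```python
import numpy as np


def sample_mixture(P, R, rho, n_tokens, rng):
    """Count vector for n_tokens draws: each position uses R with probability rho, else P."""
    from_R = rng.random(n_tokens) < rho
    draws_P = rng.choice(len(P), size=n_tokens, p=P)
    draws_R = rng.choice(len(R), size=n_tokens, p=R)
    tokens = np.where(from_R, draws_R, draws_P)
    return np.bincount(tokens, minlength=len(P))


def mix_grid(P_grid, shifted_cells, rho):
    """Add rho of mass spread evenly over shifted_cells (flat ids), then renormalize."""
    q = np.asarray(P_grid, dtype=np.float64).ravel().copy()
    q[shifted_cells] += rho / len(shifted_cells)
    return q / q.sum()


# Toy 4-token vocabulary; the real P and R have one entry per token id in V.
P_toy = np.array([0.50, 0.30, 0.15, 0.05])
R_toy = np.array([0.05, 0.05, 0.10, 0.80])

# The same seed gives the same sample, which the determinism constraint requires.
q_a = sample_mixture(P_toy, R_toy, rho=0.25, n_tokens=1_000, rng=np.random.default_rng(38))
q_b = sample_mixture(P_toy, R_toy, rho=0.25, n_tokens=1_000, rng=np.random.default_rng(38))

q_grid = mix_grid(np.full((5, 3), 1 / 15), shifted_cells=[12, 13, 14], rho=0.15)
```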

Block C — compute KL and PSI

Implement kl_divergence(P, Q) in scripts/mlops/drift.py. Use NumPy. Apply \(\alpha = 10^{-6}\) smoothing to both \(P\) and \(Q\).
Implement psi(P, Q) in the same file. Use the frequency-stratified bucketing from theory/03: top 20 tokens in 5 buckets of 4, next 200 in 5 buckets of 40, remaining tail in 5 buckets of ~360. That's a 15-bucket token-PSI for the verb corpus.
Implement grid_psi(P_grid, Q_grid): PSI over the (tense, person) cells. 15 cells.
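For each \(\rho\) in \(\{0, 0.05, 0.10, 0.15, 0.25, 0.50\}\): compute kl_val = kl_divergence(P, Q_rho), psi_val = psi(P, Q_rho), grid_psi_val = grid_psi(P_grid, Q_grid_rho). (Do not assign the results to the names psi or grid_psi, which would shadow the functions.) Save to results.json.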
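
The three functions above might look roughly like the sketch below. It is written under assumptions that theory/03 may not share: smoothing is applied to normalized probabilities (for PSI, to the bucket masses), PSI uses the common form sum (Q_b - P_b) * ln(Q_b / P_b), and tokens are ranked by their baseline probability. The helper names _smooth, stratified_buckets, and psi_from_masses are hypothetical.

```python
import numpy as np

ALPHA = 1e-6  # smoothing constant from the lab text


def _smooth(x, alpha=ALPHA):
    """Normalize counts or probabilities, add alpha, renormalize (assumed convention)."""
    x = np.asarray(x, dtype=np.float64).ravel()
    x = x / x.sum() + alpha
    return x / x.sum()


def kl_divergence(P, Q, alpha=ALPHA):
    """D_KL(P || Q) = sum P * log(P / Q), training distribution first."""
    p, q = _smooth(P, alpha), _smooth(Q, alpha)
    return float(np.sum(p * np.log(p / q)))


def stratified_buckets(P, head=20, mid=200, parts=5):
    """Bucket id per token: rank by P, split head / mid / tail into `parts` buckets each."""
    order = np.argsort(-np.asarray(P, dtype=np.float64), kind="stable")
    bucket_of = np.empty(len(order), dtype=np.int64)
    tiers = [order[:head], order[head:head + mid], order[head + mid:]]
    for tier_idx, tier in enumerate(tiers):
        for part_idx, chunk in enumerate(np.array_split(tier, parts)):
            bucket_of[chunk] = tier_idx * parts + part_idx
    return bucket_of


def psi_from_masses(p_mass, q_mass, alpha=ALPHA):
    """PSI = sum (q - p) * log(q / p) over smoothed bucket masses."""
    p, q = _smooth(p_mass, alpha), _smooth(q_mass, alpha)
    return float(np.sum((q - p) * np.log(q / p)))


def psi(P, Q, alpha=ALPHA):
    """Token-PSI over the 15 frequency-stratified buckets defined by the baseline P."""
    bucket_of = stratified_buckets(P)
    n_buckets = int(bucket_of.max()) + 1
    p_mass = np.bincount(bucket_of, weights=np.asarray(P, dtype=np.float64), minlength=n_buckets)
    q_mass = np.bincount(bucket_of, weights=np.asarray(Q, dtype=np.float64), minlength=n_buckets)
    return psi_from_masses(p_mass, q_mass, alpha)


def grid_psi(P_grid, Q_grid, alpha=ALPHA):
    """PSI over the (tense, person) cells, one bucket per cell."""
    return psi_from_masses(P_grid, Q_grid, alpha)


# Toy Zipf-like baseline over 2,000 token ids, with a shift placed in the tail.
ranks = np.arange(1, 2001)
P_z = 1.0 / ranks
P_z /= P_z.sum()
R_z = np.zeros(2000)
R_z[1500:1520] = 1 / 20
Q_z = 0.85 * P_z + 0.15 * R_z
kl_val, psi_val = kl_divergence(P_z, Q_z), psi(P_z, Q_z)
```

Deriving the bucket assignment from the baseline only keeps the bucket boundaries fixed across all \(\rho\) values, so changes in PSI reflect the shift rather than a re-bucketing.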

Block D — plot and interpret

Plot KL, token-PSI, and grid-PSI vs \(\rho\) on the same x-axis, with the PSI thresholds (0.10, 0.25) as horizontal lines.
Identify (and document in README.md): at what \(\rho\) does each metric cross 0.10? 0.25? Which one trips first? For a shift concentrated in one tense (past participle), grid-PSI should trip before token-PSI — verify this.
Confirm all three metrics are monotonically increasing in \(\rho\). If they're not, the smoothing, the bucketing, or the OOD distribution R is wrong — debug before continuing.
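
The monotonicity check and the threshold-crossing lookup can be automated with two small helpers, sketched here. The function names are hypothetical, and the metric values in the usage lines are placeholders chosen to exercise the code, not measurements from this corpus.

```python
import numpy as np


def is_strictly_increasing(values):
    """True if every value is larger than the one before it."""
    return bool(np.all(np.diff(np.asarray(values, dtype=np.float64)) > 0))


def first_crossing(rhos, values, threshold):
    """Smallest rho on the grid whose metric value is >= threshold, or None."""
    rhos = np.asarray(rhos, dtype=np.float64)
    hits = np.nonzero(np.asarray(values, dtype=np.float64) >= threshold)[0]
    return float(rhos[hits[0]]) if hits.size else None


rhos = [0.0, 0.05, 0.10, 0.15, 0.25, 0.50]
placeholder_psi = [0.00, 0.04, 0.12, 0.20, 0.31, 0.90]  # illustrative only

monotone = is_strictly_increasing(placeholder_psi)
rho_010 = first_crossing(rhos, placeholder_psi, 0.10)
rho_025 = first_crossing(rhos, placeholder_psi, 0.25)
```

Because the lookup works on the \(\rho\) grid, the reported crossing is the first grid point at or above the threshold, not an interpolated value.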

Block E — sample-size sanity check

Repeat the \(\rho = 0.15\) measurement with \(Q\) sizes of \(\{100, 1{,}000, 10{,}000, 100{,}000\}\) tokens. Plot KL, token-PSI, and grid-PSI vs \(Q\) size at fixed \(\rho\).
Confirm all three metrics stabilize for \(|Q| \geq 1{,}000\) and are noisy below. This is the lower-bound rationale for the production drift cadence.
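
A sketch of the sample-size sweep, shown for a 15-cell grid at \(\rho = 0.15\), follows. It draws repeated samples at each size and reports the mean and standard deviation of PSI, which makes "noisy below" measurable. The uniform toy grid, the shifted cells, the repeat count, and the helper names are assumptions for illustration; drawing a multinomial count vector is equivalent to sampling tokens independently and counting them.

```python
import numpy as np


def psi_cells(p, q, alpha=1e-6):
    """PSI over cells, smoothing both sides after normalizing (assumed convention)."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    p = p / p.sum() + alpha
    q = q / q.sum() + alpha
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum((q - p) * np.log(q / p)))


def psi_spread(P, Q_true, sizes, repeats, seed):
    """Map each sample size to (mean PSI, std PSI) over `repeats` seeded draws."""
    rng = np.random.default_rng(seed)
    out = {}
    for n in sizes:
        vals = [psi_cells(P, rng.multinomial(n, Q_true)) for _ in range(repeats)]
        out[n] = (float(np.mean(vals)), float(np.std(vals)))
    return out


# Toy grid: uniform baseline, shift concentrated in three hypothetical cells.
P_cells = np.full(15, 1 / 15)
R_cells = np.zeros(15)
R_cells[12:15] = 1 / 3
Q_true = 0.85 * P_cells + 0.15 * R_cells

spread = psi_spread(P_cells, Q_true, sizes=(100, 1_000, 10_000, 100_000), repeats=30, seed=38)
```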

Block F — `Justfile` recipe + manifest + README

Add just drift-check to invoke scripts/mlops/run_drift_check.py on the live requests.log from a Phase 33/34 run, comparing against baseline_histogram.json + baseline_grid.json. The recipe writes a drift_reports/YYYY-MM-DD.json and exits non-zero if any PSI ≥ 0.25.
manifest.json lists: seed, Phase 11 tokenizer SHA, Phase 12 corpus DVC hash, list of \(\rho\) values, smoothing \(\alpha\), PSI bucket scheme.
README.md (300–500 words):
The threshold table for this corpus: \(\rho_{0.10}, \rho_{0.25}\) per metric.
The sample-size lower bound.
One paragraph: why grid-PSI tripped before token-PSI (or didn't — record either way).
One paragraph: what the next 24h of production drift detection would do with these calibrated thresholds.
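
The report-writing and exit-code behavior that the just drift-check recipe needs could be factored as below. This is a sketch of the pass/fail logic only; the internals of run_drift_check.py are not specified in this lab, so the function names and the metric key names ("kl", "token_psi", "grid_psi") are hypothetical.

```python
import json
from datetime import date
from pathlib import Path

PSI_FAIL = 0.25  # failure threshold from the lab text


def drift_exit_code(metrics, threshold=PSI_FAIL):
    """Return 1 if any PSI-type metric is at or above the threshold, else 0."""
    psi_values = [value for name, value in metrics.items() if name.endswith("psi")]
    return 1 if any(value >= threshold for value in psi_values) else 0


def write_report(metrics, out_dir="drift_reports", day=None):
    """Write metrics to <out_dir>/YYYY-MM-DD.json and return the path."""
    day = day or date.today()
    out_path = Path(out_dir) / f"{day.isoformat()}.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(metrics, indent=2, sort_keys=True))
    return out_path
```

A thin CLI wrapper would compute the metrics, call write_report, and finish with sys.exit(drift_exit_code(metrics)) so that the recipe fails when a PSI value reaches 0.25.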

Constraints

NumPy only. No SciPy entropy or psi functions — write the formulas yourself. (Verifying against SciPy post hoc is fine.)
Deterministic samples. Use a seeded np.random.default_rng(seed). The same lab run from the same seed must produce identical KL and PSI values.
No mlflow.log_metric for tracking — write to manifest.json and results.json only. MLflow tracking enters in the CI gate (lab 04), not in offline drift analysis.
No new src/<module>/. drift.py lives in scripts/mlops/. The drift_check CLI is a thin wrapper.

Stop conditions

Done when:

KL, token-PSI, and grid-PSI all monotonically increase with \(\rho\).
PSI 0.10 and 0.25 thresholds cross at \(\rho\) values that look reasonable (typically \(\rho_{0.10} \approx 0.05\)–\(0.10\) and \(\rho_{0.25} \approx 0.15\)–\(0.25\) for this corpus shape, but the lab run will calibrate the exact values).
The sample-size sanity check shows stabilization above \(|Q| = 1{,}000\).
just drift-check runs end-to-end on a synthetic live log and produces a report.
manifest.json and README.md are committed.

Pitfalls

\(\log(0)\). If any \(Q(t) = 0\) for \(t\) with \(P(t) > 0\), KL diverges. Smoothing must be applied to both distributions, every time.
Wrong direction. \(D_{KL}(P \| Q) \neq D_{KL}(Q \| P)\). Use the "training first" convention from theory/03. Check the formula matches sum P * log(P / Q), not the reverse.
Bucket size choice. Equal-width PSI buckets over a power-law verb-token distribution concentrate almost all of the mass in one or two buckets and leave the rest nearly empty. Use the 15-bucket frequency-stratified scheme from Block C, not naive equal-width.
Vocabulary mismatch. If the OOD tokens are not in \(V\), you can't represent them. Map all OOD tokens to a special [UNK] id (or to existing rare tokens). Sanity-check: every token in \(Q_\rho\) has a count slot in your \(|V|\)-length vector.
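Float precision. Smoothed probabilities are tiny (\(\sim 10^{-7}\)). Use float64 throughout. float32 does not underflow at this scale, but it carries only about 7 significant decimal digits, so adding values this small to ordinary probabilities loses precision.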
Conflating grid cells. The (tense, person) grid has 15 cells exactly per §A13. Don't add a "negation" or "interrogative" cell — those are future-work expansions. Stick to the corpus shape.
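
Three of the pitfalls above can be reproduced in a few lines, which is a quick way to confirm they apply to your own implementation. The helper kl_raw is a hypothetical, deliberately unsmoothed version of sum P * log(P / Q), and the toy distributions are chosen only to make the effects visible.

```python
import numpy as np


def kl_raw(p, q):
    """Unsmoothed sum P * log(P / Q); for illustration only."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    with np.errstate(divide="ignore"):
        return float(np.sum(p * np.log(p / q)))


# Wrong direction: the two orderings give different numbers.
P_toy = [0.7, 0.2, 0.1]
Q_toy = [0.4, 0.4, 0.2]
kl_pq = kl_raw(P_toy, Q_toy)
kl_qp = kl_raw(Q_toy, P_toy)

# log(0): a zero in Q where P is positive makes the unsmoothed KL infinite.
kl_unsmoothed = kl_raw([0.5, 0.3, 0.2], [0.6, 0.4, 0.0])

# Float precision: recover a 1e-7 increment after adding it to 0.5.
inc32 = float(np.float32(0.5) + np.float32(1e-7)) - 0.5
inc64 = (0.5 + 1e-7) - 0.5
```

Here inc64 stays very close to 1e-7, while inc32 is rounded to the nearest float32 step and is visibly off.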

When to consult `solutions/`

After all six blocks are done. solutions/02-drift-ref.md (phase open) reviews your bucketing scheme, smoothing, and the calibrated thresholds.

Next lab: lab/03-finops-table.md.