Lab 02: Drift detection (KL + PSI on the verb-token distribution)
Goal: implement `scripts/mlops/drift.py` and demonstrate that it trips on a known shift of the verb-token distribution.
Estimated time: 3–4 hours.
Prereq: lab 00 done; the Phase 12 verb-corpus + its training token histogram are accessible via DVC.
What you produce
`experiments/38-drift-detection/` containing:
- `inject.py`: script that constructs the synthetic shift over the verb grid.
- `measure.py`: computes KL and PSI on the baseline and shifted distributions.
- `results.json`: KL/PSI at varying perturbation levels, plus per-bucket grid-PSI.
- `sensitivity.png`: KL and PSI vs perturbation magnitude (matplotlib).
- `manifest.json`.
- `README.md`.
The scenario
You have:
- \(P\) = the Phase 12 training-token histogram for the verb corpus (a vocabulary of around 2,000 BPE tokens; counts summing to a few million).
- \(P_{\text{grid}}\) = the training-time (tense × person) cell-frequency table (5 × 3 = 15 cells).
- \(Q_0\) = a clean live-traffic sample from the same distribution (1,000 tokens drawn from \(P\)).
- \(Q_\rho\) = a shifted sample: 1,000 tokens, of which a fraction \(\rho \in \{0, 0.05, 0.10, 0.15, 0.25, 0.50\}\) have been replaced with tokens from a shifted tense regime (e.g., when training was present-simple-heavy, the perturbed sample is past-participle-heavy).
Compute KL and PSI (both token-PSI and grid-PSI) for \(P\) vs each \(Q_\rho\). Confirm the metrics rise monotonically with \(\rho\) and cross the documented thresholds (PSI 0.10, 0.25) at sensible \(\rho\) values.
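For reference, here are the two metrics in their standard textbook form, with the training distribution \(P\) first and natural logarithms throughout. The bucket notation \(P_b, Q_b\) (the mass of \(P\) and \(Q\) in bucket \(b\) of \(B\)) and the additive form of the \(\alpha\) smoothing are assumptions made here for illustration; `theory/03` remains the authority on both.

\[
D_{KL}(P \| Q) = \sum_{t \in V} P(t) \ln \frac{P(t)}{Q(t)},
\qquad
\mathrm{PSI}(P, Q) = \sum_{b=1}^{B} (Q_b - P_b) \ln \frac{Q_b}{P_b}
\]

\[
\tilde{P}(t) = \frac{P(t) + \alpha}{1 + \alpha |V|}
\]

Expanding the PSI sum gives \(D_{KL}(P_b \| Q_b) + D_{KL}(Q_b \| P_b)\) over the bucketed distributions, so PSI is symmetric in its arguments while KL is not.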
TODOs
Block A: assemble \(P\) and \(P_{\text{grid}}\)
- Run `dvc pull data/processed/train.jsonl.dvc` to fetch the Phase 12 corpus.
- Tokenize with the Phase 11 BPE tokenizer. Build a length-\(|V|\) count vector `P_counts` and normalize it to a probability vector `P` (with \(\alpha = 10^{-6}\) smoothing per `theory/03`).
- Build `P_grid`: a 5 × 3 = 15-cell frequency table classifying each sentence in the corpus by its primary verb tense and person. (Use the labels in the corpus manifest; `gen_corpus.py` from Phase 12 emits these.)
- Persist `P` as `baseline_histogram.json` and `P_grid` as `baseline_grid.json` in the experiment directory. They will be reused by Phase 39.
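The steps in Block A can be sketched as follows. This is a minimal illustration on toy data, not the reference solution: the helper names, the additive smoothing (normalize, add \(\alpha\), renormalize), and the JSON layout are all assumptions, and the real token ids and (tense, person) labels would come from the Phase 11 tokenizer and the corpus manifest.

```python
import json

import numpy as np

ALPHA = 1e-6  # smoothing constant quoted in the lab text


def smoothed_distribution(counts, alpha=ALPHA):
    """Normalize counts to float64 probabilities, add alpha, renormalize.

    Assumption: additive smoothing; theory/03 may define it differently.
    """
    counts = np.asarray(counts, dtype=np.float64)
    probs = counts / counts.sum()
    probs = probs + alpha
    return probs / probs.sum()


def token_histogram(token_ids, vocab_size):
    """Count vector of length |V|; tokens that never occur keep a zero slot."""
    return np.bincount(np.asarray(token_ids, dtype=np.int64), minlength=vocab_size)


def grid_table(tense_ids, person_ids, n_tenses=5, n_persons=3):
    """5 x 3 table of sentence counts indexed by (tense, person) labels."""
    table = np.zeros((n_tenses, n_persons), dtype=np.int64)
    np.add.at(table, (np.asarray(tense_ids), np.asarray(person_ids)), 1)
    return table


# Toy stand-ins for the tokenized corpus and its manifest labels.
token_ids = [0, 0, 1, 3, 3, 3]
P_counts = token_histogram(token_ids, vocab_size=5)
P = smoothed_distribution(P_counts)
P_grid = smoothed_distribution(grid_table([0, 0, 4, 2], [0, 1, 2, 1]).ravel())

# Hypothetical JSON layout for baseline_histogram.json.
baseline_json = json.dumps({"alpha": ALPHA, "probs": P.tolist()})
```

Storing \(\alpha\) next to the probabilities is one way to let a later phase confirm it is comparing like with like.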
Block B: construct \(Q_\rho\)
- Sample 1,000 tokens from \(P\); that is \(Q_0\).
- Construct an OOD distribution `R` over the same vocabulary, biased toward tokens associated with a shifted tense (e.g., the past-participle suffixes `-ed` and `-en`, plus the irregular past participles `gone`, `been`, `done`, `written`, `seen`, `come`, `eaten`). If the training distribution was present-simple-heavy, this `R` simulates a sudden influx of past-participle queries.
- For each \(\rho\), build \(Q_\rho\): a 1,000-token sample where each position is independently drawn from \(P\) with probability \(1-\rho\) and from \(R\) with probability \(\rho\).
- Similarly construct a `Q_grid_rho` per perturbation level: a 15-cell histogram where the past-participle cells receive \(\rho\) extra mass.
- Persist each \(Q_\rho\) as `q_rho_<rho>.json`.
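One possible shape for the Block B sampling is sketched below on a toy 50-token Zipf-like vocabulary. Everything named here is hypothetical: the `boost` factor used to build `R`, the choice of shifted token ids, the seed, and the reading of "\(\rho\) extra mass" as a mixture of `P_grid` with a distribution concentrated on the shifted cells.

```python
import numpy as np


def build_ood_distribution(P, shifted_token_ids, boost=50.0):
    """Hypothetical R: upweight tokens tied to the shifted tense, renormalize."""
    weights = np.asarray(P, dtype=np.float64).copy()
    weights[np.asarray(shifted_token_ids)] *= boost
    return weights / weights.sum()


def sample_q_rho(P, R, rho, n_tokens, rng):
    """Draw each position from R with probability rho, otherwise from P."""
    vocab_size = len(P)
    from_r = rng.random(n_tokens) < rho
    draws_p = rng.choice(vocab_size, size=n_tokens, p=P)
    draws_r = rng.choice(vocab_size, size=n_tokens, p=R)
    tokens = np.where(from_r, draws_r, draws_p)
    return np.bincount(tokens, minlength=vocab_size)


def shift_grid(P_grid, shifted_cells, rho):
    """Assumed reading of 'rho extra mass': mix P_grid with a distribution
    spread uniformly over the shifted cells."""
    P_grid = np.asarray(P_grid, dtype=np.float64)
    R_grid = np.zeros_like(P_grid)
    R_grid[np.asarray(shifted_cells)] = 1.0 / len(shifted_cells)
    return (1.0 - rho) * P_grid + rho * R_grid


RHOS = (0.0, 0.05, 0.10, 0.15, 0.25, 0.50)
SEED = 38             # hypothetical; record the real seed in manifest.json
SHIFTED = [40, 41, 42]  # toy ids standing in for past-participle tokens

ranks = np.arange(1, 51, dtype=np.float64)
P = (1.0 / ranks) / (1.0 / ranks).sum()
R = build_ood_distribution(P, SHIFTED)

q_counts = {}
for rho in RHOS:
    # A fresh generator with the same seed per rho reuses the same draws,
    # so only the P-versus-R mask changes between perturbation levels.
    q_counts[rho] = sample_q_rho(P, R, rho, 1000, np.random.default_rng(SEED))
```

Reseeding per \(\rho\) is a design choice rather than a requirement: it keeps sampling noise common across levels, which makes the sensitivity curve easier to read.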
Block C: compute KL and PSI
- Implement `kl_divergence(P, Q)` in `scripts/mlops/drift.py`. Use NumPy. Apply \(\alpha = 10^{-6}\) smoothing to both \(P\) and \(Q\).
- Implement `psi(P, Q)` in the same file. Use the frequency-stratified bucketing from `theory/03`: the top 20 tokens in 5 buckets of 4, the next 200 in 5 buckets of 40, and the remaining tail in 5 buckets of ~360. That gives a 15-bucket token-PSI for the verb corpus.
- Implement `grid_psi(P_grid, Q_grid)`: PSI over the 15 (tense, person) cells.
- For each \(\rho\) in \(\{0, 0.05, 0.10, 0.15, 0.25, 0.50\}\), compute `kl_value = kl_divergence(P, Q_rho)`, `token_psi_value = psi(P, Q_rho)`, and `grid_psi_value = grid_psi(P_grid, Q_grid_rho)`. Bind the results to names that differ from the functions, so that a loop over \(\rho\) does not overwrite `psi` or `grid_psi`. Save to `results.json`.
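A minimal NumPy sketch of the three Block C functions follows. It assumes additive smoothing, ranks tokens by their training frequency to form the 5 + 5 + 5 tiers, and smooths PSI at the bucket level; the helpers `_smooth` and `stratified_buckets` are hypothetical names, and `theory/03` may specify any of these details differently.

```python
import numpy as np

ALPHA = 1e-6  # smoothing constant quoted in the lab text


def _smooth(p, alpha=ALPHA):
    """Normalize to float64 probabilities, add alpha, renormalize."""
    p = np.asarray(p, dtype=np.float64)
    p = p / p.sum() + alpha
    return p / p.sum()


def kl_divergence(P, Q, alpha=ALPHA):
    """D_KL(P || Q) = sum P * log(P / Q), training distribution first."""
    p, q = _smooth(P, alpha), _smooth(Q, alpha)
    return float(np.sum(p * np.log(p / q)))


def stratified_buckets(P, tiers=((20, 5), (200, 5), (None, 5))):
    """Map each token to a bucket: rank by P, then split each tier evenly."""
    order = np.argsort(-np.asarray(P, dtype=np.float64), kind="stable")
    bucket_of = np.empty(len(order), dtype=np.int64)
    start, next_bucket = 0, 0
    for size, n_buckets in tiers:
        stop = len(order) if size is None else min(start + size, len(order))
        for chunk in np.array_split(order[start:stop], n_buckets):
            bucket_of[chunk] = next_bucket
            next_bucket += 1
        start = stop
    return bucket_of, next_bucket


def _psi_terms(p, q):
    return float(np.sum((q - p) * np.log(q / p)))


def psi(P, Q, alpha=ALPHA):
    """Token-PSI over the frequency-stratified buckets defined by P."""
    bucket_of, n_buckets = stratified_buckets(P)
    p = np.asarray(P, dtype=np.float64)
    q = np.asarray(Q, dtype=np.float64)
    p_b = np.bincount(bucket_of, weights=p / p.sum(), minlength=n_buckets)
    q_b = np.bincount(bucket_of, weights=q / q.sum(), minlength=n_buckets)
    return _psi_terms(_smooth(p_b, alpha), _smooth(q_b, alpha))


def grid_psi(P_grid, Q_grid, alpha=ALPHA):
    """PSI over the flattened (tense, person) cells."""
    return _psi_terms(_smooth(np.ravel(P_grid), alpha),
                      _smooth(np.ravel(Q_grid), alpha))
```

Both `P` and `Q` may be passed as raw counts or as probabilities, since each function normalizes before smoothing. Because the buckets are derived from the first argument, `psi(P, Q)` and `psi(Q, P)` use different bucketings in this sketch; keeping the baseline first avoids that ambiguity.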
Block D: plot and interpret
- Plot KL, token-PSI, and grid-PSI vs \(\rho\) on the same x-axis, with the PSI thresholds (0.10, 0.25) as horizontal lines.
- Identify (and document in `README.md`): at what \(\rho\) does each metric cross 0.10? 0.25? Which one trips first? For a shift concentrated in one tense (past participle), grid-PSI should trip before token-PSI; verify this.
- Confirm all three metrics are monotonically increasing in \(\rho\). If they are not, the smoothing, the bucketing, or the OOD distribution `R` is wrong; debug before continuing.
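The monotonicity check and the threshold crossings for the README table can be computed with two small helpers, sketched here. Both function names are hypothetical, the crossing is estimated by linear interpolation between the two neighbouring \(\rho\) values (an assumption; reporting the first grid point at or above the threshold would also be defensible), and the curve in the usage lines is a placeholder, not a measured result.

```python
import numpy as np


def is_non_decreasing(values, tol=0.0):
    """True if each value is at least the previous one (within tol)."""
    values = np.asarray(values, dtype=np.float64)
    return bool(np.all(np.diff(values) >= -tol))


def first_crossing(rhos, values, threshold):
    """Smallest rho at which the metric reaches the threshold.

    Linearly interpolated between grid points; None if never reached.
    """
    rhos = np.asarray(rhos, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    above = np.nonzero(values >= threshold)[0]
    if above.size == 0:
        return None
    i = int(above[0])
    if i == 0:
        return float(rhos[0])
    x0, x1 = rhos[i - 1], rhos[i]
    y0, y1 = values[i - 1], values[i]  # y0 < threshold <= y1, so y1 > y0
    return float(x0 + (threshold - y0) * (x1 - x0) / (y1 - y0))


# Placeholder curve for illustration only; not measured lab results.
rhos = [0.0, 0.05, 0.10, 0.15, 0.25, 0.50]
placeholder_psi = [0.0, 0.04, 0.12, 0.20, 0.40, 0.90]

monotone = is_non_decreasing(placeholder_psi)
rho_010 = first_crossing(rhos, placeholder_psi, 0.10)
rho_025 = first_crossing(rhos, placeholder_psi, 0.25)
```

Running the same two calls on each of the three metric curves from `results.json` yields the per-metric \(\rho_{0.10}\) and \(\rho_{0.25}\) entries directly.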
Block E: sample-size sanity check
- Repeat the \(\rho = 0.15\) measurement with \(Q\) sizes of \(\{100, 1{,}000, 10{,}000, 100{,}000\}\) tokens. Plot KL, token-PSI, and grid-PSI vs \(Q\) size at fixed \(\rho\).
- Confirm all three metrics stabilize for \(|Q| \geq 1{,}000\) and are noisy below. This is the lower-bound rationale for the production drift cadence.
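One way to quantify "stabilize" versus "noisy" is to repeat the draw under several seeds at each size and compare the spread. The sketch below does this for a plain (unbucketed) PSI on a toy 15-cell distribution; the distributions, the number of repeats, and the helper names are assumptions for illustration, and a multinomial draw stands in for sampling tokens one by one.

```python
import numpy as np

ALPHA = 1e-6  # smoothing constant quoted in the lab text


def plain_psi(P, q_counts, alpha=ALPHA):
    """PSI between a baseline distribution and a count vector, smoothed."""
    p = np.asarray(P, dtype=np.float64)
    q = np.asarray(q_counts, dtype=np.float64)
    p = p / p.sum() + alpha
    q = q / q.sum() + alpha
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum((q - p) * np.log(q / p)))


def psi_spread(P, Q_true, sizes, n_repeats=20):
    """Mean and standard deviation of PSI across seeded resamples, per size."""
    summary = {}
    for n in sizes:
        values = []
        for seed in range(n_repeats):
            rng = np.random.default_rng(seed)
            values.append(plain_psi(P, rng.multinomial(n, Q_true)))
        summary[n] = (float(np.mean(values)), float(np.std(values)))
    return summary


# Toy 15-cell baseline and a rho = 0.15 shift toward the last three cells.
P_toy = np.arange(1, 16, dtype=np.float64) / 120.0
R_toy = np.zeros(15)
R_toy[[12, 13, 14]] = 1.0 / 3.0
Q_true = 0.85 * P_toy + 0.15 * R_toy

summary = psi_spread(P_toy, Q_true, sizes=(100, 1_000, 10_000, 100_000))
```

Plotting the mean with an error bar of one standard deviation per size shows at a glance where the estimate settles.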
Block F: Justfile recipe + manifest + README
- Add `just drift-check` to invoke `scripts/mlops/run_drift_check.py` on the live `requests.log` from a Phase 33/34 run, comparing against `baseline_histogram.json` + `baseline_grid.json`. The recipe writes a `drift_reports/YYYY-MM-DD.json` and exits non-zero if any PSI ≥ 0.25.
- `manifest.json` lists: seed, Phase 11 tokenizer SHA, Phase 12 corpus DVC hash, list of \(\rho\) values, smoothing \(\alpha\), PSI bucket scheme.
- `README.md` (300–500 words):
    - The threshold table for this corpus: \(\rho_{0.10}, \rho_{0.25}\) per metric.
    - The sample-size lower bound.
    - One paragraph: why grid-PSI tripped before token-PSI (or why it did not; record either way).
    - One paragraph: what the next 24h of production drift detection would do with these calibrated thresholds.
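The report-and-exit-code behaviour of the drift check could look roughly like the sketch below. The function name, the report fields, and the convention that PSI metrics are the keys ending in `psi` are all hypothetical; only the `drift_reports/YYYY-MM-DD.json` path pattern and the 0.25 failure threshold come from the lab text.

```python
import json
from datetime import date
from pathlib import Path

PSI_FAIL_THRESHOLD = 0.25  # from the lab text


def write_drift_report(metrics, report_dir="drift_reports", today=None):
    """Write <report_dir>/YYYY-MM-DD.json and return (path, exit_code).

    exit_code is 1 if any PSI metric is at or above the threshold, else 0.
    Assumption: PSI metrics are the entries whose key ends in 'psi'.
    """
    today = today or date.today()
    tripped = sorted(
        name for name, value in metrics.items()
        if name.endswith("psi") and value >= PSI_FAIL_THRESHOLD
    )
    report = {
        "date": today.isoformat(),
        "threshold": PSI_FAIL_THRESHOLD,
        "metrics": metrics,
        "tripped": tripped,
    }
    path = Path(report_dir) / f"{today.isoformat()}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(report, indent=2, sort_keys=True))
    return path, (1 if tripped else 0)
```

A thin CLI wrapper would compute the metrics from the live log, call this function, and pass the returned code to `sys.exit`, so that the `just` recipe fails when a threshold trips.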
Constraints
- NumPy only. No SciPy `entropy` or `psi` functions; write the formulas yourself. (Verifying against SciPy post hoc is fine.)
- Deterministic samples. Use a seeded `np.random.default_rng(seed)`. The same lab run from the same seed must produce identical KL and PSI values.
- No `mlflow.log_metric` for tracking; write to `manifest.json` and `results.json` only. MLflow tracking enters in the CI gate (lab 04), not in offline drift analysis.
- No new `src/<module>/`. `drift.py` lives in `scripts/mlops/`. The `drift_check` CLI is a thin wrapper.
Stop conditions
Done when:
- KL, token-PSI, and grid-PSI all monotonically increase with \(\rho\).
- PSI 0.10 and 0.25 thresholds cross at \(\rho\) values that look reasonable (typically \(\rho_{0.10} \approx 0.05\)–\(0.10\) and \(\rho_{0.25} \approx 0.15\)–\(0.25\) for this corpus shape, but the lab will calibrate).
- The sample-size sanity check shows stabilization above \(|Q| = 1{,}000\).
- `just drift-check` runs end-to-end on a synthetic live log and produces a report.
- `manifest.json` and `README.md` are committed.
Pitfalls
- \(\log(0)\). If any \(Q(t) = 0\) for \(t\) with \(P(t) > 0\), KL diverges. Smoothing must be applied to both distributions, every time.
- Wrong direction. \(D_{KL}(P \| Q) \neq D_{KL}(Q \| P)\). Use the "training first" convention from `theory/03`. Check that the formula matches `sum P * log(P / Q)`, not the reverse.
- Bucket size choice. Equal-width PSI buckets over a power-law verb-token distribution put almost all mass in the rare-token buckets. Use the 15-bucket frequency-stratified scheme from Block C, not naive equal-width.
- Vocabulary mismatch. If the OOD tokens are not in \(V\), you can't represent them. Map all OOD tokens to a special `[UNK]` id (or to existing rare tokens). Sanity-check: every token in \(Q_\rho\) has a count slot in your \(|V|\)-length vector.
- Float precision. Smoothed probabilities are tiny (\(\sim 10^{-7}\)). Use `float64` throughout. `float32` keeps only about 7 significant decimal digits, so terms this small are distorted or lost when added to larger probabilities.
- Conflating grid cells. The (tense, person) grid has exactly 15 cells per §A13. Do not add a "negation" or "interrogative" cell; those are future-work expansions. Stick to the corpus shape.
When to consult solutions/
After all six blocks are done. solutions/02-drift-ref.md (phase open) reviews your bucketing scheme, smoothing, and the calibrated thresholds.
Next lab: lab/03-finops-table.md.