Lab 01: Shadow + A/B routing on the grammar-tutor endpoint
Goal: wire `src/miniserve/traffic.py` (a file added inside the existing module, not a new module) into the Phase 33 server and produce a side-by-side correction-quality report for two grammar-tutor variants.
Estimated time: 4–6 hours.
Prereq: lab 00 done; `src/miniserve/` accepts requests for the grammar-tutor endpoint (`/v1/grammar/correct`); the Phase 20 eval harness can grade outputs against the canonical conjugation table.
What you produce
`experiments/38-shadow-ab/` containing:
- `route.py`: driver that wires shadow + A/B into the serving stack.
- `traffic.log`: the routing decisions for each request (JSONL).
- `compare.py`: script that diffs A and B outputs and produces a report.
- `report.md`: the side-by-side report (conjugation accuracy, latency p50/p95/p99, diff statistics).
- `manifest.json`.
The scenario
Two registered models from lab 00:
- A = Phase 18 FP32 baseline (semver v0.1.0). Plain Mini-GPT without grammar-tutor specialization.
- B = Phase 28 LoRA grammar tutor (semver v0.3.0). LoRA adapter on top of A, trained on the verb conjugation grid.
You will:
- Configure the `/v1/grammar/correct` endpoint to shadow B against A for 200 requests drawn from the Phase 20 eval set.
- Reconfigure the same endpoint to A/B 50/50 for another 200 requests (a latency A/B, not a quality A/B; quality is graded offline against the canonical table).
- Produce a report comparing the two variants.
TODOs
Block A: shadow mode
- Add `src/miniserve/traffic.py` with the stateless router described in `theory/02`. Public API: `assign(request_id: str, route_config: RouteConfig) -> Assignment`. `RouteConfig` carries strategy + arm SHAs + salt.
- Configure `/v1/grammar/correct` so that on each incoming sentence:
  - The router decides "production_arm=A, shadow_arm=B".
  - The handler dispatches A's inference synchronously and serves A's correction to the client.
  - The handler dispatches B's inference as a background asyncio task (do not await it on the request path).
  - When B completes, the handler logs both A and B corrections, both latencies, the assignment, and the trace ID to `traffic.log` (one JSON line per request).
- Confirm via inspection: the user-facing latency is ≈ A's latency, not A + B.
- Send 200 requests sampled from the Phase 20 eval set (use `requests` or `httpx` from a small Python client). Confirm 200 entries in `traffic.log`, each with both `response_A` and `response_B`.
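The shadow flow in Block A can be sketched roughly as follows. This is an illustration under assumptions, not the Phase 33 handler: `run_model`, `handle_request`, and every log field other than `response_A` and `response_B` are hypothetical names, `asyncio.sleep` stands in for inference, and a plain list stands in for `traffic.log`.

```python
import asyncio
import json
import time

# Strong references keep background tasks from being garbage-collected mid-flight.
_background_tasks: set = set()


async def run_model(arm: str, sentence: str) -> str:
    """Hypothetical stand-in for inference; B is deliberately much slower."""
    await asyncio.sleep(0.01 if arm == "A" else 0.2)
    return f"{arm}:{sentence}"


async def _shadow_and_log(record: dict, sentence: str, log: list) -> None:
    """Run B off the request path, then append one JSON line for the request."""
    start = time.perf_counter()
    try:
        record["response_B"] = await run_model("B", sentence)
    except Exception as exc:  # B must never affect what the client sees
        record["response_B"] = None
        record["shadow_error"] = repr(exc)
    record["latency_B_ms"] = (time.perf_counter() - start) * 1000
    log.append(json.dumps(record))


async def handle_request(request_id: str, trace_id: str, sentence: str, log: list) -> str:
    start = time.perf_counter()
    response_a = await run_model("A", sentence)  # the only await on the request path
    record = {
        "request_id": request_id,
        "trace_id": trace_id,
        "assignment": {"production_arm": "A", "shadow_arm": "B"},
        "response_A": response_a,
        "latency_A_ms": (time.perf_counter() - start) * 1000,
    }
    task = asyncio.create_task(_shadow_and_log(record, sentence, log))
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)
    return response_a  # returned without waiting for B


async def demo() -> tuple:
    log: list = []
    start = time.perf_counter()
    served = await handle_request("req-1", "trace-1", "I goed home", log)
    user_latency_ms = (time.perf_counter() - start) * 1000
    await asyncio.gather(*_background_tasks)  # drain shadows before exit (demo only)
    return served, user_latency_ms, json.loads(log[0])
```

Running `asyncio.run(demo())` returns A's response, the user-facing latency, and the logged entry. The user-facing latency should track A's sleep rather than the sum of both, which is the property the "confirm via inspection" step asks for.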
Block B: A/B mode (latency-only)
- Reconfigure `/v1/grammar/correct` to A/B 50/50 via `hash(request_id, salt="prod-2026-05") mod 2`.
- Send 200 more requests. Confirm `traffic.log` now shows ~100 entries with A served and ~100 with B served (some variance expected from the hash).
- Important: record only the served response and its latency. We're not collecting quality online, only latency. Online quality A/Bs for the grammar tutor are an anti-pattern (see `theory/02`).
- Verify stickiness: send the same `request_id` twice in a row and confirm both responses come from the same arm.
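A minimal sketch of the hash-based assignment, assuming SHA-256 as the hash. Python's built-in `hash()` is randomized per process for strings unless `PYTHONHASHSEED` is fixed, so it would not give sticky assignment across workers or restarts. The field names on `RouteConfig` and `Assignment`, the `"ab"` strategy value, and the placeholder SHAs are assumptions for illustration; the sketch covers only the A/B strategy.

```python
import hashlib
from dataclasses import dataclass

ARMS = ("A", "B")


@dataclass(frozen=True)
class RouteConfig:
    strategy: str    # assumed value name: "ab"
    arm_shas: tuple  # (sha_of_A, sha_of_B); placeholders in the demo below
    salt: str


@dataclass(frozen=True)
class Assignment:
    arm: str
    sha: str


def assign(request_id: str, route_config: RouteConfig) -> Assignment:
    """Pure function of (request_id, salt) and the config: no DB, no shared state."""
    key = f"{route_config.salt}:{request_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(ARMS)
    return Assignment(arm=ARMS[index], sha=route_config.arm_shas[index])


def demo_counts(n: int = 200) -> dict:
    """Count how many of n synthetic request IDs land on each arm."""
    config = RouteConfig(strategy="ab", arm_shas=("sha-of-A", "sha-of-B"), salt="prod-2026-05")
    counts = {"A": 0, "B": 0}
    for i in range(n):
        counts[assign(f"req-{i}", config).arm] += 1
    return counts
```

For 200 requests at a 50/50 split, the per-arm count has a standard deviation of sqrt(200 · 0.5 · 0.5) ≈ 7.1, so anything from roughly 86 to 114 per arm is within two standard deviations and no cause for concern.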
Block C: offline grading and compare
- In `compare.py`:
  - For the shadow traces (Block A), grade both A and B corrections against the Phase 20 canonical conjugation table. Report conjugation_accuracy_A vs conjugation_accuracy_B.
  - Compute the diff distribution: of the 200 requests, in how many did A and B produce different corrections? Of those, how many were correct for A vs correct for B?
  - For the A/B traces (Block B), compute latency percentiles (p50, p95, p99) per arm.
- Write `report.md`:
  - Conjugation quality (from shadow). Accuracy A = X%, Accuracy B = Y%. Two-proportion z-test result. 95% CI on the diff. Per-bucket breakdown (per tense, per person, per verb).
  - Latency (from A/B). p50 A vs B, p95 A vs B, p99 A vs B. Significance via Mann-Whitney U on the latency samples.
  - Diff details. A and B disagreed on N/200 cases. Of those, B improved on M (e.g., A said "I goed" was correct, B said "I went"), regressed on K (e.g., A correctly accepted "I went", B mangled it), neutral on N-M-K.
  - Operational recommendation. Based on the above + the CpQU from lab 03 (write the placeholder; fill in after lab 03): promote B / hold B / re-train.
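The statistics named above can be sketched with the standard library alone. This is one possible shape for the `compare.py` helpers, not the reference solution: the function names and the `(response_a, response_b, expected)` row layout are assumptions, and exact string equality stands in for the Phase 20 grader.

```python
import math
import statistics


def two_proportion_ztest(correct_a, n_a, correct_b, n_b):
    """Return (z, two-sided p-value, 95% CI for accuracy_B - accuracy_A)."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se_pooled == 0:  # both arms all-correct or all-wrong
        z, p_value = 0.0, 1.0
    else:
        z = (p_b - p_a) / se_pooled
        p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    # The interval uses the unpooled standard error (Wald interval).
    se_diff = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    half_width = 1.96 * se_diff
    return z, p_value, (p_b - p_a - half_width, p_b - p_a + half_width)


def latency_percentiles(samples_ms):
    """p50/p95/p99 from the 99 cut points that split the data into 100 groups."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def diff_breakdown(rows):
    """rows: iterable of (response_a, response_b, expected) triples."""
    out = {"disagreed": 0, "improved": 0, "regressed": 0, "neutral": 0}
    for response_a, response_b, expected in rows:
        if response_a == response_b:
            continue
        out["disagreed"] += 1
        if response_b == expected:
            out["improved"] += 1   # B right, A wrong
        elif response_a == expected:
            out["regressed"] += 1  # A right, B wrong
        else:
            out["neutral"] += 1    # both wrong, in different ways
    return out
```

Two caveats for the report. The shadow traces are paired, since A and B answer the same 200 requests, so the independence assumption behind the two-proportion z-test does not strictly hold; McNemar's test on the discordant pairs is a common alternative to report alongside it. For the latency comparison, `scipy.stats.mannwhitneyu` provides Mann-Whitney U if SciPy is available in the environment.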
Block D: manifest + audit hook + observability check
- `manifest.json` lists: the two registered canonical SHAs used; the Phase 20 eval set DVC hash; the traffic config; seeds.
- Append a row to `security/THREATS.md`: shadow-routing introduces a new code path that runs B unconditionally. Confirm B's output never leaks to the client even on a B-side exception. Test with a B that always raises; A's response must still reach the client and latency must not regress.
- Confirm that `src/miniobserve/` (Phase 34) sees the `traffic.arm` span attribute on every trace. Open the Grafana dashboard from Phase 34 lab 03; filter by `traffic.arm == "shadow_B"`; confirm panels populate.
Block E: Justfile recipes
- Add `just shadow-on <production_sha> <shadow_sha>` to flip the endpoint into shadow mode by writing a `route_config.json` that `src/miniserve/` reads at startup.
- Add `just shadow-off` to revert to single-arm production.
- Document both in the lab `README.md`.
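One way to back the two recipes is a small Python helper that each recipe calls. The sketch below is an assumption throughout: the lab does not specify the `route_config.json` schema, so the `strategy`, `arms`, and `salt` keys and the `"shadow"` / `"single"` values are hypothetical, chosen to mirror what `RouteConfig` is said to carry.

```python
import json
import os
import tempfile
from pathlib import Path


def _write_atomically(path: Path, config: dict) -> None:
    """Write to a sibling temp file, then rename, so a reader never sees a partial file."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    os.replace(tmp, path)


def shadow_on(path: Path, production_sha: str, shadow_sha: str, salt: str = "prod-2026-05") -> dict:
    config = {
        "strategy": "shadow",  # hypothetical value name
        "arms": {"production": production_sha, "shadow": shadow_sha},
        "salt": salt,
    }
    _write_atomically(path, config)
    return config


def shadow_off(path: Path) -> dict:
    """Revert to single-arm production, keeping the current production SHA and salt."""
    current = json.loads(path.read_text(encoding="utf-8"))
    config = {
        "strategy": "single",  # hypothetical value name
        "arms": {"production": current["arms"]["production"]},
        "salt": current["salt"],
    }
    _write_atomically(path, config)
    return config


def demo() -> tuple:
    """Round-trip both operations in a throwaway directory."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = Path(tmp_dir) / "route_config.json"
        on = shadow_on(path, "sha-of-A", "sha-of-B")
        off = shadow_off(path)
        on_disk = json.loads(path.read_text(encoding="utf-8"))
    return on, off, on_disk
```

Because the config is read at startup, either recipe also needs to restart or reload the server for the change to take effect.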
Constraints
- Stateless router. The decision is a pure function of `(request_id, salt)` and the route config. No DB, no Redis. (Phase 33 lab 02 already enforces this.)
- Sticky assignment. A given `request_id` always lands on the same arm within a salt epoch. Verify by sending the same `request_id` twice; both must route to the same model.
- No quality A/B online. The grading is offline against the canonical Phase 20 table. If your A/B traces require expert review, the experiment is malformed.
- Run on CPU. Both models are small enough to serve from CPU on Borja's hardware. If latency is too slow to get 200 requests in a reasonable time, batch them.
- No new `src/<module>/`. Add `traffic.py` inside the existing `src/miniserve/`. Update `src/miniserve/BLUEPRINT.md` with a "Phase 38 extensions" section listing this addition.
Stop conditions
Done when:
- `traffic.log` has 400 entries (200 shadow + 200 A/B).
- `report.md` includes both quality (from shadow) and latency (from A/B) numbers with significance tests and per-bucket breakdowns.
- The threat-model row is committed.
- The shadow B-side exception test passes (B crashing does not affect A).
- The Grafana dashboard shows the `traffic.arm` slice populated.
- The operational recommendation in `report.md` is written (even if the placeholder reads "pending lab 03").
Pitfalls
- Threading the shadow computation. If you run B's inference synchronously on the same task that serves A, you've doubled latency. The shadow must be fire-and-forget (asyncio background task or thread pool). Verify by measuring user-facing latency: it should be ≈ A-only latency, not 2× A.
- Shadow exception leaks. If B raises and you `await` it on the request path, the exception bubbles up and the client sees a 500. The handler must catch all exceptions from the shadow task and log them; the client only sees A's response.
- Salt for hash routing. Use a fixed salt per environment (`prod-2026-05`). Changing the salt re-shuffles assignment and pollutes A/B comparisons.
- `request_id` reuse. If the client doesn't send a `request_id`, you must generate one server-side (UUIDv4). Don't hash the URL or input; that breaks user-level stickiness.
- Tracking IDs in logs. Phase 34's structured logging must include the `traffic.arm` field. If it doesn't, the Grafana dashboards Phase 39 will build won't be able to slice by traffic arm. Verify in the dashboard.
- Grader version drift. The Phase 20 grader must be the same SHA across both A and B grading runs. The lab's `compare.py` should assert `grader_sha_in_use == eval_baseline.json["grader_sha"]` and fail loudly otherwise.
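Two of these pitfalls reduce to a few lines of guard code. The sketch below assumes `eval_baseline.json` is a JSON object with a `grader_sha` key, as the last bullet implies; the function names and the `GraderDriftError` class are hypothetical. It raises explicitly instead of using a bare `assert` statement because Python strips `assert` when run with `-O`, which would silently disable the check.

```python
import json
import uuid
from pathlib import Path
from typing import Optional


class GraderDriftError(RuntimeError):
    """Raised when the grader in use differs from the one pinned in the baseline."""


def check_grader_pinned(grader_sha_in_use: str, baseline_path: Path) -> None:
    baseline = json.loads(baseline_path.read_text(encoding="utf-8"))
    expected = baseline["grader_sha"]
    if grader_sha_in_use != expected:
        raise GraderDriftError(
            f"grader SHA {grader_sha_in_use!r} does not match baseline {expected!r}"
        )


def ensure_request_id(client_request_id: Optional[str]) -> str:
    """Keep the client's ID when present; otherwise mint a UUIDv4 server-side."""
    if client_request_id:
        return client_request_id
    return str(uuid.uuid4())
```

A server-generated ID is only sticky if it is returned to the client and reused on later requests; minting a fresh UUID per request gives a fair split but no stickiness.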
When to consult solutions/
After all five blocks are done. solutions/01-shadow-ab-ref.md (phase open) reviews your routing design, the shadow exception-handling pattern, and the report structure.
Next lab: lab/02-drift-detection.md.