Lab 01 — Shadow + A/B routing on the grammar-tutor endpoint

Goal: wire src/miniserve/traffic.py (a file added inside the existing module, not a new module) into the Phase 33 server and produce a side-by-side correction-quality report for two grammar-tutor variants.

Estimated time: 4–6 hours.

Prereq: lab 00 done; src/miniserve/ accepts requests for the grammar-tutor endpoint (/v1/grammar/correct); Phase 20 eval harness can grade outputs against the canonical conjugation table.


What you produce

experiments/38-shadow-ab/ containing:

  • route.py — driver that wires shadow + A/B into the serving stack.
  • traffic.log — the routing decisions for each request (a JSONL).
  • compare.py — script that diffs A and B outputs and produces a report.
  • report.md — the side-by-side report (conjugation accuracy, latency p50/p95/p99, diff statistics).
  • manifest.json.

The scenario

Two registered models from lab 00:

  • A = Phase 18 FP32 baseline (semver v0.1.0). Plain Mini-GPT without grammar-tutor specialization.
  • B = Phase 28 LoRA grammar tutor (semver v0.3.0). LoRA adapter on top of A, trained on the verb conjugation grid.

You will:

  1. Configure the /v1/grammar/correct endpoint to shadow B against A for 200 requests drawn from the Phase 20 eval set.
  2. Reconfigure the same endpoint to A/B 50/50 for another 200 requests (a latency A/B, not a quality A/B — quality is graded offline against the canonical table).
  3. Produce a report comparing the two variants.

TODOs

Block A — shadow mode

  • Add src/miniserve/traffic.py with the stateless router described in theory/02. Public API: assign(request_id: str, route_config: RouteConfig) -> Assignment. RouteConfig carries strategy + arm SHAs + salt.
  • Configure /v1/grammar/correct so that on each incoming sentence:
      • The router decides "production_arm=A, shadow_arm=B".
      • The handler dispatches A's inference synchronously and serves A's correction to the client.
      • The handler dispatches B's inference as a background asyncio task (do not await it on the request path).
      • When B completes, the handler logs both A and B corrections, both latencies, the assignment, and the trace ID to traffic.log (one JSON line per request).
  • Confirm via inspection: the user-facing latency is ≈ A's latency, not A + B.
  • Send 200 requests sampled from the Phase 20 eval set (use requests or httpx from a small Python client). Confirm 200 entries in traffic.log, each with both response_A and response_B.

Block B — A/B mode (latency-only)

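  • Reconfigure /v1/grammar/correct to A/B 50/50 via hash(request_id, salt="prod-2026-05") mod 2. The hash must be deterministic across processes and restarts (for example a SHA-256 digest); Python's built-in hash() is randomized per process for strings and would break stickiness.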
  • Send 200 more requests. Confirm traffic.log now shows ~100 entries with A served and ~100 with B served (some variance expected from the hash).
  • Important: record only the served response and its latency. We're not collecting quality online — only latency. Online quality A/Bs for the grammar tutor are an anti-pattern (see theory/02).
  • Verify stickiness: send the same request_id twice in a row, confirm both responses come from the same arm.
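One possible shape for the hash-based assignment is sketched below. The assign signature follows the public API named in Block A; the field names on RouteConfig and Assignment, the SHA-256 choice, and the "salt:request_id" key layout are assumptions for illustration, not the design from theory/02.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteConfig:
    strategy: str               # only "ab" is handled in this sketch
    arm_shas: tuple[str, str]   # (SHA of A, SHA of B); field name is an assumption
    salt: str                   # fixed per environment, e.g. "prod-2026-05"


@dataclass(frozen=True)
class Assignment:
    arm: str   # "A" or "B"
    sha: str   # registered model SHA for that arm


def assign(request_id: str, route_config: RouteConfig) -> Assignment:
    """Pure function of (request_id, salt, config): no DB, no Redis, no globals."""
    key = f"{route_config.salt}:{request_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Take 8 bytes of the digest as an integer and reduce mod 2 for a 50/50 split.
    bucket = int.from_bytes(digest[:8], "big") % 2
    return Assignment(arm="AB"[bucket], sha=route_config.arm_shas[bucket])
```

With this shape, stickiness follows from determinism: calling assign twice with the same request_id and salt returns equal Assignment values, while a different salt may move the request to the other arm.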

Block C — offline grading and compare

  • In compare.py:
      • For the shadow traces (Block A), grade both A and B corrections against the Phase 20 canonical conjugation table. Report conjugation_accuracy_A vs conjugation_accuracy_B.
      • Compute the diff distribution: of the 200 requests, in how many did A and B produce different corrections? Of those, how many were correct for A vs correct for B?
      • For the A/B traces (Block B), compute latency percentiles (p50, p95, p99) per arm.
  • Write report.md:
      • Conjugation quality (from shadow). Accuracy A = X%, Accuracy B = Y%. Two-proportion z-test result. 95% CI on the diff. Per-bucket breakdown (per tense, per person, per verb).
      • Latency (from A/B). p50 A vs B, p95 A vs B, p99 A vs B. Significance via Mann-Whitney U on the latency samples.
      • Diff details. A and B disagreed on N/200 cases. Of those, B improved on M (e.g., A said "I goed" was correct, B said "I went"), regressed on K (e.g., A correctly accepted "I went", B mangled it), neutral on N-M-K.
      • Operational recommendation. Based on the above + the CpQU from lab 03 (write the placeholder; fill in after lab 03): promote B / hold B / re-train.
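The statistics that compare.py needs can be sketched with the standard library alone. The helper names below are hypothetical, and diff_breakdown assumes grading is an exact match against an expected string, which may be simpler than what the Phase 20 grader actually does.

```python
import math
from statistics import NormalDist, quantiles


def two_proportion_ztest(correct_a, correct_b, n_a, n_b):
    """Return (z, two-sided p-value, 95% CI on accuracy_B - accuracy_A)."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se_pooled if se_pooled > 0 else 0.0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # The interval uses the unpooled standard error of the difference.
    se_diff = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    half = NormalDist().inv_cdf(0.975) * se_diff
    diff = p_b - p_a
    return z, p_value, (diff - half, diff + half)


def latency_percentiles(samples_ms):
    """p50/p95/p99 of one arm's latencies (needs at least two samples)."""
    cuts = quantiles(samples_ms, n=100, method="inclusive")  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def diff_breakdown(rows):
    """rows: iterable of (response_a, response_b, expected) triples."""
    differ = [(a, b, e) for a, b, e in rows if a != b]
    improved = sum(1 for _, b, e in differ if b == e)   # B right, A wrong
    regressed = sum(1 for a, _, e in differ if a == e)  # A right, B wrong
    return {
        "N": len(differ),
        "M": improved,
        "K": regressed,
        "neutral": len(differ) - improved - regressed,
    }
```

Two caveats on this sketch. The shadow traces are paired (A and B saw the same 200 requests), while the two-proportion z-test treats the arms as independent samples, so a paired test such as McNemar's may be worth reporting alongside it. For the latency comparison, scipy.stats.mannwhitneyu provides Mann-Whitney U if SciPy is available in the environment.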

Block D — manifest + audit hook + observability check

  • manifest.json lists: the two registered canonical SHAs used; the Phase 20 eval set DVC hash; the traffic config; seeds.
  • Append a row to security/THREATS.md: shadow-routing introduces a new code path that runs B unconditionally — confirm B's output never leaks to the client even on a B-side exception. Test with a B that always raises; A's response must still reach the client and latency must not regress.
  • Confirm that src/miniobserve/ (Phase 34) sees the traffic.arm span attribute on every trace. Open the Grafana dashboard from Phase 34 lab 03; filter by traffic.arm == "shadow_B"; confirm panels populate.

Block E — Justfile recipes

  • Add just shadow-on <production_sha> <shadow_sha> to flip the endpoint into shadow mode by writing a route_config.json that src/miniserve/ reads at startup.
  • Add just shadow-off to revert to single-arm production.
  • Document both in the lab README.md.

Constraints

  • Stateless router. The decision is a pure function of (request_id, salt) and the route config. No DB, no Redis. (Phase 33 lab 02 already enforces this.)
  • Sticky assignment. A given request_id always lands on the same arm within a salt epoch. Verify by sending the same request_id twice; both must route to the same model.
  • No quality A/B online. The grading is offline against the canonical Phase 20 table. If your A/B traces require expert review, the experiment is malformed.
  • Run on CPU. Both models are small enough to serve from CPU on Borja's hardware. If latency is too high to complete 200 requests in a reasonable time, batch them.
  • No new src/<module>/. Add traffic.py inside the existing src/miniserve/. Update src/miniserve/BLUEPRINT.md with a "Phase 38 extensions" section listing this addition.

Stop conditions

Done when:

  1. traffic.log has 400 entries (200 shadow + 200 A/B).
  2. report.md includes both quality (from shadow) and latency (from A/B) numbers with significance tests and per-bucket breakdowns.
  3. The threat-model row is committed.
  4. The shadow B-side exception test passes (B crashing does not affect A).
  5. The Grafana dashboard shows the traffic.arm slice populated.
  6. The operational recommendation in report.md is written (even if the placeholder reads "pending lab 03").

Pitfalls

  • Threading the shadow computation. If you run B's inference synchronously on the same task that serves A, you've doubled latency. The shadow must be fire-and-forget (asyncio background task or thread pool). Verify by measuring user-facing latency — it should be ≈ A-only latency, not 2× A.
  • Shadow exception leaks. If B raises and you await it on the request path, the exception bubbles up and the client sees a 500. The handler must catch all exceptions from the shadow task and log them; the client only sees A's response.
  • Salt for hash routing. Use a fixed salt per environment (prod-2026-05). Changing the salt re-shuffles assignment and pollutes A/B comparisons.
  • request_id reuse. If the client doesn't send a request_id, you must generate one server-side (UUIDv4). Don't hash the URL or input — that breaks user-level stickiness.
  • Tracking IDs in logs. Phase 34's structured logging must include the traffic.arm field. If it doesn't, the Grafana dashboards Phase 39 will build won't be able to slice by traffic arm. Verify in the dashboard.
  • Grader version drift. The Phase 20 grader must be the same SHA across both A and B grading runs. The lab's compare.py should assert grader_sha_in_use == eval_baseline.json["grader_sha"] and fail loudly otherwise.
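The grader-drift guard from the last pitfall might look like the sketch below. The function name and the default path are assumptions; only the grader_sha key in eval_baseline.json comes from the lab text. It raises explicitly instead of using an assert statement, because assert is stripped when Python runs with -O.

```python
import json


def check_grader_pinned(grader_sha_in_use: str,
                        baseline_path: str = "eval_baseline.json") -> None:
    """Fail loudly if the grader in use differs from the recorded baseline."""
    with open(baseline_path, encoding="utf-8") as fh:
        expected = json.load(fh)["grader_sha"]
    if grader_sha_in_use != expected:
        raise RuntimeError(
            f"grader SHA drift: in use {grader_sha_in_use!r}, baseline {expected!r}"
        )
```

Calling this once at the top of compare.py, before any grading, keeps both the A and B grading runs tied to the same grader version.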

When to consult solutions/

After all five blocks are done. solutions/01-shadow-ab-ref.md (phase open) reviews your routing design, the shadow exception-handling pattern, and the report structure.


Next lab: lab/02-drift-detection.md.