Lab 02: Load and shadow
The single-request demo proves correctness; this lab tests behavior under concurrency. Ten simultaneous clients run for 60 seconds, with a shadow of the Phase 38 LoRA adapter running in parallel to the baseline so that latency and cost can be compared without affecting the user.
Goal
Run a controlled load test (10 concurrent clients, 60 s) against the live demo stack. Verify the p95 < 5 s DoD target holds. Simultaneously run a shadow of the Phase 38 promoted LoRA variant — same input, response not returned to the user, latency + accuracy logged — and produce the comparison dashboard screenshot.
Why this lab exists
A single-request demo proves the stack works. A 10-concurrent demo proves it holds up. Two specific properties are verified that the single-request lab cannot:
- Queueing under contention (Theory 02 §Little's Law). With 10 requests in flight and a mean response time of ~3.3 s, Little's Law (L = λW) gives λ = 10 / 3.3 ≈ 3 req/s of sustained throughput. If the stack cannot sustain that rate, queue depth climbs and the demo's p95 panel will show it before the DoD test fails.
- Shadow comparison. Phase 38 promoted a LoRA-tuned variant; Phase 39 verifies that variant's improvements (accuracy, latency, cost) on live traffic without exposing users to a regression risk.
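The throughput figure in the first bullet can be checked with a few lines of arithmetic. This is a minimal sketch; the function name `required_throughput` is illustrative and not part of the lab's codebase.

```python
def required_throughput(concurrency: int, mean_response_s: float) -> float:
    """Little's Law (L = lambda * W) solved for lambda, in requests per second."""
    if mean_response_s <= 0:
        raise ValueError("mean_response_s must be positive")
    return concurrency / mean_response_s


# Values taken from the lab text: 10 concurrent clients, ~3.3 s mean response.
rate = required_throughput(10, 3.3)
expected_requests = rate * 60  # over the 60 s window
print(f"{rate:.2f} req/s, about {expected_requests:.0f} requests in 60 s")
```

The same arithmetic gives the order of magnitude to expect for the total request count of a 60 s run.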
Deliverables
- scripts/demo/load.py: wrk-style load generator with 10 workers, 60 s window; parameterizes payload set, sample-rate, baseline vs shadow.
- infra/compose/full-stack-shadow.yml: extends full-stack.yml with a second miniserve instance loading the shadow LoRA adapter (different volume mount).
- experiments/39-load-and-shadow/load-baseline.json: per-request log of the 60-second baseline run.
- experiments/39-load-and-shadow/load-shadow.json: same for the shadow run.
- experiments/39-load-and-shadow/comparison.md: table comparing baseline vs shadow on p50/p95/p99/cost/accuracy.
- experiments/39-load-and-shadow/dashboard-shadow.png: Grafana screenshot with both variants visualized.
- tests/integration/test_load_dod.py: pytest that asserts p95 < 5 s under 10-concurrent.
Step 1: Load generator
scripts/demo/load.py:
import asyncio
import json
import random
import time
from pathlib import Path

import httpx

# Payload set: every happy-path JSON file under scripts/demo/payloads.
PAYLOADS = [json.loads(p.read_text()) for p in Path("scripts/demo/payloads").glob("happy-path-*.json")]


async def worker(client, base_url, results, deadline):
    """Send requests back-to-back until the deadline, recording one entry per request."""
    while time.monotonic() < deadline:
        payload = random.choice(PAYLOADS)
        t0 = time.perf_counter()
        try:
            r = await client.post(f"{base_url}/v1/grammar/correct", json=payload, timeout=10.0)
            t1 = time.perf_counter()
            results.append({
                "duration_ms": (t1 - t0) * 1000,
                "status": r.status_code,
                "cost_eur": r.json().get("metadata", {}).get("cost_eur"),
                "trace_id": r.headers.get("X-Trace-Id"),
            })
        except Exception as e:
            # Timeouts, connection errors, and non-JSON bodies all count as errors.
            results.append({"duration_ms": None, "status": "error", "error": str(e)})


async def run(base_url, concurrency=10, duration_s=60):
    """Run `concurrency` workers against base_url for duration_s seconds."""
    deadline = time.monotonic() + duration_s  # monotonic clock: immune to wall-clock jumps
    results = []
    async with httpx.AsyncClient() as client:
        await asyncio.gather(*[worker(client, base_url, results, deadline) for _ in range(concurrency)])
    return results
The load is closed-loop and steady-state: each worker sends its next request as soon as the previous one returns, with no inter-arrival distribution. For the demo this is enough; bursty and Poisson-distributed load patterns are Phase 40 reading.
Step 2: Baseline run
$ just demo-cold-up
$ uv run python scripts/demo/load.py --concurrency 10 --duration 60 \
--base-url http://localhost:8080 --output experiments/39-load-and-shadow/load-baseline.json
Read the output:
Expected:
| Metric | Target | Typical |
|---|---|---|
| Total requests | n/a | ~180 |
| Error rate | < 1% | 0 |
| p50 latency | < 3 s | ~3.0 s |
| p95 latency | < 5 s (DoD) | ~4.5 s |
| p99 latency | < 7 s | ~5.8 s |
| Mean cost / req | n/a | €0.00042 |
| Total cost (60 s) | n/a | €0.076 |
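The metrics in the table above can be recomputed directly from the per-request records. The lab's own summarize_load.py is not shown on this page, so the following is a stand-in sketch: the `percentile` and `summarize` helpers and the synthetic `sample` records are assumptions for illustration, and only the record fields (`duration_ms`, `status`, `cost_eur`) come from load.py.

```python
import math


def percentile(sorted_values: list[float], q: float) -> float:
    """Linear-interpolation percentile (q in [0, 100]) over pre-sorted values."""
    if not sorted_values:
        raise ValueError("no values to summarize")
    rank = (len(sorted_values) - 1) * q / 100
    lo, hi = math.floor(rank), math.ceil(rank)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (rank - lo)


def summarize(results: list[dict]) -> dict:
    """Reduce per-request records to the metrics listed in the table."""
    ok = [r for r in results if r["status"] == 200]
    durations = sorted(r["duration_ms"] for r in ok)
    costs = [r["cost_eur"] for r in ok if r.get("cost_eur") is not None]
    return {
        "total_requests": len(results),
        "error_rate": (1 - len(ok) / len(results)) if results else 0.0,
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
        "mean_cost_eur": sum(costs) / len(costs) if costs else None,
        "total_cost_eur": sum(costs),
    }


# Synthetic records (not real measurements): five successes and one error.
sample = [
    {"duration_ms": 1000.0 * i, "status": 200, "cost_eur": 0.0004} for i in range(1, 6)
] + [{"duration_ms": None, "status": "error", "error": "timeout"}]
summary = summarize(sample)
print(summary)
```

Failed requests are counted in the error rate but excluded from the latency percentiles, so a run with many errors should be judged on both numbers together.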
If p95 ≥ 5 s, the DoD check fails. Open experiments/39-load-and-shadow/log.md and investigate:
- Queue depth panel: did it climb? Use Phase 33's continuous-batching profile.
- Per-stage latency: which stage's tail blew up? Decode is the usual suspect.
- CPU utilization on Borja's i5-8250U: at 100%, the bottleneck is hardware, not the code.
If hardware-bound: document the limit in comparison.md and lower the concurrency to 6. The DoD target's "10 concurrent" assumes a baseline server; Borja's laptop may need adjustment per the Plan's open-questions §6.
Step 3: Shadow variant setup
Phase 38 promoted a LoRA variant whose adapter is at artifacts/lora/grammar-promoted-rev-sha:a1b2c3.safetensors. To run it as a shadow:
infra/compose/full-stack-shadow.yml:
name: lynx-cortex-demo-shadow
include:
  - full-stack.yml
services:
  miniserve-shadow:
    extends:
      file: ./miniserve.yml
      service: miniserve
    container_name: lynx-miniserve-shadow
    ports:
      - "8081:8080"
    environment:
      - MINISERVE_VARIANT=shadow
      - LORA_ADAPTER_PATH=/models/grammar-promoted-rev-sha:a1b2c3.safetensors
    volumes:
      - ../../artifacts/lora:/models:ro
The shadow listens on :8081. The baseline serves user traffic on :8080. The load generator sends to both:
# Modified load.py:
await asyncio.gather(
    run(base_url="http://localhost:8080", ...),  # baseline
    run(base_url="http://localhost:8081", ...),  # shadow (same payloads)
)
Critical: the shadow's response is not returned to a real user. It's logged for comparison only. Phase 38 already implemented this contract; Phase 39 wires the compose-level routing.
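The contract described above (same input, shadow response logged but never returned) can be sketched at the request level with plain asyncio. Phase 38's actual implementation is not shown on this page; `mirror_request`, the injected `call_baseline` and `call_shadow` coroutines, and the log record shape are all hypothetical names chosen for illustration.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable

Sender = Callable[[dict], Awaitable[Any]]


async def _timed_shadow(send: Sender, payload: dict) -> dict:
    """Call the shadow and return a log record; errors are recorded, not raised."""
    t0 = time.perf_counter()
    try:
        await send(payload)
        status = "ok"
    except Exception as exc:  # a shadow failure must never reach the user
        status = f"error: {exc}"
    return {"variant": "shadow", "status": status,
            "duration_ms": (time.perf_counter() - t0) * 1000}


async def mirror_request(payload: dict, call_baseline: Sender, call_shadow: Sender,
                         shadow_log: list[dict], pending: set[asyncio.Task]) -> Any:
    """Send one payload to both variants; only the baseline response is returned."""
    task = asyncio.create_task(_timed_shadow(call_shadow, payload))
    pending.add(task)  # keep a strong reference until the task finishes

    def _record(done: asyncio.Task) -> None:
        pending.discard(done)
        if not done.cancelled():
            shadow_log.append(done.result())

    task.add_done_callback(_record)
    return await call_baseline(payload)  # the caller never waits on the shadow


async def _demo() -> tuple[Any, list[dict]]:
    async def baseline(payload: dict) -> dict:
        return {"corrected": payload["text"], "variant": "baseline"}

    async def shadow(payload: dict) -> dict:
        await asyncio.sleep(0.01)  # pretend the shadow is slower
        raise RuntimeError("adapter failed to load")

    log: list[dict] = []
    pending: set[asyncio.Task] = set()
    response = await mirror_request({"text": "hola"}, baseline, shadow, log, pending)
    if pending:  # drain shadow tasks before the event loop closes
        await asyncio.wait(pending)
    return response, log


response, log = asyncio.run(_demo())
print(response, log)
```

In the demo the shadow deliberately fails: the caller still receives the baseline response, and the failure shows up only as a log record. Sending the identical payload to both variants also makes the per-request comparison paired, which two independently sampling load runs do not guarantee.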
Step 4: Side-by-side comparison
experiments/39-load-and-shadow/comparison.md:
# Baseline vs Shadow — load run 2026-06-XX
## Setup
- Concurrency: 10
- Duration: 60 s
- Payloads: 5 happy-path sentences (random sampling)
- Baseline: base Mini-GPT (no LoRA)
- Shadow: Mini-GPT + LoRA `grammar-promoted-rev-sha:a1b2c3`
## Latency
| Percentile | Baseline | Shadow | Δ |
|---|---|---|---|
| p50 | 3.05 s | 3.20 s | +5% |
| p95 | 4.61 s | 4.92 s | +7% |
| p99 | 5.82 s | 6.18 s | +6% |
## Cost
| Metric | Baseline | Shadow | Δ |
|---|---|---|---|
| Mean cost / req | €0.00042 | €0.00045 | +7% |
| Total cost | €0.076 | €0.081 | +7% |
## Accuracy (vs Phase 20 ground-truth labels for sampled payloads)
| Metric | Baseline | Shadow | Δ |
|---|---|---|---|
| Correction accuracy | 0.76 | 0.91 | **+15pp** |
| Spanish-translation accuracy | 0.82 | 0.94 | +12pp |
## Verdict
Shadow trades ~7% on latency and cost for **+15pp accuracy**. CpQU (Phase 38)
should be the deciding metric — load shadow data into the Phase 38 CpQU
aggregator and check whether the trade is favorable.
This table is the single most important artifact of Lab 02. Phase 38 already taught CpQU as the lens for promotion decisions; Phase 39 makes it operational by feeding it real load data.
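The Δ columns in the example comparison.md mix two units: relative change in percent for latency and cost, and absolute change in percentage points for accuracy. A small sketch of both calculations, using helper names (`pct_delta`, `pp_delta`) that are illustrative only:

```python
def pct_delta(baseline: float, shadow: float) -> float:
    """Relative change of shadow vs baseline, in percent."""
    return (shadow - baseline) / baseline * 100


def pp_delta(baseline: float, shadow: float) -> float:
    """Absolute change between two rates, in percentage points."""
    return (shadow - baseline) * 100


# Figures copied from the example comparison.md above.
latency_s = {"p50": (3.05, 3.20), "p95": (4.61, 4.92), "p99": (5.82, 6.18)}
for name, (base, shadow) in latency_s.items():
    print(f"{name}: {pct_delta(base, shadow):+.0f}%")
print(f"correction accuracy: {pp_delta(0.76, 0.91):+.0f}pp")
```

Keeping the two units separate matters when the numbers feed the CpQU aggregator: a +15pp accuracy gain from 0.76 is roughly a 20% relative improvement, not 15%.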
Step 5: Dashboard with both variants
The Grafana dashboard's panels already accept a variant label. After the shadow run, the panels split:
- Latency histogram has two overlapping distributions (baseline blue, shadow orange).
- Cost per request panel shows two lines.
- Per-stage latency stacked bar shows two columns side-by-side.
Take the screenshot once both variants are populated. Commit it as experiments/39-load-and-shadow/dashboard-shadow.png.
Step 6: DoD assertion
tests/integration/test_load_dod.py:
import asyncio

import numpy as np

from scripts.demo.load import run  # assumes scripts/ is importable from the repo root


def test_p95_under_5s_at_10_concurrent(stack):
    """DoD: p95 latency under 5 s with 10 concurrent clients."""
    results = asyncio.run(run(stack.base_url, concurrency=10, duration_s=60))
    successful = [r for r in results if r["status"] == 200]
    assert successful, "no successful requests; cannot compute p95"
    durations_ms = [r["duration_ms"] for r in successful]
    p95 = np.percentile(durations_ms, 95)
    assert p95 < 5000, f"p95={p95:.0f} ms exceeds 5000 ms DoD target"
This runs in CI on every PR. A regression that pushes p95 over 5 s blocks merge.
What "done" looks like
- scripts/demo/load.py exists; summarize_load.py exists.
- infra/compose/full-stack-shadow.yml exists; just demo-cold-up-shadow brings both miniserve instances up.
- Baseline 60 s × 10-concurrent run completed; results in load-baseline.json.
- Shadow 60 s × 10-concurrent run completed; results in load-shadow.json.
- comparison.md written with both tables.
- dashboard-shadow.png committed.
- test_load_dod.py passing in CI.
- If hardware-bound, documented in comparison.md with the adjusted concurrency.
Common pitfalls
- Running shadow on the same process as baseline. The whole point is isolation; if they share a process, latency comparisons are noise. Separate containers, separate ports.
- Forgetting to flush traces. At a high request rate OTel batches spans, so the last few seconds of trace data may not appear in Tempo before teardown. Sleep 5 s before teardown or send an OTel force_flush.
- Comparing accuracy on a tiny sample. 5 payloads × 60 s with random sampling gives ~36 hits per payload: enough for a directional signal, not enough for a publication number. Document the sample size; don't oversell the +15pp.
- Reading p95 from the dashboard before all data arrives. Wait 60 s after load ends; recompute from load-*.json for ground truth.
- Promoting the shadow inside Phase 39. The CpQU verdict feeds Phase 38's promotion process. Phase 39 ships the data; Phase 38 decides.
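To put a number on the tiny-sample pitfall, one option is to report a confidence interval next to each accuracy figure. The sketch below uses the Wilson score interval, which is a choice made here rather than something the lab prescribes; the counts are illustrative, derived from the ~180 requests per run and the example accuracies in comparison.md.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Illustrative counts only: ~180 requests per run, accuracies near 0.76 and 0.91.
n = 180
base_lo, base_hi = wilson_interval(round(0.76 * n), n)
shadow_lo, shadow_hi = wilson_interval(round(0.91 * n), n)
print(f"baseline: [{base_lo:.3f}, {base_hi:.3f}]")
print(f"shadow:   [{shadow_lo:.3f}, {shadow_hi:.3f}]")
```

These intervals treat every response as independent. With only 5 distinct payloads the responses are heavily correlated, so the real uncertainty is wider than the interval suggests, which is one more reason to document the sample size.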
Next: lab/03-security-runthrough.md — replay three threat-model rows through the live service.