English · Español
Lab 01 — End-to-end grammar-tutor request¶
🇪🇸 Una petición HTTP, las nueve capas. Se sigue el trace_id, se mide cada etapa, se verifica la identidad del coste y se confirma que cada panel del dashboard reacciona. Si una etapa no aparece en el trace, la arquitectura mentía. Lo arreglas en su fase de origen.
Goal¶
Send one canonical POST /v1/grammar/correct request through the live stack from Lab 00. Capture the full trace tree. Verify every contract from Theory 01 (architecture) and Theory 02 (data flow). Compute the cost-identity and end-to-end-p95 numbers from Theory 03. Make every dashboard panel populate within 60 s.
Deliverables¶
experiments/39-end-to-end/request-walkthrough.md— a narrated transcript of the one request, with span IDs, timings, byte counts.experiments/39-end-to-end/trace-export.json— exported from Tempo for the canonical request.experiments/39-end-to-end/dashboard-screenshot.png— Grafana dashboard with all 10 panels populated.tests/integration/test_request_contracts.py— twelve parameterized tests, one per contract from Theory 01's contract table.tests/integration/test_cost_identity.py— per-request assertion that the cost identity holds within 0.1%.tests/integration/test_percentile_arithmetic.py— one-shot demonstration of the fallacy (Theory 02).
Step 1 — The canonical request¶
Payload: {"sentence": "Yesterday I goed to the store"}. Stored under scripts/demo/payloads/happy-path-001.json.
Manual send (so you can read each piece of the response):
$ just demo-cold-up
$ curl -sS -X POST http://localhost:8080/v1/grammar/correct \
-H "Content-Type: application/json" \
-H "X-Request-Id: lab01-walkthrough-001" \
--data @scripts/demo/payloads/happy-path-001.json | jq
Expected response shape (Phase 30 schema):
{
"correction": "Yesterday I went to the store",
"explanation": "Past simple of 'go' is irregular: 'went'.",
"spanish_translation": "Ayer fui a la tienda",
"cited_chunks": ["irregular-go-past-001"],
"metadata": {
"trace_id": "...",
"cost_eur": 0.00041,
"duration_ms": 3287
}
}
The trace_id is the handle for the rest of the lab.
Step 2 — Walk the trace¶
$ curl -sS "http://localhost:3200/api/traces/$TRACE_ID" | jq > experiments/39-end-to-end/trace-export.json
Expected spans (from Theory 02's nine layers + Theory 01's contract):
| Span name | Parent | Duration (typ) | Critical attributes |
|---|---|---|---|
http.request |
(root) | ~3,300 ms | request_id, cost.eur, http.status_code |
validate.body |
http.request |
~0.3 ms | — |
security.check |
http.request |
~0.2 ms | security.allow=true |
tokenize.bpe |
http.request |
~1.0 ms | tokens.n=10 |
retrieve.hybrid |
http.request |
~25 ms | chunks.top=5, top.chunk_id=irregular-go-past-001 |
model.prefill |
http.request |
~800 ms | seq_len=80 |
model.decode |
http.request |
~2,500 ms | tokens.generated=30 |
format.json |
http.request |
~2.0 ms | schema.valid=true |
cost.emit |
http.request |
~1.5 ms | cost.eur=0.00041, cost.identity_ok=true |
If a span is missing: the architecture diagram lied. Open the originating phase's BLUEPRINT.md, find where the span is supposed to be emitted, file a fix-commit in that phase. Phase 39 does not patch.
Step 3 — Twelve contract tests¶
tests/integration/test_request_contracts.py walks the contract table from Theory 01 §contracts (12 producer/consumer pairs). One parameterized test:
import pytest
from .contracts import CONTRACTS
@pytest.mark.parametrize("contract", CONTRACTS, ids=lambda c: c.id)
def test_contract_holds(stack, contract):
"""Each producer/consumer pair in the demo path satisfies its contract."""
inputs = contract.fixture_inputs()
output = contract.producer(*inputs)
contract.consumer(output) # raises on contract violation
assert contract.invariant(output)
CONTRACTS is a list of 12 dataclasses (built in lab 00). Each has id, producer, consumer, fixture_inputs(), invariant().
A failure here means an integration test for a previously-unsuspected drift fired. Open the contract; fix in the originating phase.
Step 4 — Cost identity¶
tests/integration/test_cost_identity.py:
def test_cost_identity_within_tolerance(stack):
"""Sum of per-stage costs equals total cost within 0.1%."""
response = send_canonical_request(stack)
trace = fetch_trace(response.headers["X-Trace-Id"])
stage_costs = sum(s.attributes["cost.eur"] for s in trace.spans
if s.attributes.get("cost.stage"))
total_cost = trace.root_span.attributes["cost.eur"]
assert abs(stage_costs - total_cost) / total_cost < 0.001, \
f"identity violated: stages={stage_costs:.6f} total={total_cost:.6f}"
The first run might fail by a few percent — usually because one stage is not registered with the cost emitter, or its time is double-counted. Walk the failing case, identify the orphan stage, fix in the originating phase, re-run.
When the test passes, annotate in the experiment log: 2026-06-XX — cost identity now holds within 0.04% on 5 consecutive runs.
Step 5 — Percentile arithmetic demo¶
tests/integration/test_percentile_arithmetic.py:
import numpy as np
def test_percentile_arithmetic_fallacy_is_demonstrated():
"""Sum of per-stage p95s overstates the true end-to-end p95."""
rng = np.random.default_rng(42)
n = 10_000
# Two correlated stages with realistic distributions.
A = rng.lognormal(mean=0.0, sigma=0.5, size=n) * 1000 # ms
B = rng.lognormal(mean=-0.3, sigma=0.4, size=n) * 800
naive = np.percentile(A, 95) + np.percentile(B, 95)
true = np.percentile(A + B, 95)
overstatement = (naive - true) / true
assert overstatement > 0.05, "expected naive p95 to overstate by >5%"
assert overstatement < 0.30, "overstatement suspiciously large; check distributions"
This is a demonstration test, not a regression test. Its job is to keep the lesson honest in the codebase. The lab walkthrough captures the printed numbers in request-walkthrough.md.
Step 6 — Dashboard populating¶
After the canonical request, wait 60 s, then verify each panel:
The script (Lab 00 wrote it, this lab uses it) queries Grafana for each panel's underlying query, runs the query against Prometheus / Tempo, and asserts non-empty result.
Failure modes:
- "No Data" on cost panel → cost histogram has zero observations (the request didn't complete? cost emitter not firing?). Inspect
lynx_cost_eur_per_request_countdirectly. - "No Data" on per-stage panel → trace ingest fell behind, or sample rate is non-100%. Check Tempo's ingest lag; bump sample rate to 100% for the demo run.
- Orphan-span count > 0 → trace context propagation broke. Identify the orphan; trace its parent expectation; fix in the originating phase.
Take the dashboard screenshot once all 10 panels populate. Commit as experiments/39-end-to-end/dashboard-screenshot.png.
Step 7 — request-walkthrough.md¶
Structured transcript:
# Canonical request walkthrough — 2026-06-XX
## Request
POST /v1/grammar/correct
Body: {"sentence": "Yesterday I goed to the store"}
Trace ID: e21ab9...
## Stage timings (from trace)
| Stage | Duration | Bytes in / out |
|---|---|---|
| validate.body | 0.31 ms | 70 B / 70 B |
| security.check | 0.18 ms | n/a |
| tokenize.bpe | 1.04 ms | 30 B → 80 B |
| retrieve.hybrid | 26.5 ms | 80 B → 2 KB |
| model.prefill | 812 ms | 80 B → 5 MB (logits) |
| model.decode | 2421 ms | KV cache → 30 tokens |
| format.json | 1.91 ms | 80 B → 412 B |
| cost.emit | 1.43 ms | 412 B → 412 B + 2 KB metrics |
| **total** | **3263 ms** | — |
## Cost identity
Stage sum: €0.000412
Root span: €0.000412
Δ: 0.0% — OK
## Latency budget
Allocated 5,000 ms; consumed 3,263 ms; buffer 1,737 ms (35%).
## Findings
- All 12 contracts passed.
- All 9 spans present; trace tree well-formed.
- 0 orphan spans.
- Percentile-arithmetic demo: naive p95 sum overstated true p95 by 14.2% (n=10,000 synthetic samples).
What "done" looks like¶
-
experiments/39-end-to-end/request-walkthrough.mdcommitted with full numbers. -
experiments/39-end-to-end/trace-export.jsoncommitted. -
experiments/39-end-to-end/dashboard-screenshot.pngcommitted with all 10 panels populated. -
tests/integration/test_request_contracts.py— 12 tests passing. -
tests/integration/test_cost_identity.py— passing on 5 consecutive requests. -
tests/integration/test_percentile_arithmetic.py— passing. - All findings of "missing span" or "broken contract" are fixed in their originating phase, not in Phase 39.
Common pitfalls¶
- Patching in Phase 39. As Lab 00. Every fix lands in the originating phase.
- Trusting one happy-path request. Send the same request 5 times; verify the trace tree, cost identity, and dashboard populate every time. One success is anecdote.
- Mis-reading "No Data" as "panel broken". Sometimes the time range is wrong; sometimes the metric label is
lynx_cost_eurvslynx_cost_eur_per_request. The audit script reports the exact failing query. - Counting validation/security spans in the latency budget. They're real work, but ~1 ms each; for budget allocation, group them under a single "overhead" line.
- Skipping the percentile-arithmetic demo because "I already know that". The test is in the suite to prevent future contributors from rebuilding a naive sum-of-p95s panel. It's a guardrail.
Next: lab/02-load-and-shadow.md — 10-concurrent load + the Phase 38 shadow LoRA variant.