Skip to content

English · Español

Lab 01 — End-to-end grammar-tutor request

🇪🇸 Una petición HTTP, las nueve capas. Se sigue el trace_id, se mide cada etapa, se verifica la identidad del coste y se confirma que cada panel del dashboard reacciona. Si una etapa no aparece en el trace, la arquitectura mentía. Lo arreglas en su fase de origen.

Goal

Send one canonical POST /v1/grammar/correct request through the live stack from Lab 00. Capture the full trace tree. Verify every contract from Theory 01 (architecture) and Theory 02 (data flow). Compute the cost-identity and end-to-end-p95 numbers from Theory 03. Make every dashboard panel populate within 60 s.

Deliverables

  • experiments/39-end-to-end/request-walkthrough.md — a narrated transcript of the one request, with span IDs, timings, byte counts.
  • experiments/39-end-to-end/trace-export.json — exported from Tempo for the canonical request.
  • experiments/39-end-to-end/dashboard-screenshot.png — Grafana dashboard with all 10 panels populated.
  • tests/integration/test_request_contracts.py — twelve parameterized tests, one per contract from Theory 01's contract table.
  • tests/integration/test_cost_identity.py — per-request assertion that the cost identity holds within 0.1%.
  • tests/integration/test_percentile_arithmetic.py — one-shot demonstration of the fallacy (Theory 02).

Step 1 — The canonical request

Payload: {"sentence": "Yesterday I goed to the store"}. Stored under scripts/demo/payloads/happy-path-001.json.

Manual send (so you can read each piece of the response):

$ just demo-cold-up
$ curl -sS -X POST http://localhost:8080/v1/grammar/correct \
    -H "Content-Type: application/json" \
    -H "X-Request-Id: lab01-walkthrough-001" \
    --data @scripts/demo/payloads/happy-path-001.json | jq

Expected response shape (Phase 30 schema):

{
  "correction": "Yesterday I went to the store",
  "explanation": "Past simple of 'go' is irregular: 'went'.",
  "spanish_translation": "Ayer fui a la tienda",
  "cited_chunks": ["irregular-go-past-001"],
  "metadata": {
    "trace_id": "...",
    "cost_eur": 0.00041,
    "duration_ms": 3287
  }
}

The trace_id is the handle for the rest of the lab.

Step 2 — Walk the trace

$ curl -sS "http://localhost:3200/api/traces/$TRACE_ID" | jq > experiments/39-end-to-end/trace-export.json

Expected spans (from Theory 02's nine layers + Theory 01's contract):

Span name Parent Duration (typ) Critical attributes
http.request (root) ~3,300 ms request_id, cost.eur, http.status_code
validate.body http.request ~0.3 ms
security.check http.request ~0.2 ms security.allow=true
tokenize.bpe http.request ~1.0 ms tokens.n=10
retrieve.hybrid http.request ~25 ms chunks.top=5, top.chunk_id=irregular-go-past-001
model.prefill http.request ~800 ms seq_len=80
model.decode http.request ~2,500 ms tokens.generated=30
format.json http.request ~2.0 ms schema.valid=true
cost.emit http.request ~1.5 ms cost.eur=0.00041, cost.identity_ok=true

If a span is missing: the architecture diagram lied. Open the originating phase's BLUEPRINT.md, find where the span is supposed to be emitted, file a fix-commit in that phase. Phase 39 does not patch.

Step 3 — Twelve contract tests

tests/integration/test_request_contracts.py walks the contract table from Theory 01 §contracts (12 producer/consumer pairs). One parameterized test:

import pytest
from .contracts import CONTRACTS

@pytest.mark.parametrize("contract", CONTRACTS, ids=lambda c: c.id)
def test_contract_holds(stack, contract):
    """Each producer/consumer pair in the demo path satisfies its contract."""
    inputs = contract.fixture_inputs()
    output = contract.producer(*inputs)
    contract.consumer(output)        # raises on contract violation
    assert contract.invariant(output)

CONTRACTS is a list of 12 dataclasses (built in lab 00). Each has id, producer, consumer, fixture_inputs(), invariant().

A failure here means an integration test for a previously-unsuspected drift fired. Open the contract; fix in the originating phase.

Step 4 — Cost identity

tests/integration/test_cost_identity.py:

def test_cost_identity_within_tolerance(stack):
    """Sum of per-stage costs equals total cost within 0.1%."""
    response = send_canonical_request(stack)
    trace = fetch_trace(response.headers["X-Trace-Id"])
    stage_costs = sum(s.attributes["cost.eur"] for s in trace.spans
                      if s.attributes.get("cost.stage"))
    total_cost = trace.root_span.attributes["cost.eur"]
    assert abs(stage_costs - total_cost) / total_cost < 0.001, \
        f"identity violated: stages={stage_costs:.6f} total={total_cost:.6f}"

The first run might fail by a few percent — usually because one stage is not registered with the cost emitter, or its time is double-counted. Walk the failing case, identify the orphan stage, fix in the originating phase, re-run.

When the test passes, annotate in the experiment log: 2026-06-XX — cost identity now holds within 0.04% on 5 consecutive runs.

Step 5 — Percentile arithmetic demo

tests/integration/test_percentile_arithmetic.py:

import numpy as np

def test_percentile_arithmetic_fallacy_is_demonstrated():
    """Sum of per-stage p95s overstates the true end-to-end p95."""
    rng = np.random.default_rng(42)
    n = 10_000
    # Two correlated stages with realistic distributions.
    A = rng.lognormal(mean=0.0, sigma=0.5, size=n) * 1000  # ms
    B = rng.lognormal(mean=-0.3, sigma=0.4, size=n) * 800
    naive = np.percentile(A, 95) + np.percentile(B, 95)
    true = np.percentile(A + B, 95)
    overstatement = (naive - true) / true
    assert overstatement > 0.05, "expected naive p95 to overstate by >5%"
    assert overstatement < 0.30, "overstatement suspiciously large; check distributions"

This is a demonstration test, not a regression test. Its job is to keep the lesson honest in the codebase. The lab walkthrough captures the printed numbers in request-walkthrough.md.

Step 6 — Dashboard populating

After the canonical request, wait 60 s, then verify each panel:

$ uv run python scripts/audit_dashboard.py --dashboard capstone

The script (Lab 00 wrote it, this lab uses it) queries Grafana for each panel's underlying query, runs the query against Prometheus / Tempo, and asserts non-empty result.

Failure modes:

  • "No Data" on cost panel → cost histogram has zero observations (the request didn't complete? cost emitter not firing?). Inspect lynx_cost_eur_per_request_count directly.
  • "No Data" on per-stage panel → trace ingest fell behind, or sample rate is non-100%. Check Tempo's ingest lag; bump sample rate to 100% for the demo run.
  • Orphan-span count > 0 → trace context propagation broke. Identify the orphan; trace its parent expectation; fix in the originating phase.

Take the dashboard screenshot once all 10 panels populate. Commit as experiments/39-end-to-end/dashboard-screenshot.png.

Step 7 — request-walkthrough.md

Structured transcript:

# Canonical request walkthrough — 2026-06-XX

## Request
POST /v1/grammar/correct
Body: {"sentence": "Yesterday I goed to the store"}
Trace ID: e21ab9...

## Stage timings (from trace)
| Stage | Duration | Bytes in / out |
|---|---|---|
| validate.body | 0.31 ms | 70 B / 70 B |
| security.check | 0.18 ms | n/a |
| tokenize.bpe | 1.04 ms | 30 B → 80 B |
| retrieve.hybrid | 26.5 ms | 80 B → 2 KB |
| model.prefill | 812 ms | 80 B → 5 MB (logits) |
| model.decode | 2421 ms | KV cache → 30 tokens |
| format.json | 1.91 ms | 80 B → 412 B |
| cost.emit | 1.43 ms | 412 B → 412 B + 2 KB metrics |
| **total** | **3263 ms** | — |

## Cost identity
Stage sum: €0.000412
Root span: €0.000412
Δ: 0.0% — OK

## Latency budget
Allocated 5,000 ms; consumed 3,263 ms; buffer 1,737 ms (35%).

## Findings
- All 12 contracts passed.
- All 9 spans present; trace tree well-formed.
- 0 orphan spans.
- Percentile-arithmetic demo: naive p95 sum overstated true p95 by 14.2% (n=10,000 synthetic samples).

What "done" looks like

  • experiments/39-end-to-end/request-walkthrough.md committed with full numbers.
  • experiments/39-end-to-end/trace-export.json committed.
  • experiments/39-end-to-end/dashboard-screenshot.png committed with all 10 panels populated.
  • tests/integration/test_request_contracts.py — 12 tests passing.
  • tests/integration/test_cost_identity.py — passing on 5 consecutive requests.
  • tests/integration/test_percentile_arithmetic.py — passing.
  • All findings of "missing span" or "broken contract" are fixed in their originating phase, not in Phase 39.

Common pitfalls

  1. Patching in Phase 39. As Lab 00. Every fix lands in the originating phase.
  2. Trusting one happy-path request. Send the same request 5 times; verify the trace tree, cost identity, and dashboard populate every time. One success is anecdote.
  3. Mis-reading "No Data" as "panel broken". Sometimes the time range is wrong; sometimes the metric label is lynx_cost_eur vs lynx_cost_eur_per_request. The audit script reports the exact failing query.
  4. Counting validation/security spans in the latency budget. They're real work, but ~1 ms each; for budget allocation, group them under a single "overhead" line.
  5. Skipping the percentile-arithmetic demo because "I already know that". The test is in the suite to prevent future contributors from rebuilding a naive sum-of-p95s panel. It's a guardrail.

Next: lab/02-load-and-shadow.md — 10-concurrent load + the Phase 38 shadow LoRA variant.