Theory 03 — Cost and observability, stitched
Three separate pieces (the per-request cost emitter from Phase 34, the CpQU table from Phase 38, and the Prometheus/Grafana/Tempo stack) come together here in a single dashboard. The central rule: one source of truth per number. Cost is measured in one place; everything else reads it. If two components compute the same thing, the dashboard will show 2× the real value and nobody will notice until it is too late.
Why stitching is hard
Each observability piece, built in isolation, works. Together they over-count, under-count, or double-publish — unless the boundaries are designed.
The three pieces:
- Phase 34's cost emitter. A FastAPI middleware that wraps each request, measures wall time per stage, multiplies by `r_cpu` (€/s on this hardware), and emits a Prometheus histogram (`lynx_cost_eur_per_request`) plus a span attribute on the request trace.
- Phase 38's CpQU table. A weekly batch job that aggregates quality-adjusted cost (cost ÷ accuracy ÷ user value weight) into a markdown table at `docs/COSTS.md`. CpQU = Cost per Quality Unit.
- The observability stack. Prometheus scrapes `lynx_cost_eur_per_request`; Grafana visualizes; Tempo stores traces; the single dashboard `infra/grafana/dashboards/capstone.json` joins everything.
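As a rough illustration of what the cost emitter does for one request, the sketch below times named stages and multiplies each by a CPU rate. It uses only the standard library. `StageTimer`, `fake_clock`, and the value of `R_CPU_EUR_PER_S` are placeholders invented for this example, not the capstone's actual code or pricing.

```python
import time
from contextlib import contextmanager

# Assumed rate in euros per CPU-second; a placeholder, not a real price.
R_CPU_EUR_PER_S = 0.0001


class StageTimer:
    """Accumulates wall time per named stage for one request."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.seconds = {}

    @contextmanager
    def stage(self, name):
        start = self._clock()
        try:
            yield
        finally:
            elapsed = self._clock() - start
            self.seconds[name] = self.seconds.get(name, 0.0) + elapsed

    def costs(self, r_cpu=R_CPU_EUR_PER_S):
        """Per-stage cost in euros: wall time multiplied by the rate."""
        return {name: secs * r_cpu for name, secs in self.seconds.items()}


def fake_clock(ticks):
    """Deterministic clock for the demonstration: yields the given readings in order."""
    readings = iter(ticks)
    return lambda: next(readings)


# Two stages: tokenize takes 0.2 s, retrieve takes 0.3 s (simulated).
timer = StageTimer(clock=fake_clock([0.0, 0.2, 0.2, 0.5]))
with timer.stage("tokenize"):
    pass  # real work would go here
with timer.stage("retrieve"):
    pass
stage_costs = timer.costs()
request_cost = sum(stage_costs.values())
```

The point of the shape is that `request_cost` and `stage_costs` come out of the same measurement, so every downstream view can be fed from one computation.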
Phase 39's job is the contract between them. Seven concrete rules below.
Rule 1: One source of truth per number
The cost of one request, in Euros, exists in exactly one place: Phase 34's emitter. Every downstream reader (Grafana panel, CpQU aggregator, audit log) reads from that emitter's output. No downstream reader recomputes the cost from raw wall time.
Why this matters: recomputation drifts. If Grafana computes duration_seconds * 0.0001 while the emitter computes duration_seconds * r_cpu(t) where r_cpu(t) is hardware-tier-aware, the numbers diverge. Users see "Grafana says €0.001, the audit log says €0.0008" and trust collapses.
Enforcement: the capstone's `docs/COSTS.md` lab (Phase 38 lab 03) is input-only for the dashboard; the dashboard does not invert it. The CpQU aggregator reads the `lynx_cost_eur_per_request` histogram series (its `_sum` and `_count`) directly from Prometheus, never from request logs.
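A small worked example makes the drift concrete. The tier names and rates below are invented so that the numbers reproduce the €0.001 versus €0.0008 mismatch described above; they are not the capstone's real rates.

```python
# Hypothetical per-tier rates in euros per CPU-second (illustrative values only).
R_CPU_BY_TIER = {"standard": 0.0001, "efficient": 0.00008}

FLAT_RATE = 0.0001  # the constant a dashboard query might hard-code


def emitter_cost(duration_s, tier):
    """Cost as the emitter computes it: the rate depends on the hardware tier."""
    return duration_s * R_CPU_BY_TIER[tier]


def recomputed_cost(duration_s):
    """Cost as a downstream reader would recompute it from raw wall time."""
    return duration_s * FLAT_RATE


duration_s = 10.0
authoritative = emitter_cost(duration_s, "efficient")  # about 0.0008 EUR
recomputed = recomputed_cost(duration_s)               # about 0.001 EUR
drift_pct = (recomputed - authoritative) / authoritative * 100  # about 25 %
```

Both functions are individually reasonable; the disagreement only appears once the request lands on a tier the flat rate does not describe, which is why the rule forbids the second function from existing at all.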
Rule 2: Span attributes for joinable cost
Each request emits one span (`http.request`) carrying `cost.eur` as an attribute. Tempo stores the span; Grafana queries Tempo for traces and renders cost-per-trace.
This avoids the alternative — emitting a Prometheus metric and a log line and a span attribute — which triples storage and creates a JOIN problem ("which line corresponds to which trace?"). Span attribute is the canonical carrier; everything else derives.
The emitter's exact action per request:

    from opentelemetry import trace

    # Attach the already-computed cost to the active request span.
    span = trace.get_current_span()
    span.set_attribute("cost.eur", request_cost)
    span.set_attribute("cost.stage.tokenize.eur", stage_costs["tokenize"])
    span.set_attribute("cost.stage.retrieve.eur", stage_costs["retrieve"])
    # ... one attribute per stage
    # Same number, second carrier: the Prometheus histogram for the time-series.
    cost_histogram.observe(request_cost)

The Prometheus histogram exists because traces are sampled (typically 1% in production; Rule 4 lists this demo's rates) and the cost time-series must be exact. Both views are populated from one computation.
Rule 3: Cost decomposition identity
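The Plan's §2 identity states that a request's total cost equals the sum of its per-stage costs, with untimed work accruing to an "other" stage.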
The capstone verifies this at runtime within each request. The emitter computes both sides; if they disagree by > 0.1%, the request logs a cost_identity_violation event and the panel flashes red.
This catches:
- A new stage added without registering its timing with the emitter (its time accrues to "other", so the identity holds only if "other" is bounded).
- A stage that double-counts (its time is included in two `with timer():` blocks).
- A stage that emits cost but doesn't appear in the stage list (orphan stage cost: total < sum).
The audit lives in tests/integration/test_cost_identity.py and runs on every PR.
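The runtime check described above can be sketched as a single comparison with a relative tolerance. This is a minimal illustration, not the contents of `tests/integration/test_cost_identity.py`; the function name, the `generate` stage, and the sample costs are assumptions made for the example. Only the 0.1% threshold comes from the text.

```python
TOLERANCE = 0.001  # 0.1 %, the threshold named in the text


def check_cost_identity(total_eur, stage_costs_eur, tolerance=TOLERANCE):
    """Compare the request total against the sum of per-stage costs.

    Returns (ok, relative_error). A zero total is only consistent with
    a zero stage sum.
    """
    stage_sum = sum(stage_costs_eur.values())
    if total_eur == 0:
        if stage_sum == 0:
            return True, 0.0
        return False, float("inf")
    rel_err = abs(total_eur - stage_sum) / abs(total_eur)
    return rel_err <= tolerance, rel_err


# A consistent request: stages add up to the total.
good_stages = {"tokenize": 0.0002, "retrieve": 0.0003, "generate": 0.0005}
good_ok, good_err = check_cost_identity(0.0010, good_stages)

# A double-counted stage: "retrieve" was timed by two nested blocks.
double_counted = dict(good_stages, retrieve=0.0006)
bad_ok, bad_err = check_cost_identity(0.0010, double_counted)
```

A relative rather than absolute tolerance keeps the check meaningful across requests whose costs differ by orders of magnitude.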
Rule 4: Sampling rates are explicit, not implicit
Three sampling rates live in the demo, and each one is named in docs/COSTS.md:
| Stream | Default sample rate | Effect on cost calculation |
|---|---|---|
| Prometheus scrape | 100% (every request) | Cost time-series is exact |
| Tempo trace ingest | 10% in CI, 100% in `just demo` | Trace-attached cost is an estimate (outside `just demo`) |
| Langfuse LLM-trace | 100% (when present) | Cost-per-token estimate is exact when Langfuse is up |
The Grafana dashboard's Cost-per-request panel reads from Prometheus (100%); the Cost-per-stage panel reads from Tempo spans (sample-rate aware) with an explicit "n requests sampled" label so the viewer knows the precision.
Failure mode: a viewer sees the per-stage panel and assumes 100% sampling, then complains that the per-stage sum doesn't equal the total. Phase 39 prevents this by always labeling sample rate on panels that aren't 100%.
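To show why the per-stage panel needs its label, the sketch below scales per-stage sums from sampled spans up to a population estimate and builds the "n requests sampled" annotation. Scaling by the inverse sample rate is one plausible reading of "sample-rate aware" and is an assumption here, as are the function name and the sample data.

```python
def per_stage_estimate(sampled_spans, sample_rate):
    """Scale per-stage cost sums from sampled spans to a population estimate.

    sampled_spans: list of dicts mapping stage name -> cost in euros.
    sample_rate: fraction of requests that were traced, in (0, 1].
    Returns (estimate_by_stage, panel_label).
    """
    if not 0 < sample_rate <= 1:
        raise ValueError("sample_rate must be in (0, 1]")
    totals = {}
    for span in sampled_spans:
        for stage, cost in span.items():
            totals[stage] = totals.get(stage, 0.0) + cost
    estimate = {stage: cost / sample_rate for stage, cost in totals.items()}
    label = f"{len(sampled_spans)} requests sampled ({sample_rate:.0%})"
    return estimate, label


# Two traced requests at the CI sampling rate of 10 %.
spans = [
    {"tokenize": 0.0001, "retrieve": 0.0003},
    {"tokenize": 0.0002, "retrieve": 0.0002},
]
estimate, label = per_stage_estimate(spans, sample_rate=0.10)
```

With only two spans behind it, the estimate can sit well away from the exact Prometheus total, which is the mismatch the label is meant to explain before a viewer reports it as a bug.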
Rule 5: CpQU is read-only in the demo
Phase 38's CpQU = €/(accuracy × user-value-weight). It's a weekly batch number, not a per-request one. The demo's dashboard renders the most recent CpQU value as a static text panel sourced from docs/COSTS.md. The demo does not recompute CpQU.
Why: CpQU requires accuracy from Phase 20's eval harness, which runs against the full eval set. Computing it live would mean running the eval set on every request — absurd. The static panel is the right shape.
The pitfall (Plan §5 #6): if the demo did recompute CpQU live, the dashboard would show different values during just demo vs the steady-state nightly job. Users wouldn't know which to believe.
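The formula itself is a one-liner, which is part of why recomputing it live is tempting. The sketch below, with invented input values, shows how strongly the result depends on the accuracy term that only the full eval run can supply.

```python
def cpqu(cost_eur, accuracy, user_value_weight):
    """Cost per Quality Unit: euros / (accuracy * user-value weight)."""
    if accuracy <= 0 or user_value_weight <= 0:
        raise ValueError("accuracy and user_value_weight must be positive")
    return cost_eur / (accuracy * user_value_weight)


# Same cost, two accuracy figures (illustrative numbers, not eval results).
weekly = cpqu(cost_eur=0.0004, accuracy=0.8, user_value_weight=1.0)      # about 0.0005
partial_eval = cpqu(cost_eur=0.0004, accuracy=0.5, user_value_weight=1.0)  # about 0.0008
```

An accuracy measured on a handful of live requests instead of the full eval set would move the displayed CpQU by tens of percent with no change in actual cost, which is the disagreement the static panel avoids.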
Rule 6: Cost panels populate within 60 seconds
The DoD §7 #4: every panel populates within 60 seconds of the first request. This is a stress on the cost path specifically because:
- Cost histogram needs at least one observation → first request must complete.
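- Prometheus scrape interval is 15 s in this stack → up to 15 s lag. (Prometheus's built-in default is 1m; 15 s is the value in its example config, so it has to be set explicitly.)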
- Grafana refresh on the dashboard is 5 s.
- Histogram-quantile needs enough samples for the bucket to be non-empty.
The lab 00 sequence boots the stack, sends a single warm-up request, then waits 60 s and asserts every panel has data. If any panel is "No Data" after 60 s, that panel's underlying contract is broken — either the metric label is wrong, the time range is too short, or the scrape didn't fire. The lab walks the audit step by step.
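The 60-second budget can be sanity-checked with simple arithmetic over the delays listed above. This is a simplified worst-case model, not the lab's audit script: the 2-second request duration is an invented figure, and `scrapes_needed=2` reflects that a `rate()` query needs two samples in its window.

```python
BUDGET_S = 60  # the DoD deadline from the text


def worst_case_panel_delay_s(request_s, scrape_interval_s, refresh_interval_s,
                             scrapes_needed=1):
    """Upper bound, in seconds, from first request start to a panel showing data.

    Simplified model: the request finishes, then up to `scrapes_needed` full
    scrape intervals pass, then up to one dashboard refresh interval.
    """
    return request_s + scrapes_needed * scrape_interval_s + refresh_interval_s


def fits_budget(delay_s, budget_s=BUDGET_S):
    return delay_s <= budget_s


# 15 s scrape, 5 s refresh, assumed 2 s warm-up request.
single_scrape = worst_case_panel_delay_s(2, 15, 5)                  # 22
rate_panel = worst_case_panel_delay_s(2, 15, 5, scrapes_needed=2)   # 37
# Same stack with a 60 s scrape interval instead.
slow_scrape = worst_case_panel_delay_s(2, 60, 5)                    # 67
```

Under these assumptions a 15 s scrape interval leaves comfortable headroom, while a 60 s interval misses the deadline even for panels that need only one scrape.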
Rule 7: Trace context survives the MCP sandbox boundary
For the security run-through (Lab 03), the MCP tool's subprocess must inherit traceparent. Mechanism (per Pitfall 5 of the Plan):
- The agent loop, before spawning the subprocess, reads the current span's W3C trace-context (`traceparent` and `tracestate` headers).
- It passes them as environment variables to the subprocess.
- The subprocess's first action is to re-establish the trace context: `tracer.start_span(..., parent=carrier_from_env())`.
- The subprocess's spans appear in Tempo as children of the agent's request span.
The dashboard's Trace orphan count panel tracks spans without a parent. The demo's invariant: orphan_spans == 0 after every demo run. A non-zero count surfaces immediately on the dashboard.
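The hand-off in the steps above can be sketched with the standard library alone. The `traceparent` layout (version, 32-hex trace id, 16-hex parent id, flags) follows the W3C Trace Context format; the environment variable names `TRACEPARENT` and `TRACESTATE` and all function names are assumptions made for this example, not the capstone's code.

```python
import os
import re

# W3C traceparent: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def format_traceparent(trace_id, span_id, sampled=True):
    """Agent side: build a version-00 traceparent value from integer ids."""
    flags = 1 if sampled else 0
    return f"00-{trace_id:032x}-{span_id:016x}-{flags:02x}"


def child_env(traceparent, tracestate="", base_env=None):
    """Environment for the subprocess, carrying the parent's trace context.

    The variable names TRACEPARENT / TRACESTATE are an assumed convention.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["TRACEPARENT"] = traceparent
    if tracestate:
        env["TRACESTATE"] = tracestate
    return env


def parse_traceparent(env):
    """Subprocess side: recover (trace_id, parent_span_id, sampled), or None."""
    match = _TRACEPARENT_RE.match(env.get("TRACEPARENT", ""))
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    if version == "ff" or int(trace_id, 16) == 0 or int(span_id, 16) == 0:
        return None  # invalid version or all-zero ids
    return int(trace_id, 16), int(span_id, 16), bool(int(flags, 16) & 1)
```

The agent would pass `env=child_env(...)` to `subprocess.run`. A `None` from `parse_traceparent` is exactly the case that produces an orphan span. With the OpenTelemetry Python API, `inject` and `extract` from `opentelemetry.propagate` handle this serialization on a dict carrier, and the extracted context is passed to `start_span` through its `context` argument.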
The dashboard itself
infra/grafana/dashboards/capstone.json has exactly these panels (the demo's complete observability picture):
- RED: Rate. Requests per second. From `rate(lynx_http_requests_total[1m])`.
- RED: Errors. Error rate. From `rate(lynx_http_requests_total{status=~"5.."}[1m])`.
- RED: Duration. p50, p95, p99 end-to-end. From `histogram_quantile(0.95, sum by (le) (rate(lynx_http_request_duration_seconds_bucket[1m])))`, with the quantile varied per line.
- Per-stage latency. Stacked bar; one row per stage; sample-rate annotated.
- Latency budget vs actual. Two stacked bars, side by side, per-stage.
- Cost per request. Histogram over the last hour, with p50 / p95 vertical lines.
- Cost decomposition. Per-stage cost as a pie chart; Rule 3 identity-check status.
- CpQU. Static panel from `docs/COSTS.md`; week-over-week delta.
- Trace orphan count. Single-stat; threshold ≤ 0.
- Cost identity violations. Single-stat; threshold ≤ 0.
Ten panels. Every panel has a contract; every contract is testable. The lab 00 audit script runs each panel's query against Prometheus / Tempo and asserts non-empty results.
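The audit described here reduces to a loop over panel queries. The sketch below is not the lab 00 script; `audit_panels` and the stubbed backend are hypothetical, and in a real run `run_query` would call the Prometheus or Tempo query API instead of a dictionary lookup.

```python
def audit_panels(panel_queries, run_query):
    """Run each panel's query and return the titles whose result is empty.

    panel_queries: dict mapping panel title -> query string.
    run_query: callable taking a query string and returning a list of results.
    """
    return [title for title, query in panel_queries.items()
            if not run_query(query)]


# Two of the dashboard's queries, as listed above.
PANEL_QUERIES = {
    "Rate": "rate(lynx_http_requests_total[1m])",
    "Errors": 'rate(lynx_http_requests_total{status=~"5.."}[1m])',
}

# Stub backend: pretend the error query returned no series.
FAKE_RESULTS = {
    "rate(lynx_http_requests_total[1m])": [{"value": 0.5}],
    'rate(lynx_http_requests_total{status=~"5.."}[1m])': [],
}

empty_panels = audit_panels(PANEL_QUERIES, lambda q: FAKE_RESULTS.get(q, []))
```

Keeping the query runner injectable lets the same loop be exercised in a unit test with a stub and in the lab against the live stack.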
How CpQU connects to Phase 39's narrative
Phase 38 computes CpQU. Phase 39 displays it but, more importantly, interprets it for the demo viewer: "this run cost €0.00043 in the 90 seconds; the weekly CpQU is €0.000NNN per quality unit; the visitor sees that the curriculum produced not just a working artifact but a measured one."
The interpretation matters because the curriculum's final lesson is "ML systems are measured, not vibed." The single CpQU number, displayed on the dashboard during the demo's last 5 seconds, is the curriculum's signature.
What this theory does NOT cover
- How Prometheus actually scrapes. Phase 34 theory.
- How OpenTelemetry batches spans. Phase 34 theory.
- How Grafana dashboards are built. Lab 00 walks the click-export-commit workflow.
- The math of CpQU. Phase 38 theory 04.
- Streaming vs batch evaluation. Phase 20 theory; here we just consume Phase 20's accuracy.
- Cost models for GPU. Phase 35; the demo is CPU.
Next: theory/04-security-and-threat-model-closeout.md — which three threat rows the demo replays and why.