
Theory 03 — Cost and observability, stitched¶

Three separate pieces (Phase 34's per-request cost emitter, Phase 38's CpQU table, and the Prometheus/Grafana/Tempo stack) come together here in a single dashboard. The central rule: one source of truth per number. Cost is measured in one place; everything else reads it. If two components compute the same thing, the dashboard will show 2× the real value and nobody will notice until it is too late.

Why stitching is hard¶

Each observability piece, built in isolation, works. Together they over-count, under-count, or double-publish — unless the boundaries are designed.

The three pieces:

Phase 34's cost emitter. A FastAPI middleware that wraps each request, measures wall time per stage, multiplies by r_cpu (€/s on this hardware), and emits a Prometheus histogram (lynx_cost_eur_per_request) plus a span attribute on the request trace.
Phase 38's CpQU table. A weekly batch job that aggregates quality-adjusted cost (cost ÷ accuracy ÷ user value weight) into a markdown table at docs/COSTS.md. CpQU = Cost per Quality Unit.
The observability stack. Prometheus scrapes lynx_cost_eur_per_request; Grafana visualizes; Tempo stores traces; the single dashboard infra/grafana/dashboards/capstone.json joins everything.

Phase 39's job is the contract between them. Seven concrete rules below.

Rule 1: One source of truth per number¶

The cost of one request, in Euros, exists in exactly one place: Phase 34's emitter. Every downstream reader (Grafana panel, CpQU aggregator, audit log) reads from that emitter's output. No downstream reader recomputes the cost from raw wall time.

Why this matters: recomputation drifts. If Grafana computes duration_seconds * 0.0001 while the emitter computes duration_seconds * r_cpu(t) where r_cpu(t) is hardware-tier-aware, the numbers diverge. Users see "Grafana says €0.001, the audit log says €0.0008" and trust collapses.
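A minimal sketch of that drift. The tier names and the tier rates are hypothetical, chosen only so the numbers match the €0.001 versus €0.0008 example above; the flat 0.0001 factor is the one quoted in the text.

```python
# Hypothetical tier-aware rate table (EUR per CPU-second); values are illustrative only.
R_CPU_BY_TIER = {"standard": 0.0001, "spot": 0.00008}


def emitter_cost(duration_seconds: float, tier: str) -> float:
    """Cost as the emitter would compute it: duration times the tier's rate."""
    return duration_seconds * R_CPU_BY_TIER[tier]


def dashboard_recomputed_cost(duration_seconds: float) -> float:
    """Cost as a downstream reader would get it by hard-coding a flat rate."""
    return duration_seconds * 0.0001


duration = 10.0
canonical = emitter_cost(duration, "spot")
recomputed = dashboard_recomputed_cost(duration)
drift = (recomputed - canonical) / canonical
print(f"emitter: {canonical:.4f} EUR, recomputed: {recomputed:.4f} EUR, drift: {drift:.0%}")
```

The two functions agree on the "standard" tier and silently disagree on every other tier, which is why the rule forbids the second function from existing at all.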

Enforcement: the capstone's docs/COSTS.md (produced by Phase 38 lab 03) is input-only for the dashboard; the dashboard displays it and never back-derives or recomputes its values. The CpQU aggregator reads the lynx_cost_eur_per_request histogram (its _sum and _count series) directly from Prometheus, never from request logs.

Rule 2: Span attributes for joinable cost¶

Each request emits one span (http.request) carrying cost.eur as an attribute. Tempo stores the span; Grafana queries Tempo for traces and renders cost-per-trace.

This avoids the alternative — emitting a Prometheus metric and a log line and a span attribute — which triples storage and creates a JOIN problem ("which line corresponds to which trace?"). Span attribute is the canonical carrier; everything else derives.

The emitter's exact action per request:

from opentelemetry import trace

# request_cost and stage_costs come from the single per-request cost computation.
span = trace.get_current_span()
span.set_attribute("cost.eur", request_cost)
span.set_attribute("cost.stage.tokenize.eur", stage_costs["tokenize"])
span.set_attribute("cost.stage.retrieve.eur", stage_costs["retrieve"])
# ... one attribute per stage
cost_histogram.observe(request_cost)  # Prometheus histogram: exact cost time-series

The Prometheus histogram exists because traces are sampled (typically 1%) and the cost time-series must be exact. Both views are populated from one computation.
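To make "one computation, two views" concrete, here is a stdlib-only sketch. `StageTimer`, `cost_views`, and the `R_CPU` value are hypothetical names and numbers introduced for illustration; the real emitter would hand the returned values to the span and the histogram rather than to an assert.

```python
import time
from contextlib import contextmanager

R_CPU = 0.0001  # EUR per second; placeholder value, not the real hardware rate


class StageTimer:
    """Collects wall time per stage for one request."""

    def __init__(self):
        self.seconds = {}

    @contextmanager
    def timer(self, stage):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            # Accumulate, so a stage entered twice is still reported under one key.
            self.seconds[stage] = self.seconds.get(stage, 0.0) + elapsed


def cost_views(stage_seconds, r_cpu=R_CPU):
    """Compute cost once; return (span attributes, histogram observation)."""
    stage_costs = {stage: secs * r_cpu for stage, secs in stage_seconds.items()}
    request_cost = sum(stage_costs.values())
    attributes = {"cost.eur": request_cost}
    for stage, cost in stage_costs.items():
        attributes[f"cost.stage.{stage}.eur"] = cost
    return attributes, request_cost


timers = StageTimer()
with timers.timer("tokenize"):
    time.sleep(0.01)
with timers.timer("retrieve"):
    time.sleep(0.02)

attributes, observation = cost_views(timers.seconds)
# Both sinks receive values derived from the same stage_costs dict.
assert attributes["cost.eur"] == observation
```

Because the span attributes and the histogram observation come out of the same function call, there is no code path on which they can disagree.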

Rule 3: Cost decomposition identity¶

The Plan's §2 identity:

\[\text{cost}_{\text{req}} = \sum_{\text{stages}} \text{cost}_{\text{stage}} = \sum_{\text{stages}} T_{\text{stage}} \cdot r_{\text{cpu}}\]

The capstone verifies this at runtime within each request. The emitter computes both sides; if they disagree by > 0.1%, the request logs a cost_identity_violation event and the panel flashes red.

This catches:

A new stage added without registering its timing with the emitter (its time accrues to "other" — the identity holds only if "other" is bounded).
A stage that double-counts (its time is included in two with timer(): blocks).
A stage that emits cost but doesn't appear in the stage list (orphan stage cost — total < sum).

The audit lives in tests/integration/test_cost_identity.py and runs on every PR.
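The runtime check can be sketched as follows. This is not the contents of test_cost_identity.py; `check_cost_identity` and the shape of the event dict are assumptions, and `request_cost` is taken to be computed independently of the per-stage values (for example from total wall time times r_cpu). Only the 0.1% tolerance and the event name come from the text.

```python
import math

TOLERANCE = 0.001  # 0.1% relative disagreement, as stated in the rule


def check_cost_identity(request_cost, stage_costs, tolerance=TOLERANCE):
    """Return None if the identity holds, else a violation event dict."""
    stage_sum = math.fsum(stage_costs.values())
    if math.isclose(request_cost, stage_sum, rel_tol=tolerance, abs_tol=0.0):
        return None
    return {
        "event": "cost_identity_violation",
        "request_cost": request_cost,
        "stage_sum": stage_sum,
        "relative_error": abs(request_cost - stage_sum) / max(abs(request_cost), abs(stage_sum)),
    }


# Identity holds: the stages account for the whole request.
ok = check_cost_identity(0.0010, {"tokenize": 0.0004, "retrieve": 0.0006})

# A double-counted stage pushes the stage sum above the request total.
double_counted = check_cost_identity(0.0010, {"tokenize": 0.0004, "retrieve": 0.0008})
```

The relative error in the event makes it easier to tell a rounding problem (just over 0.1%) from a structural one such as a stage wrapped in two timers.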

Rule 4: Sampling rates are explicit, not implicit¶

Three sampling rates live in the demo, and each one is named in docs/COSTS.md:

| Stream | Default sample rate | Effect on cost calculation |
|---|---|---|
| Prometheus scrape | 100% (every request) | Cost time-series is exact |
| Tempo trace ingest | 10% in CI, 100% in `just demo` | Trace-attached cost is an estimate (off `just demo`) |
| Langfuse LLM-trace | 100% (when present) | Cost-per-token estimate is exact when Langfuse is up |

The Grafana dashboard's Cost-per-request panel reads from Prometheus (100%); the Cost-per-stage panel reads from Tempo spans (sample-rate aware) with an explicit "n requests sampled" label so the viewer knows the precision.

Failure mode: a viewer sees the per-stage panel and assumes 100% sampling, then complains that the per-stage sum doesn't equal the total. Phase 39 prevents this by always labeling sample rate on panels that aren't 100%.
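A small sketch of how a sampled panel might carry its own precision label. The function name, the input shape (one dict of stage costs per sampled span), and the label wording are assumptions for illustration; the actual panel is built in Grafana, not in Python.

```python
def per_stage_panel(sampled_stage_costs, sample_rate):
    """Summarise sampled spans into per-stage mean cost plus a precision label.

    sampled_stage_costs: list of {stage: cost_eur} dicts, one per sampled span.
    sample_rate: fraction of requests whose traces were ingested (0 < rate <= 1).
    """
    n = len(sampled_stage_costs)
    totals = {}
    for span_costs in sampled_stage_costs:
        for stage, cost in span_costs.items():
            totals[stage] = totals.get(stage, 0.0) + cost
    # A stage missing from a span counts as zero cost for that span.
    means = {stage: total / n for stage, total in totals.items()} if n else {}
    label = f"{n} requests sampled ({sample_rate:.0%} of traffic)"
    return means, label


spans = [
    {"tokenize": 0.0002, "retrieve": 0.0006},
    {"tokenize": 0.0004, "retrieve": 0.0002},
]
means, label = per_stage_panel(spans, sample_rate=0.10)
print(label)
```

With the label attached, a viewer comparing this panel with the Prometheus-backed total can see at once that the two are computed over different populations.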

Rule 5: CpQU is read-only in the demo¶

Phase 38's CpQU = €/(accuracy × user-value-weight). It's a weekly batch number, not a per-request one. The demo's dashboard renders the most recent CpQU value as a static text panel sourced from docs/COSTS.md. The demo does not recompute CpQU.

Why: CpQU requires accuracy from Phase 20's eval harness, which runs against the full eval set. Computing it live would mean running the eval set on every request — absurd. The static panel is the right shape.

The pitfall (Plan §5 #6): if the demo did recompute CpQU live, the dashboard would show different values during just demo vs the steady-state nightly job. Users wouldn't know which to believe.
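For orientation only, the formula at the top of this rule reduces to one division. The input numbers below are invented; the real accuracy comes from Phase 20's eval harness and the real weights from Phase 38.

```python
def cpqu(cost_eur: float, accuracy: float, user_value_weight: float) -> float:
    """Cost per Quality Unit: cost divided by (accuracy x user-value weight)."""
    if accuracy <= 0 or user_value_weight <= 0:
        raise ValueError("accuracy and user_value_weight must be positive")
    return cost_eur / (accuracy * user_value_weight)


# Illustrative inputs, not measured values.
weekly = cpqu(cost_eur=0.0008, accuracy=0.8, user_value_weight=1.0)
print(f"CpQU = {weekly:.6f} EUR per quality unit")
```

The accuracy argument is the reason the value cannot be refreshed per request: it only exists after a full eval run.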

Rule 6: Cost panels populate within 60 seconds¶

The DoD §7 #4: every panel populates within 60 seconds of the first request. This stresses the cost path in particular, because several delays stack up:

Cost histogram needs at least one observation → first request must complete.
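Prometheus scrape interval is 15 s in this stack's default configuration → up to 15 s lag. (Prometheus's own built-in default is 1 m, which would use up the whole budget.)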
Grafana refresh on the dashboard is 5 s.
Histogram-quantile needs enough samples for the bucket to be non-empty.

The lab 00 sequence boots the stack, sends a single warm-up request, then waits 60 s and asserts every panel has data. If any panel is "No Data" after 60 s, that panel's underlying contract is broken — either the metric label is wrong, the time range is too short, or the scrape didn't fire. The lab walks the audit step by step.
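The delays listed above add up to a simple worst-case bound, sketched here. The 15 s and 5 s figures come from the list; the 2 s first-request duration is a made-up placeholder, and `scrapes_needed` is an assumption added because PromQL's rate() needs at least two samples in its window before it returns anything.

```python
BUDGET_S = 60.0  # DoD: every panel populated within 60 s of the first request


def worst_case_lag_s(first_request_s, scrape_interval_s=15.0, refresh_s=5.0, scrapes_needed=1):
    """Upper bound on the time from first request to a populated panel."""
    return first_request_s + scrapes_needed * scrape_interval_s + refresh_s


# A panel that reads a raw series needs one scrape.
raw_panel = worst_case_lag_s(first_request_s=2.0)

# A panel built on rate(...) needs two scrapes before it shows a value.
rate_panel = worst_case_lag_s(first_request_s=2.0, scrapes_needed=2)

# With a 60 s scrape interval the same rate-based panel misses the budget.
slow_scrape = worst_case_lag_s(first_request_s=2.0, scrape_interval_s=60.0, scrapes_needed=2)

for name, lag in [("raw", raw_panel), ("rate", rate_panel), ("slow scrape", slow_scrape)]:
    status = "ok" if lag <= BUDGET_S else "OVER BUDGET"
    print(f"{name}: {lag:.0f} s ({status})")
```

The bound shows where the slack is: the scrape interval dominates, so a misconfigured interval is the first thing to check when a panel is still empty at 60 s.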

Rule 7: Trace context survives the MCP sandbox boundary¶

For the security run-through (Lab 03), the MCP tool's subprocess must inherit traceparent. Mechanism (per Pitfall 5 of the Plan):

The agent loop, before spawning the subprocess, reads the current span's W3C trace-context (traceparent and tracestate headers).
It passes them as environment variables to the subprocess.
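The subprocess's first action is to re-establish the trace context: tracer.start_span(..., context=extract(carrier_from_env())). In OpenTelemetry Python the parent is supplied through the context argument, built by extract() from the carrier.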
The subprocess's spans appear in Tempo as children of the agent's request span.

The single dashboard's Trace orphan count panel tracks spans without a parent. The demo's invariant: orphan_spans == 0 after every demo run. A non-zero count surfaces immediately on the dashboard.
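The four-step mechanism above can be sketched with the standard library alone. This is a stand-in, not the capstone's code: the `TRACEPARENT` environment variable name, the helper names, and the child script are assumptions, and a real implementation would use the tracing SDK's propagator instead of hand-formatting the header. The header layout (version 00, 32-hex trace id, 16-hex parent id, 2-hex flags) follows W3C Trace Context.

```python
import os
import re
import secrets
import subprocess
import sys

# W3C traceparent, version 00: version-traceid-parentid-flags, lowercase hex.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def format_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(value):
    """Return (trace_id, parent_span_id), or None if the header is malformed."""
    match = TRACEPARENT_RE.match(value or "")
    if match is None:
        return None
    trace_id, parent_id, _flags = match.groups()
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None  # all-zero ids are invalid
    return trace_id, parent_id


# Stand-in for the MCP tool: reports which trace and parent span it would attach to.
CHILD_SOURCE = (
    "import os\n"
    "parts = os.environ.get('TRACEPARENT', '').split('-')\n"
    "print(parts[1], parts[2]) if len(parts) == 4 else print('orphan')\n"
)


def spawn_with_trace_context(trace_id: str, span_id: str) -> str:
    """Run the child with the agent's trace context passed through the environment."""
    env = dict(os.environ, TRACEPARENT=format_traceparent(trace_id, span_id))
    result = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


agent_trace_id = secrets.token_hex(16)  # 32 hex characters
agent_span_id = secrets.token_hex(8)    # 16 hex characters
child_report = spawn_with_trace_context(agent_trace_id, agent_span_id)
print(child_report)
```

If the environment variable is dropped anywhere on the way to the subprocess, the child prints "orphan", which is exactly the condition the orphan-span panel is meant to surface.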

The dashboard itself¶

infra/grafana/dashboards/capstone.json has exactly these panels (the demo's complete observability picture):

RED — Rate. Requests per second. From rate(lynx_http_requests_total[1m]).
RED — Errors. Error rate. From rate(lynx_http_requests_total{status=~"5.."}[1m]).
RED — Duration. p50, p95, p99 end-to-end. From histogram_quantile(0.95, sum by (le) (rate(lynx_http_request_duration_seconds_bucket[1m]))), and likewise with 0.50 and 0.99.
Per-stage latency. Stacked bar; one row per stage; sample-rate annotated.
Latency budget vs actual. Two stacked bars, side by side, per-stage.
Cost per request. Histogram over the last hour, with p50 / p95 vertical lines.
Cost decomposition. Per-stage cost as a pie chart; Rule 3 identity-check status.
CpQU. Static panel from docs/COSTS.md; week-over-week delta.
Trace orphan count. Single-stat; must be 0, any value above 0 is a failure.
Cost identity violations. Single-stat; must be 0, any value above 0 is a failure.

Ten panels. Every panel has a contract; every contract is testable. The lab 00 audit script runs each panel's query against Prometheus / Tempo and asserts non-empty results.
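A sketch of what the Prometheus half of such an audit could look like. It is not the lab 00 script: the function names and the `PANEL_QUERIES` subset are assumptions, and the Tempo-backed panels are left out. The `/api/v1/query` endpoint and the `status` / `data.result` response fields are part of the Prometheus HTTP API.

```python
import json
import urllib.parse
import urllib.request


def instant_query_url(base_url: str, expr: str) -> str:
    """Build a Prometheus instant-query URL for a PromQL expression."""
    query = urllib.parse.urlencode({"query": expr})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def fetch_json(url: str) -> dict:
    """Default fetcher: a plain HTTP GET returning the decoded JSON body."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)


def has_data(payload: dict) -> bool:
    """True if the query succeeded and returned at least one series or value."""
    if payload.get("status") != "success":
        return False
    return len(payload.get("data", {}).get("result", [])) > 0


def audit_panels(base_url, panel_queries, fetch=fetch_json):
    """Return the names of panels whose query came back empty."""
    return [
        name
        for name, expr in panel_queries.items()
        if not has_data(fetch(instant_query_url(base_url, expr)))
    ]


# Two of the ten panels, with the queries quoted in the list above.
PANEL_QUERIES = {
    "RED - Rate": "rate(lynx_http_requests_total[1m])",
    "RED - Errors": 'rate(lynx_http_requests_total{status=~"5.."}[1m])',
}

# Against a running stack (base URL is an assumption):
# missing = audit_panels("http://localhost:9090", PANEL_QUERIES)
# assert missing == [], f"panels with no data: {missing}"
```

Passing the fetcher in as a parameter keeps the audit logic testable without a live Prometheus, which matters if the same check is meant to run on every PR.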

How CpQU connects to Phase 39's narrative¶

Phase 38 computes CpQU. Phase 39 displays it but, more importantly, interprets it for the demo viewer: "this run cost €0.00043 in the 90 seconds; the weekly CpQU is €0.000NNN per quality unit; the visitor sees that the curriculum produced not just a working artifact but a measured one."

The interpretation matters because the curriculum's final lesson is "ML systems are measured, not vibed." The single CpQU number, displayed on the dashboard during the demo's last 5 seconds, is the curriculum's signature.

What this theory does NOT cover¶

How Prometheus actually scrapes. Phase 34 theory.
How OpenTelemetry batches spans. Phase 34 theory.
How Grafana dashboards are built. Lab 00 walks the click-export-commit workflow.
The math of CpQU. Phase 38 theory 04.
Streaming vs batch evaluation. Phase 20 theory; here we just consume Phase 20's accuracy.
Cost models for GPU. Phase 35; the demo is CPU.

Next: theory/04-security-and-threat-model-closeout.md — which three threat rows the demo replays and why.