Theory 03 — Cost and observability, stitched

Three separate pieces (Phase 34's per-request cost emitter, Phase 38's CpQU table, and the Prometheus/Grafana/Tempo stack) come together here in a single dashboard. The central rule: one source of truth per number. Cost is measured in one place; everything else reads it. If two components compute the same thing, the dashboard will show 2× the real value and nobody will notice until it is too late.

Why stitching is hard

Each observability piece, built in isolation, works. Together they over-count, under-count, or double-publish — unless the boundaries are designed.

The three pieces:

  1. Phase 34's cost emitter. A FastAPI middleware that wraps each request, measures wall time per stage, multiplies by r_cpu (€/s on this hardware), and emits a Prometheus histogram (lynx_cost_eur_per_request) plus a span attribute on the request trace.
  2. Phase 38's CpQU table. A weekly batch job that aggregates quality-adjusted cost (cost ÷ accuracy ÷ user value weight) into a markdown table at docs/COSTS.md. CpQU = Cost per Quality Unit.
  3. The observability stack. Prometheus scrapes lynx_cost_eur_per_request; Grafana visualizes; Tempo stores traces; the single dashboard infra/grafana/dashboards/capstone.json joins everything.

Phase 39's job is the contract between them. Seven concrete rules below.

Rule 1: One source of truth per number

The cost of one request, in Euros, exists in exactly one place: Phase 34's emitter. Every downstream reader (Grafana panel, CpQU aggregator, audit log) reads from that emitter's output. No downstream reader recomputes the cost from raw wall time.

Why this matters: recomputation drifts. If Grafana computes duration_seconds * 0.0001 while the emitter computes duration_seconds * r_cpu(t) where r_cpu(t) is hardware-tier-aware, the numbers diverge. Users see "Grafana says €0.001, the audit log says €0.0008" and trust collapses.
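The drift can be reproduced with a few lines of arithmetic. This is a minimal sketch, not the capstone's emitter: the tier names and the €0.00008/s rate are hypothetical values chosen so the output matches the €0.001 versus €0.0008 example above; only the flat 0.0001 factor comes from the text.

```python
from decimal import Decimal

# Hard-coded factor a downstream reader might use (from the text above).
FLAT_RATE_EUR_PER_S = Decimal("0.0001")

# Hypothetical tier-aware rates; the real r_cpu values live in the emitter.
TIER_RATES_EUR_PER_S = {
    "standard": Decimal("0.0001"),
    "spot": Decimal("0.00008"),
}


def emitter_cost(duration_s: Decimal, tier: str) -> Decimal:
    """Cost as the emitter computes it, with a hardware-tier-aware rate."""
    return duration_s * TIER_RATES_EUR_PER_S[tier]


def recomputed_cost(duration_s: Decimal) -> Decimal:
    """Cost as a downstream reader would recompute it from raw wall time."""
    return duration_s * FLAT_RATE_EUR_PER_S


duration = Decimal("10")
drift = recomputed_cost(duration) - emitter_cost(duration, "spot")
print(f"recomputed={recomputed_cost(duration)} emitter={emitter_cost(duration, 'spot')} drift={drift}")
```

For a 10-second request on the hypothetical "spot" tier, the two readers disagree by 20% even though both started from the same wall time. Reading the emitter's output removes the second rate table entirely.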

Enforcement: docs/COSTS.md (produced in Phase 38 lab 03) is a read-only input to the dashboard; the dashboard never re-derives or writes back its numbers. The CpQU aggregator reads the emitter's histogram series (lynx_cost_eur_per_request_sum and lynx_cost_eur_per_request_count) directly from Prometheus, never from request logs.

Rule 2: Span attributes for joinable cost

Each request emits one span (http.request) carrying cost.eur as an attribute. Tempo stores the span; Grafana queries Tempo for traces and renders cost-per-trace.

This avoids the alternative of independently emitting a Prometheus metric, a log line, and a span attribute, which triples storage and creates a JOIN problem ("which line corresponds to which trace?"). The span attribute is the canonical carrier; everything else derives from the same computation.

The emitter's exact action per request:

from opentelemetry import trace

# request_cost and stage_costs come from the emitter's single per-request computation.
span = trace.get_current_span()
span.set_attribute("cost.eur", request_cost)
span.set_attribute("cost.stage.tokenize.eur", stage_costs["tokenize"])
span.set_attribute("cost.stage.retrieve.eur", stage_costs["retrieve"])
# ... one attribute per stage
cost_histogram.observe(request_cost)   # Prometheus histogram, for the exact time-series

The Prometheus histogram exists because traces can be sampled (Rule 4: 10% in CI) while the cost time-series must be exact. Both views are populated from one computation.
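The "one computation, two carriers" shape can be sketched without any tracing library. The class and function names below (RequestCost, publish) and the default rate are illustrative assumptions, not the Phase 34 emitter's actual API; set_attribute and observe stand in for the span and histogram methods used above.

```python
import time
from contextlib import contextmanager

R_CPU_EUR_PER_S = 0.0001  # hypothetical rate; the real value is hardware-specific


class RequestCost:
    """Accumulates wall time per stage for one request (illustrative only)."""

    def __init__(self, r_cpu=R_CPU_EUR_PER_S, clock=time.perf_counter):
        self.r_cpu = r_cpu
        self.clock = clock
        self.stage_seconds = {}

    @contextmanager
    def timer(self, stage):
        """Time one stage; repeated entries for the same stage add up."""
        start = self.clock()
        try:
            yield
        finally:
            elapsed = self.clock() - start
            self.stage_seconds[stage] = self.stage_seconds.get(stage, 0.0) + elapsed

    def stage_costs(self):
        """Per-stage cost in EUR: T_stage * r_cpu."""
        return {stage: secs * self.r_cpu for stage, secs in self.stage_seconds.items()}


def publish(cost, set_attribute, observe):
    """Compute the cost once, then write the same numbers to both carriers."""
    stage_costs = cost.stage_costs()
    total = sum(stage_costs.values())
    set_attribute("cost.eur", total)
    for stage, eur in stage_costs.items():
        set_attribute(f"cost.stage.{stage}.eur", eur)
    observe(total)  # the histogram sees exactly the value the span carries
    return total
```

Because publish takes the two sinks as plain callables, a test can pass a dict's `__setitem__` and a list's `append` and assert that the span total and the histogram observation are the same object-level value.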

Rule 3: Cost decomposition identity

The Plan's §2 identity:

\[\text{cost}_{\text{req}} = \sum_{\text{stages}} \text{cost}_{\text{stage}} = \sum_{\text{stages}} T_{\text{stage}} \cdot r_{\text{cpu}}\]

The capstone verifies this at runtime within each request. The emitter computes both sides; if they disagree by > 0.1%, the request logs a cost_identity_violation event and the panel flashes red.

This catches:

  • A new stage added without registering its timing with the emitter (its time accrues to "other" — the identity holds only if "other" is bounded).
  • A stage that double-counts (its time is included in two with timer(): blocks).
  • A stage that emits cost but doesn't appear in the stage list (orphan stage cost — total < sum).

The audit lives in tests/integration/test_cost_identity.py and runs on every PR.
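A minimal sketch of the runtime check, assuming the emitter hands over the request total and a stage-to-cost mapping. The function name and signature are hypothetical; only the 0.1% tolerance comes from the rule above.

```python
import math


def cost_identity_holds(request_cost, stage_costs, rel_tol=0.001):
    """Return True when the request total matches the sum of stage costs.

    rel_tol=0.001 is the 0.1% tolerance from Rule 3. The difference is
    measured relative to the request total.
    """
    stage_sum = math.fsum(stage_costs.values())
    if request_cost == 0:
        # A zero-cost request is only consistent with zero-cost stages.
        return stage_sum == 0
    return abs(request_cost - stage_sum) / abs(request_cost) <= rel_tol


# Consistent request: the stages add up to the total.
ok = cost_identity_holds(0.0005, {"tokenize": 0.0002, "retrieve": 0.0003})

# Unregistered stage: the total includes time no stage accounts for.
missing = cost_identity_holds(0.0005, {"tokenize": 0.0002})

# Double-counted stage: the same time appears under two timers.
doubled = cost_identity_holds(
    0.0005, {"tokenize": 0.0002, "retrieve": 0.0003, "retrieve_again": 0.0003}
)
print(ok, missing, doubled)
```

A caller would log the cost_identity_violation event whenever this returns False.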

Rule 4: Sampling rates are explicit, not implicit

Three sampling rates live in the demo, and each one is named in docs/COSTS.md:

| Stream | Default sample rate | Effect on cost calculation |
|---|---|---|
| Prometheus scrape | 100% (every request) | Cost time-series is exact |
| Tempo trace ingest | 10% in CI, 100% in just demo | Trace-attached cost is an estimate (except under just demo) |
| Langfuse LLM-trace | 100% (when present) | Cost-per-token estimate is exact when Langfuse is up |

The Grafana dashboard's Cost-per-request panel reads from Prometheus (100%); the Cost-per-stage panel reads from Tempo spans (sample-rate aware) with an explicit "n requests sampled" label so the viewer knows the precision.

Failure mode: a viewer sees the per-stage panel and assumes 100% sampling, then complains that the per-stage sum doesn't equal the total. Phase 39 prevents this by always labeling sample rate on panels that aren't 100%.
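One way to make the sample rate visible is to compute the panel's label together with its numbers. The sketch below assumes uniform random sampling, so that dividing sampled totals by the sample rate gives an unbiased estimate; the function name and the label format are hypothetical.

```python
def per_stage_panel(sampled_stage_costs, sample_rate):
    """Estimate total per-stage cost from sampled traces and build a panel label.

    sampled_stage_costs: one dict per sampled trace, mapping stage -> cost in EUR.
    sample_rate: fraction of requests whose traces were ingested, in (0, 1].
    """
    if not 0 < sample_rate <= 1:
        raise ValueError("sample_rate must be in (0, 1]")
    totals = {}
    for costs in sampled_stage_costs:
        for stage, eur in costs.items():
            totals[stage] = totals.get(stage, 0.0) + eur
    # Scale up by 1 / sample_rate; exact only when sample_rate == 1.
    estimates = {stage: total / sample_rate for stage, total in totals.items()}
    label = f"n={len(sampled_stage_costs)} requests sampled ({sample_rate:.0%})"
    return estimates, label


estimates, label = per_stage_panel(
    [{"tokenize": 0.0002, "retrieve": 0.0003}, {"tokenize": 0.0002}],
    sample_rate=0.1,
)
print(label, estimates)
```

With only two sampled traces the estimate is noisy, which is exactly what the label is meant to tell the viewer.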

Rule 5: CpQU is read-only in the demo

Phase 38's CpQU = €/(accuracy × user-value-weight). It's a weekly batch number, not a per-request one. The demo's dashboard renders the most recent CpQU value as a static text panel sourced from docs/COSTS.md. The demo does not recompute CpQU.

Why: CpQU requires accuracy from Phase 20's eval harness, which runs against the full eval set. Computing it live would mean running the eval set on every request — absurd. The static panel is the right shape.

The pitfall (Plan §5 #6): if the demo did recompute CpQU live, the dashboard would show different values during just demo vs the steady-state nightly job. Users wouldn't know which to believe.

Rule 6: Cost panels populate within 60 seconds

DoD §7 #4 requires every panel to populate within 60 seconds of the first request. This constraint stresses the cost path in particular because:

  • Cost histogram needs at least one observation → first request must complete.
  • Prometheus scrape interval is 15 s by default → up to 15 s lag.
  • Grafana refresh on the dashboard is 5 s.
  • histogram_quantile needs enough samples for the buckets to be non-empty.

The lab 00 sequence boots the stack, sends a single warm-up request, then waits 60 s and asserts every panel has data. If any panel is "No Data" after 60 s, that panel's underlying contract is broken — either the metric label is wrong, the time range is too short, or the scrape didn't fire. The lab walks the audit step by step.
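The 60-second assertion can be sketched as a polling loop. This is an assumption-laden variant: the lab as described waits a fixed 60 s and then checks, whereas this version polls so a healthy stack passes early. run_query is a hypothetical callable that returns the (possibly empty) result of one panel query; the clock and sleep parameters are injectable only to keep the sketch testable.

```python
import time


def wait_for_panels(panel_queries, run_query, timeout_s=60.0, poll_s=5.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll each panel's query until it returns data or the deadline passes.

    panel_queries: dict mapping panel name -> query string.
    Returns the sorted names of panels that still show "No Data".
    """
    deadline = clock() + timeout_s
    pending = dict(panel_queries)
    while pending:
        for name in list(pending):
            if run_query(pending[name]):  # non-empty result means populated
                del pending[name]
        if not pending or clock() >= deadline:
            break
        sleep(poll_s)
    return sorted(pending)
```

An audit script would call this once after the warm-up request and fail if the returned list is non-empty, naming the panels whose contract is broken.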

Rule 7: Trace context survives the MCP sandbox boundary

For the security run-through (Lab 03), the MCP tool's subprocess must inherit traceparent. Mechanism (per Pitfall 5 of the Plan):

  1. The agent loop, before spawning the subprocess, reads the current span's W3C trace-context (traceparent and tracestate headers).
  2. It passes them as environment variables to the subprocess.
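  3. The subprocess's first action is to re-establish the trace context: tracer.start_span(..., context=extract(carrier_from_env())), where extract is the OpenTelemetry propagation function that turns a header carrier into a context.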
  4. The subprocess's spans appear in Tempo as children of the agent's request span.

The dashboard's Trace orphan count panel tracks spans without a parent. The demo's invariant: orphan_spans == 0 after every demo run. A non-zero count surfaces immediately on the dashboard.
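Steps 1 to 3 can be illustrated with the standard library alone. The sketch below formats and parses a W3C traceparent value (version, 32-hex trace id, 16-hex parent span id, 2-hex flags) and passes it through environment variables. The variable names TRACEPARENT and TRACESTATE and both function names are assumptions for illustration; the capstone may use OpenTelemetry's propagators instead.

```python
import os
import re

# W3C trace-context: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def child_env(trace_id, span_id, sampled, tracestate="", base_env=None):
    """Build the subprocess environment with the parent's trace context."""
    env = dict(os.environ if base_env is None else base_env)
    flags = "01" if sampled else "00"
    env["TRACEPARENT"] = f"00-{trace_id}-{span_id}-{flags}"
    if tracestate:
        env["TRACESTATE"] = tracestate
    return env


def parent_from_env(env=None):
    """Recover the parent context in the subprocess, or None if absent/invalid."""
    env = os.environ if env is None else env
    match = _TRACEPARENT_RE.match(env.get("TRACEPARENT", ""))
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent ids.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),
        "tracestate": env.get("TRACESTATE", ""),
    }
```

The agent loop would pass `child_env(...)` as the `env` argument when spawning the tool, for example via `subprocess.run(cmd, env=...)`. A subprocess that gets None from parent_from_env starts a new root trace, which is what the orphan count is meant to catch.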

The dashboard itself

infra/grafana/dashboards/capstone.json has exactly these panels (the demo's complete observability picture):

  1. RED — Rate. Requests per second. From rate(lynx_http_requests_total[1m]).
  2. RED — Errors. Error rate. From rate(lynx_http_requests_total{status=~"5.."}[1m]).
  3. RED — Duration. p50, p95, p99 end-to-end. From histogram_quantile(0.95, sum by (le) (rate(lynx_http_request_duration_seconds_bucket[1m]))), and likewise for 0.50 and 0.99.
  4. Per-stage latency. Stacked bar; one row per stage; sample-rate annotated.
  5. Latency budget vs actual. Two stacked bars, side by side, per-stage.
  6. Cost per request. Histogram over the last hour, with p50 / p95 vertical lines.
  7. Cost decomposition. Per-stage cost as a pie chart; Rule 3 identity-check status.
  8. CpQU. Static panel from docs/COSTS.md; week-over-week delta.
  9. Trace orphan count. Single-stat; threshold ≤ 0.
  10. Cost identity violations. Single-stat; threshold ≤ 0.

Ten panels. Every panel has a contract; every contract is testable. The lab 00 audit script runs each panel's query against Prometheus / Tempo and asserts non-empty results.

How CpQU connects to Phase 39's narrative

Phase 38 computes CpQU. Phase 39 displays it and, more importantly, interprets it for the demo viewer: "this 90-second run cost €0.00043; the weekly CpQU is €0.000NNN per quality unit." The viewer sees that the curriculum produced not just a working artifact but a measured one.

The interpretation matters because the curriculum's final lesson is "ML systems are measured, not vibed." The single CpQU number, displayed on the dashboard during the demo's last 5 seconds, is the curriculum's signature.

What this theory does NOT cover

  • How Prometheus actually scrapes. Phase 34 theory.
  • How OpenTelemetry batches spans. Phase 34 theory.
  • How Grafana dashboards are built. Lab 00 walks the click-export-commit workflow.
  • The math of CpQU. Phase 38 theory 04.
  • Streaming vs batch evaluation. Phase 20 theory; here we just consume Phase 20's accuracy.
  • Cost models for GPU. Phase 35; the demo is CPU.

Next: theory/04-security-and-threat-model-closeout.md — which three threat rows the demo replays and why.