Phase 34 — Observability, Cost & Capacity

Requires: 33 (Inference Serving: From FastAPI to Continuous Batching)
Teaches: observability · red-metrics · opentelemetry · prometheus · grafana · cost-accounting
Jump to any chapter from the phase reference index.

Pre-written per A12. This phase entry exists before Borja begins study. Theory and lab problem statements are stable drafts; solutions are written just-in-time at phase open.

After serving the model in Phase 33, the next job is to watch it serve. RED (what the service is doing), USE (what the hardware is doing), traces (what happened in one specific request), and cost (what that request was worth). Observability that does not measure cost is a pretty dashboard; with cost, it is engineering.


Goal

Wire instrumentation around the Phase 33 inference server such that, after 5 minutes of synthetic load, a single Grafana dashboard answers four questions:

  1. Rate / errors / latency of the service.
  2. Resource saturation (CPU, RAM, queue depth, KV-cache slots).
  3. One request's trace — every span from HTTP-in to HTTP-out.
  4. Cost of that request, broken into prefill, decode, and retrieval.

The phase introduces src/observability/ as the canonical home for metrics, traces, structured logs, and the cost tracker.
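
Question 1 above (rate, errors, latency) comes down to three numbers over a time window. The sketch below shows only that arithmetic, in plain Python with no metrics library. `RequestRecord`, `nearest_rank_percentile`, and `red_summary` are hypothetical names for illustration, not part of the course's src/observability/ module. In the lab, Prometheus would derive the same quantities from counters and histograms.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestRecord:
    """One finished request (hypothetical shape, assumed for this sketch)."""
    duration_s: float
    ok: bool


def nearest_rank_percentile(values, q):
    """Return the q-th percentile (integer q in 1..100) by the nearest-rank method."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[rank - 1]


def red_summary(records, window_s):
    """Summarize Rate, Errors, and Duration for requests seen in window_s seconds."""
    if not records:
        return {"rate_rps": 0.0, "error_ratio": 0.0, "latency_p95_s": None}
    errors = sum(1 for r in records if not r.ok)
    return {
        "rate_rps": len(records) / window_s,
        "error_ratio": errors / len(records),
        "latency_p95_s": nearest_rank_percentile(
            [r.duration_s for r in records], 95
        ),
    }


# 20 requests over a 10-second window; the two slowest ones failed.
records = [RequestRecord(duration_s=float(i), ok=i <= 18) for i in range(1, 21)]
summary = red_summary(records, window_s=10.0)
```

The p95 is reported instead of the mean because a few slow requests dominate what users experience, and the mean hides them.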

Read order

  1. theory/00-motivation.md — why "make it run" and "make it observable" are different problems.
  2. theory/01-red-use-metrics.md — the two metric philosophies, what they measure, what they miss, and how they combine for LLM serving.
  3. theory/02-cost-accounting.md — the cost formula, why p95 cost matters more than mean cost, the trap of label cardinality.
  4. theory/03-tracing-and-logging.md — OpenTelemetry spans, context propagation, the trace_id/span_id/request_id triad in structlog.
  5. lab/00-prom-grafana-up.md — bring Prometheus + Grafana up locally via docker-compose. One-shot bootstrap.
  6. lab/01-instrument-server.md — add the six core metrics to the Phase 33 server.
  7. lab/02-tracing-end-to-end.md — wire OTel through the request path; visualize a single trace in Tempo.
  8. lab/03-cost-and-loadtest.md — implement the cost tracker; run a load test; populate the dashboard; commit the screenshot.

solutions/ is empty during pre-write — populated at phase open after Borja's Phase 33 server is in place.

Definition of Done

See PHASE_34_PLAN.md §6. Briefly:

  • src/observability/ module wired into the Phase 33 server.
  • Dashboard JSON committed under infra/grafana/dashboards/llm.json.
  • A 5-minute, ≥100-RPS load test screenshot in experiments/34-load-test/.
  • Cost per 1k output tokens reported as a mean together with its p95 (e.g., "mean €0.0083, p95 €0.041").
  • /quiz 34 ≥ 70%.
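
The cost bullet above can be sketched as follows. This is one possible reading, under two stated assumptions: each request already has a euro cost split into prefill, decode, and retrieval (how those euros are derived is left to theory/02), and the mean and p95 are taken over the per-request cost per 1k output tokens. `RequestCost` and `cost_report` are hypothetical names, and the figures are made up for the example.

```python
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestCost:
    """Euro cost of one request, split by stage (hypothetical shape)."""
    prefill_eur: float
    decode_eur: float
    retrieval_eur: float
    output_tokens: int

    @property
    def total_eur(self):
        return self.prefill_eur + self.decode_eur + self.retrieval_eur

    @property
    def eur_per_1k_output_tokens(self):
        if self.output_tokens <= 0:
            raise ValueError("output_tokens must be positive")
        return self.total_eur / self.output_tokens * 1000


def cost_report(requests):
    """Mean and p95 of per-request cost per 1k output tokens (needs >= 2 requests)."""
    per_1k = [r.eur_per_1k_output_tokens for r in requests]
    # With n=20 the last of the 19 cut points is the 95th percentile.
    p95 = statistics.quantiles(per_1k, n=20, method="inclusive")[-1]
    return {"mean_eur": statistics.fmean(per_1k), "p95_eur": p95}


# Made-up data: 18 cheap requests and 2 expensive ones (long prompts, say).
cheap = RequestCost(0.002, 0.007, 0.001, output_tokens=1000)
expensive = RequestCost(0.10, 0.39, 0.01, output_tokens=1000)
report = cost_report([cheap] * 18 + [expensive] * 2)
```

With this data the mean is about €0.059 while the p95 is about €0.50. The gap between the two is the tail that a mean-only report would hide.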

What this phase intentionally does NOT cover

  • Distributed observability across multiple nodes. Phase 35 territory (since that's where multiple nodes appear).
  • GPU-specific metrics (nvidia-smi-style USE). Also Phase 35.
  • Long-term storage of metrics (Mimir, Thanos). YAGNI for a learning project; a single Prometheus instance with its default local storage is fine.
  • APM-style code-level profiling (py-spy, pyroscope). Touched in Phase 33 already; not re-introduced here.
  • Alerting + paging (Alertmanager, PagerDuty). Phase 38 territory.
  • Tracking model-quality metrics over time (drift, regression). Phase 38 + Phase 40.

Phase 34's scope is the operational observability of a single-node LLM serving stack. Nothing more.

Further reading

Optional — enrichment, not required to pass the phase.