Phase 34 — Observability, Cost & Capacity

Requires: 33 — Inference Serving: From FastAPI to Continuous Batching
Teaches: observability · red-metrics · opentelemetry · prometheus · grafana · cost-accounting
Jump to any chapter from the phase reference index.

Chapter map

Pre-written per A12. This phase entry exists before Borja begins study. Theory and lab problem statements are stable drafts; solutions are written just-in-time at phase open.

After serving the model in Phase 33, the next job is to watch it serve. RED (what the service is doing), USE (what the hardware is doing), traces (what happened in one specific request), and cost (how much that request cost). Observability that does not measure cost is a pretty dashboard; with cost, it is engineering.

Goal

Wire instrumentation around the Phase 33 inference server such that, after 5 minutes of synthetic load, a single Grafana dashboard answers four questions:

Rate / errors / latency of the service.
Resource saturation (CPU, RAM, queue depth, KV-cache slots).
One request's trace — every span from HTTP-in to HTTP-out.
Cost of that request, broken into prefill, decode, and retrieval.
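As a rough illustration of the first question, the sketch below computes the three RED numbers (rate, error ratio, p95 latency) from a list of finished requests using only the standard library. It is a stand-in for what Prometheus and Grafana would compute from the server's metrics, not the lab's implementation; `RequestRecord`, `percentile`, and `red_summary` are hypothetical names, and treating only 5xx responses as errors is an assumption.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestRecord:
    """One finished request (hypothetical shape, not the Phase 33 schema)."""
    latency_s: float
    status: int


def percentile(values: list[float], q: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def red_summary(records: list[RequestRecord], window_s: float) -> dict[str, float]:
    """Rate, error ratio, and p95 latency over one observation window."""
    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for r in records if r.status >= 500)
    return {
        "rate_rps": len(records) / window_s,
        "error_ratio": errors / len(records) if records else 0.0,
        "p95_latency_s": percentile([r.latency_s for r in records], 95) if records else 0.0,
    }
```

Calling `red_summary(records, window_s=300.0)` on the records collected during a 5-minute load test would yield one row of the dashboard's first panel. A real deployment would derive the same numbers from counters and histograms scraped over time rather than from an in-process list.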

The phase introduces src/observability/ as the canonical home for metrics, traces, structured logs, and the cost tracker.

Read order

theory/00-motivation.md — why "make it run" and "make it observable" are different problems.
theory/01-red-use-metrics.md — the two metric philosophies, what they measure, what they miss, and how they combine for LLM serving.
theory/02-cost-accounting.md — the cost formula, why p95 cost matters more than mean cost, the trap of label cardinality.
theory/03-tracing-and-logging.md — OpenTelemetry spans, context propagation, the trace_id/span_id/request_id triad in structlog.
lab/00-prom-grafana-up.md — bring Prometheus + Grafana up locally via docker-compose. One-shot bootstrap.
lab/01-instrument-server.md — add the six core metrics to the Phase 33 server.
lab/02-tracing-end-to-end.md — wire OTel through the request path; visualize a single trace in Tempo.
lab/03-cost-and-loadtest.md — implement the cost tracker; run a load test; populate the dashboard; commit the screenshot.
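To make the trace_id/span_id/request_id triad from theory/03 concrete, here is a minimal sketch that binds the three IDs to the current request context and emits them on every log line. It deliberately uses only the standard library (`contextvars`, `logging`, `json`) as a stand-in for structlog and the OpenTelemetry SDK, so the mechanics are visible; `start_request` and `JsonFormatter` are hypothetical names, and the IDs are random placeholders rather than values taken from a real span.

```python
import contextvars
import json
import logging
import secrets

# Per-request correlation IDs. A ContextVar keeps them isolated between
# concurrent asyncio tasks, which is what a request-scoped logger needs.
_ids: contextvars.ContextVar[dict[str, str]] = contextvars.ContextVar("ids", default={})


def start_request(request_id: str) -> dict[str, str]:
    """Bind a fresh trace_id/span_id pair plus the caller's request_id to this context."""
    ids = {
        # Sizes follow W3C Trace Context: 16-byte trace ID, 8-byte span ID, hex-encoded.
        "trace_id": secrets.token_hex(16),
        "span_id": secrets.token_hex(8),
        "request_id": request_id,
    }
    _ids.set(ids)
    return ids


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object that carries the bound IDs."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"event": record.getMessage(), "level": record.levelname, **_ids.get()}
        return json.dumps(payload)
```

Attaching `JsonFormatter()` to a handler and calling `start_request(...)` at the top of each request handler means any log line written during that request can be joined to its trace by `trace_id`. In the lab the IDs would come from the active OTel span instead of `secrets`.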

solutions/ is empty during pre-write — populated at phase open after Borja's Phase 33 server is in place.

Definition of Done

See PHASE_34_PLAN.md §6. Briefly:

src/observability/ module wired into the Phase 33 server.
Dashboard JSON committed under infra/grafana/dashboards/llm.json.
A screenshot of a 5-minute, ≥100-RPS load test committed under experiments/34-load-test/.
Cost per 1k output tokens reported as a mean alongside its p95 (e.g., "mean €0.0083, p95 €0.041").
/quiz 34 ≥ 70%.
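The cost line in the list above could be produced along the following lines. This is a sketch under stated assumptions, not the phase's cost tracker: `RequestCost` and `cost_report` are hypothetical names, per-stage costs are assumed to be already expressed in euros, and the p95 uses the nearest-rank method over per-request values.

```python
import math
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestCost:
    """Per-request cost split by stage, in euros (hypothetical record shape)."""
    prefill_eur: float
    decode_eur: float
    retrieval_eur: float
    output_tokens: int

    @property
    def eur_per_1k_output_tokens(self) -> float:
        total = self.prefill_eur + self.decode_eur + self.retrieval_eur
        return 1000 * total / self.output_tokens


def cost_report(costs: list[RequestCost]) -> str:
    """Format mean and nearest-rank p95 of cost per 1k output tokens."""
    # Requests that produced no output tokens have no defined per-token cost; skip them.
    per_1k = sorted(c.eur_per_1k_output_tokens for c in costs if c.output_tokens > 0)
    if not per_1k:
        raise ValueError("no requests with output tokens")
    rank = max(1, math.ceil(95 * len(per_1k) / 100))
    return f"mean €{statistics.fmean(per_1k):.4f}, p95 €{per_1k[rank - 1]:.4f}"
```

Note the design choice: this averages per-request ratios, which weights a short, expensive request the same as a long, cheap one. Dividing total cost by total output tokens is the other common definition and gives a different mean, so whichever one is reported should be stated next to the number.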

What this phase intentionally does NOT cover

Distributed observability across multiple nodes. Phase 35 territory (since that's where multiple nodes appear).
GPU-specific metrics (nvidia-smi-style USE). Also Phase 35.
Long-term storage of metrics (Mimir, Thanos). YAGNI for a learning project; a single local Prometheus instance with its built-in storage is fine.
APM-style code-level profiling (py-spy, pyroscope). Touched in Phase 33 already; not re-introduced here.
Alerting + paging (Alertmanager, PagerDuty). Phase 38 territory.
Tracking model-quality metrics over time (drift, regression). Phase 38 + Phase 40.

Phase 34's scope is the operational observability of a single-node LLM serving stack. Nothing more.