Phase 34: Observability, Cost & Capacity
Requires: Phase 33 (Inference Serving: From FastAPI to Continuous Batching)
Teaches: observability · red-metrics · opentelemetry · prometheus · grafana · cost-accounting
Jump to any chapter from the phase reference index.
Chapter map
Pre-written per A12. This phase entry exists before Borja begins study. Theory and lab problem statements are stable drafts; solutions are written just-in-time at phase open.
After serving the model in Phase 33, the next step is to watch it serve. RED (what the service is doing), USE (what the hardware is doing), traces (what happened in one specific request), and cost (what that request cost). Observability without cost measurement is a pretty dashboard; with cost, it is engineering.
Goal
Wire instrumentation around the Phase 33 inference server such that, after 5 minutes of synthetic load, a single Grafana dashboard answers four questions:
- Rate / errors / latency of the service.
- Resource saturation (CPU, RAM, queue depth, KV-cache slots).
- One request's trace — every span from HTTP-in to HTTP-out.
- Cost of that request, broken into prefill, decode, and retrieval.
The phase introduces src/observability/ as the canonical home for metrics, traces, structured logs, and the cost tracker.
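As a rough illustration of the first dashboard question, the sketch below derives rate, error ratio, and p95 latency from a list of finished requests using only the standard library. The `RequestRecord` shape, the `red_summary` helper, and the choice to count only 5xx responses as errors are assumptions made for this example. The labs instrument the real server with Prometheus rather than computing these values in process.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestRecord:
    """One finished request. Hypothetical shape, not the Phase 33 schema."""
    duration_s: float
    status: int


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def red_summary(records, window_s):
    """Rate, Errors, Duration over one observation window."""
    total = len(records)
    errors = sum(1 for r in records if r.status >= 500)  # assumption: 5xx only
    return {
        "rate_rps": total / window_s,
        "error_ratio": errors / total if total else 0.0,
        "latency_p95_s": (
            percentile([r.duration_s for r in records], 95) if total else None
        ),
    }


# 19 successful requests (0.1 s to 1.9 s) plus one slow failure, seen over 10 s.
records = [RequestRecord(i / 10, 200) for i in range(1, 20)]
records.append(RequestRecord(5.0, 503))
summary = red_summary(records, window_s=10.0)
print(summary)
```

Note that the single 5-second failure does not move the p95 here, which is one reason the error ratio is tracked as its own signal instead of being inferred from latency.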
Read order
- theory/00-motivation.md: why "make it run" and "make it observable" are different problems.
- theory/01-red-use-metrics.md: the two metric philosophies, what they measure, what they miss, and how they combine for LLM serving.
- theory/02-cost-accounting.md: the cost formula, why p95 cost matters more than mean cost, the trap of label cardinality.
- theory/03-tracing-and-logging.md: OpenTelemetry spans, context propagation, the trace_id / span_id / request_id triad in structlog.
- lab/00-prom-grafana-up.md: bring Prometheus + Grafana up locally via docker-compose. One-shot bootstrap.
- lab/01-instrument-server.md: add the six core metrics to the Phase 33 server.
- lab/02-tracing-end-to-end.md: wire OTel through the request path; visualize a single trace in Tempo.
- lab/03-cost-and-loadtest.md: implement the cost tracker; run a load test; populate the dashboard; commit the screenshot.
solutions/ is empty during pre-write — populated at phase open after Borja's Phase 33 server is in place.
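To make the trace_id / span_id / request_id triad concrete before reaching the tracing material, here is a minimal standard-library sketch of context propagation. It is a stand-in, not the OpenTelemetry or structlog API: `start_request`, `span`, and `log_event` are hypothetical helpers, and the identifier widths (32 and 16 hex characters) follow the W3C Trace Context format.

```python
import contextvars
import json
import secrets
from contextlib import contextmanager

# Context variables follow the current task or thread, so nested calls see the
# same identifiers without passing them as arguments.
_trace_id = contextvars.ContextVar("trace_id", default=None)
_span_id = contextvars.ContextVar("span_id", default=None)
_request_id = contextvars.ContextVar("request_id", default=None)


def start_request(request_id):
    """Open the root span for one incoming request (hypothetical helper)."""
    _request_id.set(request_id)
    _trace_id.set(secrets.token_hex(16))  # 32 hex chars, W3C trace-id width
    _span_id.set(secrets.token_hex(8))    # 16 hex chars, W3C parent-id width


@contextmanager
def span(name):
    """Child span: keeps trace_id, swaps span_id, restores it on exit."""
    token = _span_id.set(secrets.token_hex(8))
    try:
        yield name
    finally:
        _span_id.reset(token)


def log_event(event, **fields):
    """Return one JSON log line carrying the three identifiers."""
    record = {
        "event": event,
        "trace_id": _trace_id.get(),
        "span_id": _span_id.get(),
        "request_id": _request_id.get(),
        **fields,
    }
    return json.dumps(record, sort_keys=True)


start_request("req-0001")
http_in = json.loads(log_event("http_in"))
with span("decode"):
    decode = json.loads(log_event("decode_step", tokens=4))
http_out = json.loads(log_event("http_out"))
```

Every log line shares one trace_id and request_id, while span_id changes inside the `decode` block and reverts afterwards. That is the property that lets a log line be joined back to a specific span of a specific request.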
Definition of Done
See PHASE_34_PLAN.md §6. Briefly:
- src/observability/ module wired into the Phase 33 server.
- Dashboard JSON committed under infra/grafana/dashboards/llm.json.
- A 5-minute, ≥100-RPS load test screenshot in experiments/34-load-test/.
- Cost per 1k output tokens reported as a mean together with its p95 (e.g., "mean €0.0083, p95 €0.041").
- /quiz 34 ≥ 70%.
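The cost line in the checklist can be sketched as follows. This is an illustration under stated assumptions, not the phase's cost formula (that lives in theory/02-cost-accounting.md): the three unit prices are invented, `RequestUsage` is a hypothetical record, and the mean is taken over per-request values rather than as total cost divided by total tokens.

```python
import math
from dataclasses import dataclass

# Hypothetical unit prices in euros, chosen only to keep the arithmetic readable.
PRICE_PER_PREFILL_TOKEN = 1e-6
PRICE_PER_DECODE_TOKEN = 4e-6
PRICE_PER_RETRIEVAL_CALL = 1e-4


@dataclass(frozen=True)
class RequestUsage:
    """What one request consumed. Hypothetical shape."""
    prefill_tokens: int
    decode_tokens: int
    retrieval_calls: int


def request_cost(usage):
    """Cost of one request, split into prefill, decode, and retrieval."""
    return {
        "prefill": usage.prefill_tokens * PRICE_PER_PREFILL_TOKEN,
        "decode": usage.decode_tokens * PRICE_PER_DECODE_TOKEN,
        "retrieval": usage.retrieval_calls * PRICE_PER_RETRIEVAL_CALL,
    }


def cost_per_1k_output(usage):
    """Total request cost normalized to 1,000 output (decode) tokens."""
    if usage.decode_tokens <= 0:
        raise ValueError("need at least one output token to normalize")
    return sum(request_cost(usage).values()) / usage.decode_tokens * 1000


def nearest_rank(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


typical = RequestUsage(prefill_tokens=1000, decode_tokens=500, retrieval_calls=1)
long_prompt = RequestUsage(prefill_tokens=8000, decode_tokens=50, retrieval_calls=4)
usages = [typical] * 18 + [long_prompt] * 2

per_1k = [cost_per_1k_output(u) for u in usages]
mean_cost = sum(per_1k) / len(per_1k)
p95_cost = nearest_rank(per_1k, 95)
print(f"mean €{mean_cost:.4f}, p95 €{p95_cost:.4f}")
```

With these made-up numbers, two long-prompt, short-answer requests out of twenty push the p95 far above the mean, which is the kind of tail a mean-only report would hide.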
What this phase intentionally does NOT cover
- Distributed observability across multiple nodes. Phase 35 territory (since that's where multiple nodes appear).
- GPU-specific metrics (nvidia-smi-style USE). Also Phase 35.
- Long-term storage of metrics (Mimir, Thanos). YAGNI for a learning project; Prometheus's built-in local storage is fine.
- APM-style code-level profiling (py-spy, pyroscope). Touched in Phase 33 already; not re-introduced here.
- Alerting + paging (Alertmanager, PagerDuty). Phase 38 territory.
- Tracking model-quality metrics over time (drift, regression). Phase 38 + Phase 40.
Phase 34's scope is the operational observability of a single-node LLM serving stack. Nothing more.
Further reading
Optional — enrichment, not required to pass the phase.
- 📕 Site Reliability Engineering (the SRE Book), Google · 2016. SLOs, RED/USE, and what to actually alert on.
- 📘 OpenTelemetry Documentation, CNCF · 2024. The tracing standard you instrument with.