00 — Motivation: why observability is not optional for AI serving

"Making it work" and "knowing what it is doing" are two different problems. In ordinary web services you can skip observability and survive; in LLM services, latency, cost, and errors swing by orders of magnitude between requests. Without observability you are not operating, you are guessing.

You finished Phase 33 with a continuous-batching inference server that beats static batching on throughput at fixed p95 latency. Congratulations: you now own a black box that takes HTTP requests and emits tokens. That is not enough.

This page is about why observability is a distinct phase rather than a corner of the serving phase.

The naive position

The naive position says: "logs and a /health endpoint are observability. I can grep for errors and curl /health to see if it's alive. Done."

For a stateless CRUD service handling 10 requests per second, the naive position is correct enough to survive. For an LLM service it is wrong in three ways simultaneously, and each one bites at a different scale.

Why LLM serving breaks the naive position

1. Per-request cost varies by orders of magnitude

A CRUD service serving a GET /users/42 and a GET /users/9999 does roughly the same work for both — index lookup, serialize, return. Wall-clock per request is uniform within a factor of 2 or 3.

An LLM service serving "what's 2+2?" and "explain the buffer overflow in this 4 KB C file step by step" does wildly different work:

  • The first is a short prefill plus a handful of decode steps on a tiny KV cache. ~50 ms.
  • The second is a 4 KB prefill followed by a few-thousand-token decode. ~30 seconds.

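The ratio is 600×. The default histogram buckets in the Prometheus client libraries top out at 10 seconds, so the second request lands in the +Inf overflow bucket, where the histogram can no longer tell 11 seconds from 11 minutes. When a quantile falls in that overflow bucket, histogram_quantile reports the top finite bound instead of the real value. Your dashboard will read "p99 = 10 s" and lie.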

LLM serving requires deliberately shaped histogram buckets covering 4 decades of latency. That deliberateness is what this phase is about.
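One way to get buckets of that shape is to space them evenly on a log scale. The sketch below is only an illustration: the helper name `log_spaced_buckets`, the 10 ms to 100 s range, and the choice of three buckets per decade are assumptions, not values prescribed by this phase.

```python
import math


def log_spaced_buckets(low: float, high: float, per_decade: int = 3) -> list[float]:
    """Return bucket upper bounds (seconds), evenly spaced on a log scale."""
    if not 0 < low < high:
        raise ValueError("need 0 < low < high")
    steps = round(math.log10(high / low) * per_decade)
    return [round(low * 10 ** (i / per_decade), 4) for i in range(steps + 1)]


# Assumed range: 10 ms to 100 s, i.e. four decades of latency.
LLM_LATENCY_BUCKETS = log_spaced_buckets(0.01, 100.0)

# Both example requests from the text now land in a finite bucket.
for latency in (0.05, 30.0):
    upper = next(b for b in LLM_LATENCY_BUCKETS if latency <= b)
    print(f"{latency:>5} s -> bucket le={upper}")
```

A list like this can be passed as the `buckets` argument of `prometheus_client.Histogram`. Three buckets per decade is a trade-off: more buckets give finer quantile estimates, at the price of one extra time series per bucket and label combination.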

2. Cost is no longer a footnote

A web service has roughly fixed marginal cost per request — the box is on, the request adds a CPU-millisecond.

An LLM service has per-request cost that scales roughly linearly with output length and, for long prompts, quadratically with input length (because attention prefill is O(N²) in the prompt length N). A single "write me a 5000-word essay" request can cost more than a thousand "what's 2+2"s combined.

You cannot reason about whether your service is sustainable, or whether a particular customer/test pattern is abusing you, without per-request cost numbers. Cost has to be a first-class metric, not a quarterly finance spreadsheet exercise.

This phase makes cost a histogram in Prometheus, exactly the same shape as latency. You'll query "p95 cost per 1k output tokens" with the same fluency you query "p95 latency".
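To make "per-request cost" concrete, here is a minimal sketch that attributes cost by GPU time. Everything in it is an assumption made for illustration: the function names, the $2/hour GPU price, and the `batch_share` parameter (the fraction of the GPU attributed to one request under batching). It is not the phase's actual cost tracker.

```python
def request_cost_microdollars(gpu_seconds: float, gpu_hourly_usd: float,
                              batch_share: float = 1.0) -> float:
    """Cost attributed to one request, in microdollars (1e-6 USD)."""
    usd = gpu_seconds * batch_share * gpu_hourly_usd / 3600.0
    return usd * 1_000_000


def cost_per_1k_output_tokens(cost_microdollars: float, output_tokens: int) -> float:
    """Normalize a request's cost to microdollars per 1000 output tokens."""
    if output_tokens <= 0:
        raise ValueError("output_tokens must be positive")
    return cost_microdollars / output_tokens * 1000


GPU_HOURLY_USD = 2.0  # hypothetical price, not a quoted rate

# The two example requests from earlier: ~50 ms and ~30 s of GPU time.
short = request_cost_microdollars(0.05, GPU_HOURLY_USD)
long = request_cost_microdollars(30.0, GPU_HOURLY_USD)

print(f"short: {short:.1f} microdollars")
print(f"long:  {long:.1f} microdollars ({long / short:.0f}x the short one)")
```

Values like `short` and `long` are what would be observed into a cost histogram. Because they span the same range of magnitudes as latency, the cost buckets need the same deliberate, log-spaced shape.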

3. Errors are no longer binary

A web service returns 2xx (good) or 5xx (bad). Counting them is sufficient.

An LLM service returns 200 OK with wrong, hallucinated, refused, off-topic, or truncated content. From a Prometheus error-rate perspective, all of these are 200s — errors that look like successes.

Phase 34 does not solve quality measurement (that's Phase 20 / Phase 38). But it lays the trace and log foundation that quality-measurement tools attach to: every response carries a trace_id; when Phase 38 runs an LLM-as-judge over responses, it joins the judgments back to traces and metrics via that id.
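The join by trace_id can be sketched in a few lines. This is a toy illustration, not the phase's logging code: `new_trace_id`, `log_line`, and the verdict labels are hypothetical names, and the random id merely stands in for the 128-bit trace id (32 hex characters) that an OpenTelemetry SDK would supply.

```python
import json
import secrets


def new_trace_id() -> str:
    """Stand-in for a tracer-supplied id: 16 random bytes as 32 hex chars."""
    return secrets.token_hex(16)


def log_line(trace_id: str, **fields) -> str:
    """Emit one structured log record that carries the trace id."""
    return json.dumps({"trace_id": trace_id, **fields}, sort_keys=True)


# Serving time: both responses are 200 OK as far as error-rate metrics go.
tid_ok, tid_bad = new_trace_id(), new_trace_id()
logs = [
    log_line(tid_ok, status=200, output_tokens=12),
    log_line(tid_bad, status=200, output_tokens=900),
]

# Later: a quality evaluator produces verdicts keyed by the same id.
judgments = {tid_ok: "correct", tid_bad: "hallucinated"}

# Join the verdicts back onto the serving records.
joined = []
for line in logs:
    record = json.loads(line)
    record["verdict"] = judgments.get(record["trace_id"], "unjudged")
    joined.append(record)

for record in joined:
    print(record["trace_id"][:8], record["status"], record["verdict"])
```

The only contract between serving and evaluation is the id itself, which is why it has to be attached to every response from the start.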

The cost-of-not-doing-this

Three concrete failure modes that hit any team that skips this phase.

Failure mode A — "latency is fine, why are users complaining?" Default histogram buckets top out below your actual long tail, so p99 saturates at the top finite bucket (10 s with the defaults) no matter how slow real requests get. Users see 45 s; you see 10 s. You disagree with reality and lose.
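The saturation in failure mode A can be reproduced without a running Prometheus. The sketch below is a simplified re-implementation of how histogram_quantile estimates a quantile from bucket counts (linear interpolation inside a bucket, top finite bound when the quantile falls in +Inf). `DEFAULT_BUCKETS` mirrors the Python client's defaults; `LLM_BUCKETS` and the synthetic traffic mix are assumptions for illustration.

```python
import bisect

# Default buckets of the Python prometheus_client, in seconds.
DEFAULT_BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25,
                   0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0]
# Hypothetical LLM-shaped buckets covering 10 ms to 100 s.
LLM_BUCKETS = [0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0,
               2.5, 5.0, 10.0, 25.0, 50.0, 100.0]


def estimate_quantile(q: float, bounds: list[float], observations: list[float]) -> float:
    """Estimate quantile q from bucketed counts, histogram_quantile style."""
    if not observations or not 0 < q <= 1:
        raise ValueError("need observations and 0 < q <= 1")
    counts = [0] * (len(bounds) + 1)  # last slot is the +Inf bucket
    for x in observations:
        counts[bisect.bisect_left(bounds, x)] += 1
    rank = q * len(observations)
    cumulative = 0
    for i, count in enumerate(counts):
        if cumulative + count >= rank:
            if i == len(bounds):  # quantile fell into +Inf
                return bounds[-1]
            lower = bounds[i - 1] if i > 0 else 0.0
            return lower + (bounds[i] - lower) * (rank - cumulative) / count
        cumulative += count
    return bounds[-1]


# Synthetic traffic: 90 fast requests (50 ms) and 10 slow ones (45 s).
latencies = [0.05] * 90 + [45.0] * 10

p99_default = estimate_quantile(0.99, DEFAULT_BUCKETS, latencies)
p99_llm = estimate_quantile(0.99, LLM_BUCKETS, latencies)
print(f"default buckets: p99 = {p99_default} s")
print(f"LLM buckets:     p99 = {p99_llm} s")
```

With the default buckets the estimate cannot exceed 10 s however slow the tail is. With the wider buckets it lands inside the bucket that contains the real 45 s observations, accurate only to within that bucket's width.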

Failure mode B — "we're losing money on this customer." Without per-request cost, you discover this from the monthly cloud bill, three weeks after the bleeding started. With per-request cost in Prometheus, you write a single PromQL alert: "alert when 95th-percentile cost per 1k tokens exceeds $0.10 for 10 minutes."

Failure mode C — "a request died somewhere in the pipeline; where?" Without distributed tracing, you have one log line per stage (tokenize, retrieve, prefill, decode), no correlation between them, and you grep through 100k log lines to find the right thread. With OpenTelemetry, you click the failing request's trace_id in Grafana and see the entire span tree, with timing per span.
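What a span tree buys you can be shown with a toy recorder. This is deliberately not the OpenTelemetry API: `Trace`, `Span`, and `handle_request` are hypothetical names, the spans are a flat list rather than a tree, and the failure is simulated. It only illustrates the idea that every stage reports its timing and error under one shared trace_id.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    duration_s: float
    error: Optional[str] = None


@dataclass
class Trace:
    trace_id: str
    spans: list = field(default_factory=list)

    @contextmanager
    def span(self, name: str):
        """Record the duration and any exception of one pipeline stage."""
        start = time.perf_counter()
        error = None
        try:
            yield
        except Exception as exc:
            error = repr(exc)
            raise
        finally:
            self.spans.append(Span(name, time.perf_counter() - start, error))


def handle_request(trace: Trace) -> None:
    with trace.span("tokenize"):
        pass
    with trace.span("retrieve"):
        pass
    with trace.span("prefill"):
        raise RuntimeError("simulated failure")  # the request dies here
    with trace.span("decode"):
        pass


trace = Trace(trace_id="4bf92f3577b34da6a3ce929d0e0e4736")
try:
    handle_request(trace)
except RuntimeError:
    pass

for s in trace.spans:
    print(f"{trace.trace_id[:8]} {s.name:<9} {s.duration_s * 1e3:.3f} ms {s.error or 'ok'}")
```

The stage that failed is the last recorded span with an error, and the stage that never ran (decode) is simply absent. That is the question grep over uncorrelated log lines cannot answer quickly.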

These are not exotic; these are the first three things to break.

The structure of Phase 34

The phase has four units, each landing one piece of the operational picture:

| Unit | What | Why |
|---|---|---|
| RED metrics | Rate, Errors, Duration | Service-level health |
| USE metrics | Utilization, Saturation, Errors | Resource-level health |
| Tracing | OpenTelemetry spans + logs | Per-request causality |
| Cost | Histogram of $_per_req | Sustainability |

These four are not interchangeable; they answer different questions. A mistake the naive practitioner makes is "I have latency metrics, that's enough" — latency is RED only, and it tells you the service is slow without telling you why. USE tells you why (queue full, KV cache thrashing). Tracing tells you which specific request's path went wrong. Cost tells you whether the slowness is even worth fixing.

How this connects to the wider curriculum

  • Phase 33 built the server. Phase 34 instruments it.
  • Phase 35 introduces multiple nodes; tracing context now propagates across processes, and USE metrics now include GPU utilization. Phase 34's foundation must be propagable.
  • Phase 37 uses observability as a defense: anomalous traces (a request with 1000× normal token output, a span whose RAG retrieval returned suspect content) become attack signals.
  • Phase 38 does cost-aware capacity planning and autoscaling on tokens_per_second instead of CPU%. That needs both the cost histogram and the queue-depth gauge from Phase 34.
  • Phase 40 load-tests the full system; Phase 34's dashboards are the artifact of that load test.

So the metrics, traces, and cost framework you wire in this phase live in every subsequent phase. It is not throwaway scaffolding.

One-paragraph recap

Observability defaults are calibrated for CRUD web services and lie about LLM workloads in three specific ways: histogram buckets top out below the long tail, cost is invisible despite varying by orders of magnitude between requests, and "200 OK with garbage output" looks identical to "200 OK with correct output". Phase 34 addresses all three by introducing RED + USE metrics with LLM-shaped buckets, OpenTelemetry tracing through every pipeline stage, structured logging joined to traces by trace_id, and a CostTracker that emits per-request cost as a Prometheus histogram in microdollars. The phase is small in code volume but high-leverage: everything downstream depends on these foundations.


Next: theory/01-red-use-metrics.md.