01 - RED + USE metrics
RED measures the service (what users see). USE measures the hardware (what the service has to put up with). Together they are complete; apart, each is blind. LLMs need additional specific metrics, such as KV cache slots and batcher queue depth, that do not fall cleanly into either.
There are two metric philosophies that dominate operational monitoring, and they were developed independently:
- RED (Tom Wilkie, Weaveworks, ~2015) for services: rates, errors, durations.
- USE (Brendan Gregg, then at Joyent, ~2012) for resources: utilization, saturation, errors.
They are complementary. RED tells you what the user sees. USE tells you what's pushing back from the hardware. A mature observability story uses both. This page derives each, fits them to LLM serving, and lists the LLM-specific metrics that live on top.
RED
For each service (HTTP handler, gRPC method, internal queue), record three time series:
| Letter | Metric | Units | Prometheus type |
|---|---|---|---|
| R | Rate | requests / second | counter (rated via rate()) |
| E | Errors | errors / second, broken out by status | counter |
| D | Duration | latency distribution | histogram |
Why these three? Because together they characterize a service the way an SLO does: "99% of requests succeed in under T seconds."
- Rate alone tells you load; without errors, you don't know if load broke the service.
- Errors alone tell you brokenness; without rate, you don't know if it's bad-input or bad-service.
- Duration alone tells you slowness; without rate and errors, you can't tell whether a slow service that completes a few requests is better or worse off than a fast one whose requests all fail.
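As a minimal sketch of how the three signals combine into an SLO-style statement, the function below takes one window of request records and reports the rate, the error ratio, and the fraction of requests that both succeeded and finished under T. The names `Request` and `red_summary` and the sample values are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    ok: bool        # True if the request succeeded
    seconds: float  # end-to-end latency in seconds


def red_summary(requests, window_seconds, threshold_seconds):
    """Return (rate, error_ratio, good_fraction) for one observation window."""
    total = len(requests)
    if total == 0:
        # Assumption: an idle window counts as meeting the SLO.
        return 0.0, 0.0, 1.0
    errors = sum(1 for r in requests if not r.ok)
    good = sum(1 for r in requests if r.ok and r.seconds < threshold_seconds)
    return total / window_seconds, errors / total, good / total


# Made-up window: three successes (one of them slow) and one failure.
window = [Request(True, 0.4), Request(True, 1.2), Request(False, 0.1), Request(True, 3.5)]
rate, error_ratio, good_fraction = red_summary(window, window_seconds=60, threshold_seconds=2.0)
```

Note that the failed request is also the fastest one, which is why duration cannot be read without errors next to it.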
RED for LLM serving
For the Phase 33 server, RED instantiates as:
- Rate. `http_requests_total{endpoint, method, status}` is a Prometheus counter. PromQL: `rate(http_requests_total[1m])` gives RPS.
- Errors. Already in the same counter: filter by `status=~"5.."`. A separate counter, `llm_generation_errors_total{reason}`, covers application errors (e.g., `prompt_too_long`, `kv_cache_full`).
- Duration. Two histograms:
  - `request_duration_seconds`: end-to-end (HTTP-in to HTTP-out).
  - `time_to_first_token_seconds` (TTFT): the LLM-specific one. Distinguishes "user is staring at a blank screen" from "user is reading the response".
Bucket choice matters. For LLMs, use 13 buckets covering ~3.5 decades, from sub-second to two minutes. Don't add more: each bucket is its own time series, and cardinality is your enemy.
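One possible 13-bucket layout is sketched below. The boundaries are an illustrative assumption, not the lab's canonical list; the helper `decades_covered` is likewise hypothetical and only checks how many orders of magnitude the layout spans.

```python
import math

# Illustrative assumption: 13 upper bounds in seconds, sub-second to two minutes.
LLM_LATENCY_BUCKETS = (
    0.05, 0.1, 0.25, 0.5, 1.0, 2.0, 5.0,
    10.0, 20.0, 30.0, 60.0, 90.0, 120.0,
)


def decades_covered(buckets):
    """Orders of magnitude between the smallest and largest finite bound."""
    return math.log10(max(buckets) / min(buckets))


bucket_count = len(LLM_LATENCY_BUCKETS)
decades = decades_covered(LLM_LATENCY_BUCKETS)  # roughly 3.4
```

A tuple like this can be passed as the `buckets=` argument of a `prometheus_client` `Histogram`, which appends the `+Inf` bucket itself.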
USE
For each resource (CPU, RAM, disk, network NIC, GPU, queue), record three time series:
| Letter | Metric | Question |
|---|---|---|
| U | Utilization | what fraction of the resource is busy? |
| S | Saturation | how much extra work is queued waiting for the resource? |
| E | Errors | what error events did the resource emit? |
The USE method's clarity comes from a simple flowchart: for every resource, ask the three questions. If any answer is concerning, drill there first.
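The flowchart can be sketched as a loop over resources that asks the three questions in order. Everything here is hypothetical: the snapshot format, the 0.8 utilization limit, and the readings are assumptions made for the example, not values from the lab.

```python
def use_findings(snapshot, util_limit=0.8):
    """Walk every resource and return the (resource, question) pairs that look concerning."""
    findings = []
    for resource, reading in snapshot.items():
        if reading["utilization"] > util_limit:
            findings.append((resource, "utilization"))
        if reading["saturation"] > 0:
            findings.append((resource, "saturation"))
        if reading["errors"] > 0:
            findings.append((resource, "errors"))
    return findings


# Made-up readings: utilization as a fraction, saturation as queued work, errors as a count.
snapshot = {
    "cpu": {"utilization": 0.55, "saturation": 0, "errors": 0},
    "batcher_queue": {"utilization": 0.95, "saturation": 12, "errors": 0},
    "kv_cache": {"utilization": 0.70, "saturation": 0, "errors": 3},
}
findings = use_findings(snapshot)
```

With these readings the batcher queue is the first place to drill: it is both nearly full and has work waiting.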
USE for an LLM serving box (CPU-only, Phase 34)
| Resource | U | S | E |
|---|---|---|---|
| CPU | node_cpu_seconds_total{mode=~"user\|system"} ratioed | node_load1 | node_cpu_seconds_total{mode="iowait"} (proxy) |
| RAM | node_memory_MemAvailable_bytes | node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes | node_vmstat_oom_kill |
| Disk | node_disk_io_time_seconds_total | node_disk_io_now | node_disk_io_errors_total (if exposed) |
| Batcher queue | llm_batcher_active_slots | llm_batcher_queue_depth | llm_batcher_admission_errors_total |
| KV cache | llm_kv_cache_slots_used | llm_kv_cache_evictions_total | n/a (allocation failures = batcher admission error) |
Most of the OS-level rows are provided "for free" by node_exporter. The LLM-specific rows — batcher queue, KV cache — you wire by hand in src/observability/metrics.py.
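"Ratioed" in the CPU row means dividing the increase in busy-mode seconds by the increase across all modes between two scrapes. A minimal sketch of that arithmetic follows; the function name and the sample counter values are made up, and a real `node_exporter` reports more modes than the four shown.

```python
def cpu_utilization(before, after, busy_modes=("user", "system")):
    """Fraction of CPU time spent in busy_modes between two scrapes of per-mode counters."""
    deltas = {mode: after[mode] - before[mode] for mode in after}
    total = sum(deltas.values())
    if total <= 0:
        return 0.0  # no time elapsed (or a counter reset); report nothing rather than guess
    return sum(deltas[mode] for mode in busy_modes) / total


# Hypothetical cumulative CPU seconds per mode at two scrape times.
before = {"user": 100.0, "system": 40.0, "iowait": 5.0, "idle": 855.0}
after = {"user": 130.0, "system": 50.0, "iowait": 6.0, "idle": 874.0}
utilization = cpu_utilization(before, after)  # 40 busy seconds out of 60
```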
USE for a multi-GPU box (Phase 35 forward reference)
Adds:
| Resource | U | S | E |
|---|---|---|---|
| GPU SM | DCGM_FI_DEV_GPU_UTIL | n/a (no queue concept) | DCGM_FI_DEV_XID_ERRORS |
| HBM bandwidth | DCGM_FI_DEV_MEM_COPY_UTIL | n/a | DCGM_FI_DEV_MEM_ERRORS |
| HBM capacity | DCGM_FI_DEV_FB_USED / FB_TOTAL | OOM events | OOM-killer events |
| NCCL collective | latency histogram | active-collective gauge | timeout counter |
NVIDIA DCGM exporter is the standard. We'll wire it in Phase 35; not in scope here.
LLM-specific metrics that don't fit RED or USE
Five metrics sit in their own category:
- `tokens_total{kind="prompt|completion"}`: counter. Cumulative tokens served. Used to compute throughput in tokens/sec and cost per 1k tokens.
- `time_to_first_token_seconds`: histogram. The tail of this is what "feels slow" to a user during streaming.
- `inter_token_latency_seconds`: histogram. Steady-state generation speed.
- `generation_length_tokens`: histogram. Right-skewed distribution; lets you spot the "user asked for a 10k-token essay" outliers.
- `cost_per_request_usd`: histogram. Covered in `theory/02-cost-accounting.md`.
These five plus RED plus USE give you the complete LLM dashboard. Six panels for RED, ~6 for USE, 5 for LLM-specifics. ~17 panels in one dashboard.
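To make the token counter concrete, here is a rough sketch of turning two samples of `tokens_total{kind="completion"}` into tokens/sec and a cost per 1k tokens. The helper `counter_rate`, the sample values, and the flat hourly box cost are all assumptions for illustration; unlike PromQL's `rate()`, this sketch does not handle counter resets.

```python
def counter_rate(first, last):
    """Per-second increase between two (timestamp_seconds, value) counter samples."""
    (t0, v0), (t1, v1) = first, last
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    return (v1 - v0) / (t1 - t0)


# Made-up samples of the completion-token counter, 60 seconds apart.
completion_tokens_per_sec = counter_rate((0.0, 12_000), (60.0, 15_000))

# Hypothetical figure: what the serving box costs per hour, in USD.
box_usd_per_hour = 1.80
usd_per_1k_completion_tokens = (box_usd_per_hour / 3600) / completion_tokens_per_sec * 1000
```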
A worked example: what alerts should you write?
Three alerts that any LLM service should have on day one:
- High error rate. PromQL: `(sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) > 0.01` for 5 minutes. Goal: catch a deployment that broke generation.
- Slow TTFT p95. `histogram_quantile(0.95, rate(time_to_first_token_seconds_bucket[5m])) > 2` for 5 minutes. Goal: catch prefill regressions.
- Cost explosion. `histogram_quantile(0.95, rate(cost_per_1k_completion_tokens_usd_bucket[15m])) > 0.05` for 15 minutes. Goal: catch a runaway prompt pattern.
These three are the on-call alerts for an LLM service.
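The TTFT alert depends on how `histogram_quantile` turns cumulative bucket counts into a percentile: it finds the bucket containing the target rank and interpolates linearly inside it. The sketch below is a simplified re-implementation of that idea for intuition only. It uses raw counts where PromQL would use rates, assumes the lowest bucket starts at 0, and the bucket data is made up.

```python
import math


def histogram_quantile(q, buckets):
    """buckets: (upper_bound, cumulative_count) pairs sorted by bound, ending with +Inf."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # Quantile lands in the +Inf bucket: report the highest finite bound.
                return buckets[i - 1][0]
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Linear interpolation between the bucket's lower and upper bound.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


# Made-up TTFT histogram: 100 observations in total.
ttft_buckets = [(0.5, 60), (1.0, 85), (2.0, 94), (5.0, 99), (math.inf, 100)]
p95 = histogram_quantile(0.95, ttft_buckets)
alert_fires = p95 > 2
```

Because the answer is interpolated, its precision is limited by the bucket width around the threshold, which is one more reason bucket choice matters.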
Cardinality: the silent killer
Every label combination is a separate time series. As a rough rule of thumb, a single Prometheus server on commodity hardware handles on the order of a million active series; far beyond that, memory use and query latency degrade badly.
Labels with bounded cardinality (good):
- `endpoint` (5–20 values)
- `method` (4 values)
- `status` (10s of values)
- `model_name` (a handful)
- `tenant_id` if you have a closed list
Labels that explode (bad):
- `user_id` (unbounded)
- `prompt_hash` (unbounded)
- `request_id` (unbounded; this is what `trace_id` is for, not metrics)
- `prompt_first_100_chars` (effectively unbounded)
Rule: if a label can take more than ~1000 distinct values across all time, it doesn't belong in a metric. It belongs in a log or a trace span attribute, where storage cost is linear in events not in series-count.
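A quick way to sanity-check a label set is to multiply the cardinalities, which gives the worst-case number of series for one metric. The helper below is a hypothetical sketch; the label counts are illustrative picks from the ranges above, and the 50,000 users figure is invented.

```python
import math


def series_count(label_cardinalities, series_per_label_set=1):
    """Worst-case series for one metric: product of label cardinalities times series per label set."""
    return math.prod(label_cardinalities.values()) * series_per_label_set


bounded = {"endpoint": 10, "method": 4, "status": 10}

counter_series = series_count(bounded)

# A 13-bucket histogram exposes 13 bucket series, a +Inf bucket, plus _sum and _count.
histogram_series = series_count(bounded, series_per_label_set=13 + 1 + 2)

# Adding one unbounded label (here: 50,000 users) multiplies everything.
with_user_id = series_count({**bounded, "user_id": 50_000})
```

The product is an upper bound, since not every combination occurs in practice, but it shows how a single unbounded label takes a 400-series counter to 20 million.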
The Prometheus client mental model
For Borja's lab, the prometheus_client library exposes four metric types:
- `Counter`: monotonically increasing. `.inc()`. Use for rates and totals.
- `Gauge`: arbitrary value. `.set()`, `.inc()`, `.dec()`. Use for things that go up and down (queue depth, slots used).
- `Histogram`: pre-bucketed. `.observe(value)`. Use for latency, cost, anything where you want percentiles.
- `Summary`: like a histogram but with quantile estimation done client-side. Avoid: non-aggregable across instances. Use histograms.
For Phase 34 you use exactly three: Counter, Gauge, Histogram. Never Summary.
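A minimal sketch of the three types in use, assuming `prometheus_client` is installed. The metric names come from this page; the `/generate` endpoint value, the help strings, and the short bucket list are illustrative assumptions, and this is not the contents of `src/observability/metrics.py`.

```python
from prometheus_client import CollectorRegistry, Counter, Gauge, Histogram, generate_latest

# A private registry keeps the sketch self-contained and safe to re-run.
registry = CollectorRegistry()

requests_total = Counter(
    "http_requests_total", "HTTP requests served.",
    ["endpoint", "method", "status"], registry=registry,
)
queue_depth = Gauge(
    "llm_batcher_queue_depth", "Requests waiting for a batcher slot.",
    registry=registry,
)
ttft = Histogram(
    "time_to_first_token_seconds", "Time from request start to first token.",
    buckets=(0.1, 0.5, 1.0, 2.0, 5.0),  # shortened for the example
    registry=registry,
)

# Counter: only goes up. Gauge: set to the current value. Histogram: observe each event.
requests_total.labels(endpoint="/generate", method="POST", status="200").inc()
queue_depth.set(3)
ttft.observe(0.42)

# Text exposition format, as a scrape of this registry would return it.
exposition = generate_latest(registry).decode()
```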
One-paragraph recap
RED (rate, errors, duration) characterizes a service by what its users see; USE (utilization, saturation, errors) characterizes a resource by what's pushing back from below. Together they're complete; alone they're each blind to half the picture. LLM serving adds five LLM-specific metrics (token counters, TTFT, ITL, generation length, cost-per-request) that don't fit cleanly into either. Histogram buckets must be LLM-shaped (sub-second to two-minute, 13 buckets). Labels must have bounded cardinality — per-user dimensions belong in traces/logs, not metrics.
Next: theory/02-cost-accounting.md.