01 — RED + USE metrics

RED measures the service (what users see). USE measures the hardware (what the service suffers). Together they are complete; separately they are blind. LLMs need additional, specific metrics (KV cache slots, batcher queue depth) that do not fall cleanly into either of the two.

There are two metric philosophies that dominate operational monitoring, and they were developed independently:

  • RED (Tom Wilkie, Weaveworks, ~2015) for services: rate, errors, duration.
  • USE (Brendan Gregg, then at Joyent, ~2012) for resources: utilization, saturation, errors.

They are complementary. RED tells you what the user sees. USE tells you what's pushing back from the hardware. A mature observability story uses both. This page derives each, fits them to LLM serving, and lists the LLM-specific metrics that live on top.

RED

For each service (HTTP handler, gRPC method, internal queue), record three time series:

| Letter | Metric | Units | Prometheus type |
|---|---|---|---|
| R | Rate | requests / second | counter (rated via rate()) |
| E | Errors | errors / second, broken out by status | counter |
| D | Duration | latency distribution | histogram |

Why these three? Because together they characterize a service the way a SLO does: "99% of requests succeed in under T seconds."

  • Rate alone tells you load; without errors, you don't know if load broke the service.
  • Errors alone tell you brokenness; without rate, you don't know if it's bad-input or bad-service.
  • Duration alone tells you slowness; without rate and errors, you can't tell a slow window in which the few requests that arrived all succeeded from a fast window in which every request failed.
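To make the SLO framing concrete, here is a minimal sketch that scores a window of requests against "succeed in under T seconds". The `Request` type, the `slo_compliance` helper, and the sample windows are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass

@dataclass
class Request:
    ok: bool        # False if the request ended in an error
    seconds: float  # end-to-end duration

def slo_compliance(requests, threshold_s):
    """Fraction of requests that succeeded AND finished under threshold_s."""
    if not requests:
        return 1.0  # no traffic: treated as compliant (a convention, not a rule)
    good = sum(1 for r in requests if r.ok and r.seconds < threshold_s)
    return good / len(requests)

# Slow window: only a few requests, all succeeded, all over the threshold.
slow_few = [Request(ok=True, seconds=4.0)] * 10
# Fast window: plenty of load, every request failed quickly.
fast_failed = [Request(ok=False, seconds=0.05)] * 1000
# Healthy window: 99 fast successes and one failure.
healthy = [Request(ok=True, seconds=0.3)] * 99 + [Request(ok=False, seconds=0.3)]

print(slo_compliance(slow_few, threshold_s=2.0))     # 0.0
print(slo_compliance(fast_failed, threshold_s=2.0))  # 0.0
print(slo_compliance(healthy, threshold_s=2.0))      # 0.99
```

A duration-only view would rank the fast-but-failing window as the healthiest of the three; only the combination of rate, errors, and duration separates them.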

RED for LLM serving

For the Phase 33 server, RED instantiates as:

  • Rate. http_requests_total{endpoint, method, status} — Prometheus counter. PromQL: rate(http_requests_total[1m]) gives RPS.
  • Errors. Already in the same counter: filter by status=~"5.." (the status label holds the numeric HTTP code). Separate counter llm_generation_errors_total{reason} for application errors (e.g., prompt_too_long, kv_cache_full).
  • Duration. Two histograms:
      • request_duration_seconds: end-to-end (HTTP-in to HTTP-out).
      • time_to_first_token_seconds (TTFT): the LLM-specific one. Distinguishes "user is staring at a blank screen" from "user is reading the response".
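The list above could be wired up roughly as follows with prometheus_client. This is a sketch, not the Phase 33 server's actual code: the `handle` function, the token iterable, and the mapping of `ValueError` to `prompt_too_long` are assumptions made for illustration, and the histograms use the library's default buckets.

```python
import time
from prometheus_client import CollectorRegistry, Counter, Histogram

# A private registry keeps this sketch from colliding with metrics
# registered elsewhere in the process.
registry = CollectorRegistry()

HTTP_REQUESTS = Counter(
    "http_requests_total", "HTTP requests by endpoint, method and status.",
    ["endpoint", "method", "status"], registry=registry,
)
GENERATION_ERRORS = Counter(
    "llm_generation_errors_total", "Application-level generation errors.",
    ["reason"], registry=registry,
)
REQUEST_DURATION = Histogram(
    "request_duration_seconds", "End-to-end request latency.", registry=registry,
)
TTFT = Histogram(
    "time_to_first_token_seconds", "Latency until the first token is emitted.",
    registry=registry,
)

def handle(endpoint, method, token_stream):
    """Consume token_stream (a hypothetical iterable of tokens), recording RED metrics."""
    start = time.perf_counter()
    status = "200"
    tokens = []
    try:
        for i, tok in enumerate(token_stream):
            if i == 0:
                TTFT.observe(time.perf_counter() - start)
            tokens.append(tok)
    except ValueError:
        # Assumption: the generator raises ValueError for an over-long prompt.
        GENERATION_ERRORS.labels(reason="prompt_too_long").inc()
        status = "400"
    except Exception:
        status = "500"
        raise
    finally:
        # Rate and errors share one counter; duration gets its own histogram.
        REQUEST_DURATION.observe(time.perf_counter() - start)
        HTTP_REQUESTS.labels(endpoint=endpoint, method=method, status=status).inc()
    return tokens

handle("/v1/completions", "POST", iter(["Hello", ",", " world"]))
```

Recording the counter and the duration in `finally` means failed requests are counted too, which is what makes the error ratio computable from a single counter.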

Bucket choice matters. For LLMs:

buckets = (0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 20, 30, 60, 120, float("inf"))

13 buckets (12 finite bounds plus +Inf) spanning 0.05 s to 120 s, about 3.4 decades. Don't add more: each bucket is its own time series, and cardinality is your enemy.
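A quick back-of-the-envelope check of what these buckets cost. The `histogram_series` helper and the 20 x 4 label combination are hypothetical; the count ignores the extra `_created` series that some client versions also emit.

```python
import math

buckets = (0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 20, 30, 60, 120, float("inf"))

# Span of the finite bounds, in powers of ten.
finite = [b for b in buckets if math.isfinite(b)]
decades = math.log10(max(finite) / min(finite))

def histogram_series(n_buckets, label_combinations):
    """Series exposed by one histogram: one _bucket per bound, plus _sum and _count."""
    return (n_buckets + 2) * label_combinations

print(len(buckets), round(decades, 2))         # 13 3.38
print(histogram_series(len(buckets), 1))       # 15
print(histogram_series(len(buckets), 20 * 4))  # 1200 (hypothetical: 20 endpoints x 4 methods)
```

Every extra bucket adds one series per label combination, so the cost of a finer histogram multiplies with the labels attached to it.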

USE

For each resource (CPU, RAM, disk, network NIC, GPU, queue), record three time series:

Letter Metric Question
U Utilization what fraction of the resource is busy?
S Saturation how much extra work is queued waiting for the resource?
E Errors what error events did the resource emit?

The USE method's clarity comes from a simple flowchart: for every resource, ask the three questions. If any answer is concerning, drill there first.
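The flowchart can be sketched as a loop over resources. The `Reading` type, the thresholds, and the sample values are hypothetical, and checking errors before utilization and saturation is one reasonable ordering, not a requirement.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Reading:
    resource: str
    utilization: float  # fraction of capacity in use, 0.0 to 1.0
    saturation: float   # amount of queued work (units depend on the resource)
    errors: int         # error events seen in the window

def first_concern(readings, max_utilization=0.9, max_saturation=0.0) -> Optional[str]:
    """Walk the resources in order and return the first concerning answer."""
    for r in readings:
        if r.errors > 0:
            return f"{r.resource}: errors"
        if r.utilization > max_utilization:
            return f"{r.resource}: utilization"
        if r.saturation > max_saturation:
            return f"{r.resource}: saturation"
    return None  # nothing concerning: look elsewhere

readings = [
    Reading("cpu", utilization=0.55, saturation=0.0, errors=0),
    Reading("ram", utilization=0.70, saturation=0.0, errors=0),
    Reading("batcher_queue", utilization=0.85, saturation=12.0, errors=0),
]
print(first_concern(readings))  # batcher_queue: saturation
```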

USE for an LLM serving box (CPU-only, Phase 34)

| Resource | U | S | E |
|---|---|---|---|
| CPU | node_cpu_seconds_total{mode=~"user\|system"}, as a ratio of total CPU seconds | node_load1 | node_cpu_seconds_total{mode="iowait"} (proxy) |
| RAM | node_memory_MemAvailable_bytes (relative to node_memory_MemTotal_bytes) | node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes (swap in use) | node_vmstat_oom_kill |
| Disk | node_disk_io_time_seconds_total | node_disk_io_now | node_disk_io_errors_total (if exposed) |
| Batcher queue | llm_batcher_active_slots | llm_batcher_queue_depth | llm_batcher_admission_errors_total |
| KV cache | llm_kv_cache_slots_used | llm_kv_cache_evictions_total | n/a (allocation failures = batcher admission error) |

Most of the OS-level rows are provided "for free" by node_exporter. The LLM-specific rows — batcher queue, KV cache — you wire by hand in src/observability/metrics.py.
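As an illustration of the hand-wired rows, the batcher metrics might be published like this. The metric names come from the table; `ToyBatcher` and its admission logic are hypothetical stand-ins, not the contents of src/observability/metrics.py.

```python
from collections import deque
from prometheus_client import CollectorRegistry, Counter, Gauge

registry = CollectorRegistry()  # private registry, so the sketch is self-contained

ACTIVE_SLOTS = Gauge("llm_batcher_active_slots", "Batch slots currently decoding.", registry=registry)
QUEUE_DEPTH = Gauge("llm_batcher_queue_depth", "Requests waiting for a batch slot.", registry=registry)
ADMISSION_ERRORS = Counter("llm_batcher_admission_errors_total", "Requests rejected at admission.", registry=registry)

class ToyBatcher:
    """Hypothetical batcher: a fixed number of slots plus a bounded wait queue."""

    def __init__(self, max_slots, max_queue):
        self.max_slots, self.max_queue = max_slots, max_queue
        self.active = set()
        self.queue = deque()

    def submit(self, request_id):
        if len(self.active) < self.max_slots:
            self.active.add(request_id)    # utilization goes up
        elif len(self.queue) < self.max_queue:
            self.queue.append(request_id)  # saturation goes up
        else:
            ADMISSION_ERRORS.inc()         # the E in USE
            return False
        self._publish()
        return True

    def finish(self, request_id):
        self.active.discard(request_id)
        if self.queue:
            self.active.add(self.queue.popleft())
        self._publish()

    def _publish(self):
        # Gauges are set from current state, so they can go up and down.
        ACTIVE_SLOTS.set(len(self.active))
        QUEUE_DEPTH.set(len(self.queue))

batcher = ToyBatcher(max_slots=2, max_queue=1)
results = [batcher.submit(r) for r in ("a", "b", "c", "d")]
print(results)  # [True, True, True, False]
```

Utilization and saturation are gauges because they describe current state; the error column is a counter because it only ever accumulates.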

USE for a multi-GPU box (Phase 35 forward reference)

Adds:

| Resource | U | S | E |
|---|---|---|---|
| GPU SM | DCGM_FI_DEV_GPU_UTIL | n/a (no queue concept) | DCGM_FI_DEV_XID_ERRORS |
| HBM bandwidth | DCGM_FI_DEV_MEM_COPY_UTIL | n/a | DCGM_FI_DEV_MEM_ERRORS |
| HBM capacity | DCGM_FI_DEV_FB_USED / FB_TOTAL | OOM events | OOM-killer events |
| NCCL | collective latency histogram | active-collective gauge | timeout counter |

NVIDIA DCGM exporter is the standard. We'll wire it in Phase 35; not in scope here.

LLM-specific metrics that don't fit RED or USE

Five metrics sit in their own category:

  1. tokens_total{kind="prompt|completion"} — counter. Cumulative tokens served. Used to compute throughput in tokens/sec and cost per 1k tokens.
  2. time_to_first_token_seconds — histogram. Tail of this is what "feels slow" to a user during streaming.
  3. inter_token_latency_seconds — histogram. Steady-state generation speed.
  4. generation_length_tokens — histogram. Right-skewed distribution; lets you spot the "user asked for a 10k-token essay" outliers.
  5. cost_per_request_usd — histogram. Covered in theory/02-cost-accounting.md.

These five plus RED plus USE give you the complete LLM dashboard. Six panels for RED, ~6 for USE, 5 for LLM-specifics. ~17 panels in one dashboard.
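For the token counter, throughput and a rough cost per 1k tokens fall out of two scrapes. The scrape values and the hourly box price below are made up, and attributing the whole box cost to completion tokens is a simplifying assumption; theory/02-cost-accounting.md is the place for the real accounting.

```python
def rate_per_second(earlier, later, window_s):
    """Per-second increase of a counter between two scrapes (ignores counter resets)."""
    return (later - earlier) / window_s

# Hypothetical scrapes of tokens_total{kind="completion"}, taken 60 s apart.
completion_tps = rate_per_second(1_200_000, 1_230_000, 60)

# Hypothetical hourly cost of the serving box, in USD.
box_usd_per_hour = 1.80
usd_per_second = box_usd_per_hour / 3600
usd_per_1k_tokens = usd_per_second / completion_tps * 1000

print(completion_tps)               # 500.0
print(round(usd_per_1k_tokens, 6))  # 0.001
```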

A worked example: what alerts should you write?

Three alerts that any LLM service should have on day one:

  1. High error rate. PromQL: (sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) > 0.01 for 5 minutes. Goal: catch a deployment that broke generation.
  2. Slow TTFT p95. histogram_quantile(0.95, sum by (le) (rate(time_to_first_token_seconds_bucket[5m]))) > 2 for 5 minutes. The sum by (le) aggregates buckets across instances and labels so the quantile is service-wide. Goal: catch prefill regressions.
  3. Cost explosion. histogram_quantile(0.95, sum by (le) (rate(cost_per_1k_completion_tokens_usd_bucket[15m]))) > 0.05 for 15 minutes. Goal: catch a runaway prompt pattern.

These three are the on-call alerts for an LLM service.
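To see what the quantile alerts actually evaluate, here is a small sketch of the linear interpolation that histogram_quantile performs within a bucket, as I understand its documented behaviour. The bucket counts are invented, and edge cases that PromQL handles (counter resets, missing buckets) are left out.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lands in +Inf: report the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# Hypothetical TTFT observations over the alert window.
ttft_buckets = [(0.5, 800), (1.0, 900), (2.0, 960), (5.0, 990), (math.inf, 1000)]
p95 = histogram_quantile(0.95, ttft_buckets)
print(round(p95, 3))  # 1.833
print(p95 > 2)        # False, so the TTFT alert would not fire
```

The estimate can never be more precise than the bucket it falls in, which is why bucket bounds should sit near the thresholds you alert on.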

Cardinality: the silent killer

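Every label combination is a separate time series. A single Prometheus server on commodity hardware is comfortable up to roughly a million active series; beyond that, memory use and query latency climb quickly, so treat ~1M as a budget rather than a hard limit.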

Labels with bounded cardinality (good):

  • endpoint (5–20 values)
  • method (4 values)
  • status (tens of values)
  • model_name (a handful)
  • tenant_id if you have a closed list

Labels that explode (bad):

  • user_id (unbounded)
  • prompt_hash (unbounded)
  • request_id (unbounded — this is what trace_id is for, not metrics)
  • prompt_first_100_chars (effectively unbounded)

Rule: if a label can take more than ~1000 distinct values across all time, it doesn't belong in a metric. It belongs in a log or a trace span attribute, where storage cost is linear in the number of events, not in the number of series.
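The arithmetic behind the rule is a plain product. The label cardinalities below are hypothetical picks from the ranges listed above, and the result is an upper bound: only label combinations that are actually observed become series.

```python
from math import prod

def series_count(label_cardinalities, series_per_combination=1):
    """Worst-case series for one metric: product of label cardinalities."""
    return prod(label_cardinalities.values()) * series_per_combination

bounded = {"endpoint": 20, "method": 4, "status": 10, "model_name": 5}
print(series_count(bounded))      # 4000 for a counter
print(series_count(bounded, 15))  # 60000 for a 13-bucket histogram (+ _sum, _count)

# Add one unbounded label and the budget is gone.
with_user = {**bounded, "user_id": 100_000}
print(series_count(with_user))    # 400000000
```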

The Prometheus client mental model

For Borja's lab, the prometheus_client library exposes four metric types:

  • Counter: monotonically increasing. .inc(). Use for rates and totals.
  • Gauge: arbitrary value. .set(), .inc(), .dec(). Use for things that go up and down (queue depth, slots used).
  • Histogram: pre-bucketed. .observe(value). Use for latency, cost, anything where you want percentiles.
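  • Summary: like a histogram, but with quantile estimation done client-side (the Python client's Summary exposes only _count and _sum, with no quantiles at all). Avoid: client-side quantiles cannot be aggregated across instances. Use histograms.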

For Phase 34 you use exactly three: Counter, Gauge, Histogram. Never Summary.
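Why "non-aggregable" matters can be shown without any library. In this sketch two hypothetical instances report latencies: averaging their per-instance p95 values gives the wrong answer, while histogram bucket counts can simply be added.

```python
import math

def p95(samples):
    """Nearest-rank 95th percentile of raw samples."""
    ordered = sorted(samples)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]

def cumulative_counts(samples, bounds):
    """Histogram-style cumulative counts: samples <= each upper bound."""
    return [sum(1 for s in samples if s <= b) for b in bounds]

instance_a = [0.1] * 1000  # busy, fast instance
instance_b = [5.0] * 100   # quiet, slow instance
bounds = (0.5, 10.0, math.inf)

# Summary-style: each instance reports its own p95, and we average them.
averaged = (p95(instance_a) + p95(instance_b)) / 2
# Ground truth: p95 over all requests from both instances.
true_p95 = p95(instance_a + instance_b)
print(averaged, true_p95)  # 2.55 5.0

# Histogram-style: per-instance bucket counts add up to the global histogram.
merged = [a + b for a, b in zip(cumulative_counts(instance_a, bounds),
                                cumulative_counts(instance_b, bounds))]
print(merged == cumulative_counts(instance_a + instance_b, bounds))  # True
```

Because bucket counts are plain counters, summing them across instances loses nothing, and the quantile can be estimated afterwards on the server side.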

One-paragraph recap

RED (rate, errors, duration) characterizes a service by what its users see; USE (utilization, saturation, errors) characterizes a resource by what's pushing back from below. Together they're complete; alone they're each blind to half the picture. LLM serving adds five LLM-specific metrics (token counters, TTFT, ITL, generation length, cost-per-request) that don't fit cleanly into either. Histogram buckets must be LLM-shaped (sub-second to two-minute, 13 buckets). Labels must have bounded cardinality — per-user dimensions belong in traces/logs, not metrics.


Next: theory/02-cost-accounting.md.