
01 — RED + USE metrics

RED measures the service (what users see). USE measures the hardware (what the service is up against). Together they are complete; on their own, each is blind. LLM serving also needs specific metrics, such as KV cache slots and batcher queue depth, that fit cleanly into neither.

There are two metric philosophies that dominate operational monitoring, and they were developed independently:

RED (Tom Wilkie, Weaveworks, ~2015) for services: rate, errors, duration.
USE (Brendan Gregg, ~2012, developed at Joyent before he joined Netflix) for resources: utilization, saturation, errors.

They are complementary. RED tells you what the user sees. USE tells you what's pushing back from the hardware. A mature observability story uses both. This page derives each, fits them to LLM serving, and lists the LLM-specific metrics that live on top.

RED

For each service (HTTP handler, gRPC method, internal queue), record three time series:

Letter	Metric	Units	Prometheus type
R	Rate	requests / second	counter (rated via `rate()`)
E	Errors	errors / second, broken out by status	counter
D	Duration	latency distribution	histogram

Why these three? Because together they characterize a service the way an SLO does: "99% of requests succeed in under T seconds."

Rate alone tells you load; without errors, you don't know if load broke the service.
Errors alone tell you brokenness; without rate, you can't tell whether the cause is bad input or a bad service.
Duration alone tells you slowness; without rate and errors, you can't tell whether a slow service where a few requests succeeded is better or worse than a fast one where every request failed.

RED for LLM serving

For the Phase 33 server, RED instantiates as:

Rate. http_requests_total{endpoint, method, status} — Prometheus counter. PromQL: rate(http_requests_total[1m]) gives RPS.
Errors. Already in the same counter: filter with the regex matcher status=~"5..". A separate counter, llm_generation_errors_total{reason}, covers application errors (e.g., prompt_too_long, kv_cache_full).
Duration. Two histograms:
request_duration_seconds — end-to-end (HTTP-in to HTTP-out).
time_to_first_token_seconds (TTFT) — the LLM-specific one. Distinguishes "user is staring at a blank screen" from "user is reading the response".
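
The three RED series above could be wired roughly as follows. This is a minimal sketch, not the Phase 33 server's actual code: the `handle` wrapper, the `generate` callable, and the `/v1/generate` endpoint are hypothetical, and a private registry is used only to keep the example isolated.

```python
import time

from prometheus_client import CollectorRegistry, Counter, Histogram

# A private registry keeps this sketch from colliding with metrics
# registered elsewhere in the process.
registry = CollectorRegistry()

LLM_BUCKETS = (0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 20, 30, 60, 120, float("inf"))

HTTP_REQUESTS = Counter(
    "http_requests_total", "HTTP requests by endpoint, method and status",
    ["endpoint", "method", "status"], registry=registry,
)
REQUEST_DURATION = Histogram(
    "request_duration_seconds", "End-to-end latency, HTTP-in to HTTP-out",
    buckets=LLM_BUCKETS, registry=registry,
)
TTFT = Histogram(
    "time_to_first_token_seconds", "Time from request start to first token",
    buckets=LLM_BUCKETS, registry=registry,
)


def handle(endpoint, method, generate):
    """Drain generate() (a hypothetical token iterator) and record RED metrics."""
    start = time.perf_counter()
    status = "200"
    tokens = []
    try:
        for token in generate():
            if not tokens:  # first token: the blank-screen wait is over
                TTFT.observe(time.perf_counter() - start)
            tokens.append(token)
    except Exception:
        status = "500"  # picked up by the status=~"5.." error filter
    finally:
        REQUEST_DURATION.observe(time.perf_counter() - start)
        HTTP_REQUESTS.labels(endpoint=endpoint, method=method, status=status).inc()
    return status, tokens


status, tokens = handle("/v1/generate", "POST", lambda: iter(["Hello", " world"]))
```

Rate and errors share one counter, so a single increment in the `finally` branch covers both; TTFT is observed only when a first token actually arrives, so failed requests do not pollute it.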

Bucket choice matters. For LLMs:

buckets = (0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 20, 30, 60, 120, float("inf"))

13 buckets (12 finite bounds plus +Inf) covering ~3.4 decades, from 50 ms to 2 minutes. Don't add more: each bucket is its own time series per label combination, and cardinality is your enemy.
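
The bucket arithmetic can be checked directly. The sketch below counts the bounds, measures the span of the finite ones in decades, and tallies the series a single histogram contributes per label combination (client libraries may expose a few more, for example a `_created` timestamp).

```python
import math

buckets = (0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 20, 30, 60, 120, float("inf"))

finite = [b for b in buckets if math.isfinite(b)]
decades = math.log10(finite[-1] / finite[0])  # span of the finite bounds

# One _bucket series per bound, plus _sum and _count.
series_per_label_set = len(buckets) + 2

print(f"{len(buckets)} buckets, {decades:.2f} decades, "
      f"{series_per_label_set} series per label combination")
# prints: 13 buckets, 3.38 decades, 15 series per label combination
```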

USE

For each resource (CPU, RAM, disk, network NIC, GPU, queue), record three time series:

Letter	Metric	Question
U	Utilization	what fraction of the resource is busy?
S	Saturation	how much extra work is queued waiting for the resource?
E	Errors	what error events did the resource emit?

The USE method's clarity comes from a simple flowchart: for every resource, ask the three questions. If any answer is concerning, drill there first.

USE for an LLM serving box (CPU-only, Phase 34)

Resource	U	S	E
CPU	`node_cpu_seconds_total{mode=~"user|system"}` ratioed	`node_load1`	`node_cpu_seconds_total{mode="iowait"}` (proxy)
RAM	`1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes`	`node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes`	`node_vmstat_oom_kill`
Disk	`node_disk_io_time_seconds_total`	`node_disk_io_now`	`node_disk_io_errors_total` (if exposed)
Batcher queue	`llm_batcher_active_slots`	`llm_batcher_queue_depth`	`llm_batcher_admission_errors_total`
KV cache	`llm_kv_cache_slots_used`	`llm_kv_cache_evictions_total`	n/a (allocation failures = batcher admission error)

Most of the OS-level rows are provided "for free" by node_exporter. The LLM-specific rows — batcher queue, KV cache — you wire by hand in src/observability/metrics.py.
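
A hand-wired version of the batcher row might look like the sketch below. The `Batcher` class is a toy stand-in invented for illustration, not the lab's real batcher, and the private registry is only there to keep the example isolated; the metric names are the ones from the table.

```python
from collections import deque

from prometheus_client import CollectorRegistry, Counter, Gauge

registry = CollectorRegistry()

ACTIVE_SLOTS = Gauge(
    "llm_batcher_active_slots", "Slots currently generating (U)", registry=registry)
QUEUE_DEPTH = Gauge(
    "llm_batcher_queue_depth", "Requests waiting for a slot (S)", registry=registry)
ADMISSION_ERRORS = Counter(
    "llm_batcher_admission_errors_total", "Requests rejected at admission (E)",
    registry=registry)


class Batcher:
    """Toy batcher (hypothetical): fixed slots, bounded wait queue."""

    def __init__(self, max_slots, max_queue):
        self.max_slots = max_slots
        self.max_queue = max_queue
        self.active = 0
        self.queue = deque()

    def submit(self, request_id):
        if self.active < self.max_slots:
            self.active += 1                 # a free slot: start immediately
        elif len(self.queue) < self.max_queue:
            self.queue.append(request_id)    # saturated: wait in the queue
        else:
            ADMISSION_ERRORS.inc()           # queue full: reject
            return False
        self._publish()
        return True

    def finish(self):
        if self.queue:
            self.queue.popleft()             # a queued request takes the freed slot
        elif self.active > 0:
            self.active -= 1
        self._publish()

    def _publish(self):
        ACTIVE_SLOTS.set(self.active)
        QUEUE_DEPTH.set(len(self.queue))


batcher = Batcher(max_slots=2, max_queue=1)
results = [batcher.submit(rid) for rid in ("a", "b", "c", "d")]
```

With two slots and a queue of one, the fourth request is rejected: utilization is pinned at 2, saturation at 1, and the error counter ticks once.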

USE for a multi-GPU box (Phase 35 forward reference)

Adds:

Resource	U	S	E
GPU SM	`DCGM_FI_DEV_GPU_UTIL`	n/a (no queue concept)	`DCGM_FI_DEV_XID_ERRORS`
HBM bandwidth	`DCGM_FI_DEV_MEM_COPY_UTIL`	n/a	`DCGM_FI_DEV_ECC_DBE_VOL_TOTAL` (ECC double-bit errors)
HBM capacity	`DCGM_FI_DEV_FB_USED / FB_TOTAL`	OOM events	OOM-killer events
NCCL collective	latency histogram	active-collective gauge	timeout counter

NVIDIA DCGM exporter is the standard. We'll wire it in Phase 35; not in scope here.

LLM-specific metrics that don't fit RED or USE

Five metrics sit in their own category:

tokens_total{kind="prompt|completion"} — counter. Cumulative tokens served. Used to compute throughput in tokens/sec and cost per 1k tokens.
time_to_first_token_seconds — histogram. Tail of this is what "feels slow" to a user during streaming.
inter_token_latency_seconds — histogram. Steady-state generation speed.
generation_length_tokens — histogram. Right-skewed distribution; lets you spot the "user asked for a 10k-token essay" outliers.
cost_per_request_usd — histogram. Covered in theory/02-cost-accounting.md.

These five plus RED plus USE give you the complete LLM dashboard. Six panels for RED, ~6 for USE, 5 for LLM-specifics. ~17 panels in one dashboard.
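
To make the streaming metrics concrete, the sketch below derives TTFT, inter-token latencies, generation length, and per-request throughput from the timestamps at which tokens were emitted. The `streaming_stats` helper and its input shape are assumptions for illustration; in a real server each value would be fed to the matching histogram or counter.

```python
def streaming_stats(request_start, token_times):
    """Derive per-request streaming metrics from token emission timestamps.

    request_start and token_times are seconds on the same monotonic clock.
    """
    if not token_times:
        raise ValueError("no tokens were generated")
    ttft = token_times[0] - request_start
    # Gap between each pair of consecutive tokens.
    itl = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    elapsed = token_times[-1] - request_start
    return {
        "time_to_first_token_seconds": ttft,
        "inter_token_latency_seconds": itl,
        "generation_length_tokens": len(token_times),
        "tokens_per_second": len(token_times) / elapsed if elapsed > 0 else float("inf"),
    }


# Four tokens: a 0.5 s wait for the first, then one every 0.1 s.
stats = streaming_stats(0.0, [0.5, 0.6, 0.7, 0.8])
```

Here TTFT dominates the request (0.5 s of the 0.8 s total), which is exactly the case where the end-to-end duration histogram alone would hide what the user experienced.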

A worked example: what alerts should you write?

Three alerts that any LLM service should have on day one:

High error rate. PromQL: (sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) > 0.01 for 5 minutes. Goal: catch a deployment that broke generation.
Slow TTFT p95. histogram_quantile(0.95, sum by (le) (rate(time_to_first_token_seconds_bucket[5m]))) > 2 for 5 minutes. Goal: catch prefill regressions.
Cost explosion. histogram_quantile(0.95, sum by (le) (rate(cost_per_request_usd_bucket[15m]))) > 0.05 for 15 minutes. Goal: catch a runaway prompt pattern.

These three are the on-call alerts for an LLM service.
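
It helps to know what `histogram_quantile` does with the buckets before trusting a threshold. The sketch below reimplements the classic-histogram estimate in plain Python under simplifying assumptions: cumulative counts, a lowest bucket that starts at 0, and linear interpolation inside the bucket that holds the target rank. The bucket counts are made up for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in +Inf: report the highest finite bound.
                return buckets[-2][0]
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return float("nan")


# Hypothetical TTFT counts over the alert window.
ttft_buckets = [(0.5, 50), (1, 80), (2, 94), (5, 99), (float("inf"), 100)]
p95 = histogram_quantile(0.95, ttft_buckets)
alert_fires = p95 > 2
```

With these counts the estimate lands at about 2.6 s, so the TTFT alert would fire. The true p95 could be anywhere between 2 s and 5 s; the bucket bounds nearest an alert threshold set the resolution of the alert.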

Cardinality: the silent killer

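Every label combination is a separate time series. As a rule of thumb, a single Prometheus server on commodity hardware handles on the order of a million active series; well beyond that, memory use and query latency climb until the server itself becomes the incident.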

Labels with bounded cardinality (good):

endpoint (5–20 values)
method (4 values)
status (10s of values)
model_name (a handful)
tenant_id if you have a closed list

Labels that explode (bad):

user_id (unbounded)
prompt_hash (unbounded)
request_id (unbounded — this is what trace_id is for, not metrics)
prompt_first_100_chars (effectively unbounded)

Rule: if a label can take more than ~1000 distinct values across all time, it doesn't belong in a metric. It belongs in a log or a trace span attribute, where storage cost is linear in events not in series-count.
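
The rule follows from simple multiplication. The sketch below computes a worst-case series count as the product of label cardinalities; the specific cardinalities (10 endpoints, 100,000 users, and so on) are illustrative assumptions, and real counts are lower because not every combination occurs.

```python
import math


def max_series(label_cardinalities, series_per_label_set=1):
    """Upper bound on active series: product of label cardinalities."""
    return math.prod(label_cardinalities.values()) * series_per_label_set


bounded = {"endpoint": 10, "method": 4, "status": 10, "model_name": 5}

counter_series = max_series(bounded)                             # one counter
histogram_series = max_series(bounded, series_per_label_set=15)  # 13 buckets + sum + count

# Adding a single unbounded label multiplies everything.
with_user_id = max_series({**bounded, "user_id": 100_000})

print(counter_series, histogram_series, with_user_id)
# prints: 2000 30000 200000000
```

One `user_id` label turns a 2,000-series counter into a potential 200 million series, far past what a single server holds.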

The Prometheus client mental model

For Borja's lab, the prometheus_client library exposes four metric types:

Counter: monotonically increasing. .inc(). Use for rates and totals.
Gauge: arbitrary value. .set(), .inc(), .dec(). Use for things that go up and down (queue depth, slots used).
Histogram: pre-bucketed. .observe(value). Use for latency, cost, anything where you want percentiles.
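Summary: like a histogram, but quantiles are estimated client-side, so they cannot be aggregated across instances. (The Python prometheus_client Summary exposes only a count and a sum, with no quantiles at all.) Avoid. Use histograms.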

For Phase 34 you use exactly three: Counter, Gauge, Histogram. Never Summary.
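
The "non-aggregable" point is easy to demonstrate with two instances carrying very different traffic. The sketch below uses invented latencies and a simple nearest-rank percentile; it shows that averaging per-instance p95 values gives a number unrelated to the fleet-wide p95, whereas histogram bucket counts can simply be added.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile of a list of observations."""
    ordered = sorted(values)
    index = max(0, math.ceil(q * len(ordered)) - 1)
    return ordered[index]


# Instance A serves most of the traffic quickly; B serves a few slow requests.
instance_a = [0.1] * 1000
instance_b = [5.0] * 10

# What you get by averaging client-side quantiles across instances.
averaged_p95 = (percentile(instance_a, 0.95) + percentile(instance_b, 0.95)) / 2

# What users actually experienced across the fleet.
true_p95 = percentile(instance_a + instance_b, 0.95)


def cumulative_counts(values, bounds):
    """Cumulative bucket counts, as a histogram would expose them."""
    return [sum(1 for v in values if v <= bound) for bound in bounds]


bounds = [0.5, 10, float("inf")]
merged = [a + b for a, b in zip(cumulative_counts(instance_a, bounds),
                                cumulative_counts(instance_b, bounds))]
```

The averaged figure is 2.55 s while the true p95 is 0.1 s. The merged bucket counts (`[1000, 1010, 1010]`) still place the 95th percentile in the first bucket, which is why summing `_bucket` series across instances before calling `histogram_quantile` gives a sensible answer.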

One-paragraph recap

RED (rate, errors, duration) characterizes a service by what its users see; USE (utilization, saturation, errors) characterizes a resource by what's pushing back from below. Together they're complete; alone they're each blind to half the picture. LLM serving adds five LLM-specific metrics (token counters, TTFT, ITL, generation length, cost-per-request) that don't fit cleanly into either. Histogram buckets must be LLM-shaped (sub-second to two-minute, 13 buckets). Labels must have bounded cardinality — per-user dimensions belong in traces/logs, not metrics.

Next: theory/02-cost-accounting.md.