Lab 01 — Instrument the Phase 33 server with RED + USE + LLM metrics
Goal: the six core Prometheus metrics, wired into the Phase 33 server, scraped successfully.
Estimated time: 2-3 hours.
Prereq: Lab 00 stack up; Phase 33 server runnable locally.
What you produce
- `src/observability/__init__.py`
- `src/observability/metrics.py`: the metric definitions + a small middleware that records them.
- Phase 33 server modified to register the metrics middleware and expose `/metrics`.
- A Grafana dashboard "draft" (any layout; refine in lab 03) with at least one panel per metric.
The six metrics
Implement exactly these. No fewer, no more — additional metrics belong in lab 02 (LLM-specific, beyond the core six) or lab 03 (cost).
| Name | Type | Labels | What |
|---|---|---|---|
| `request_total` | Counter | `endpoint`, `method`, `status` | RED: rate + errors |
| `request_duration_seconds` | Histogram (LLM buckets) | `endpoint`, `method` | RED: duration |
| `tokens_total` | Counter | `kind` ∈ {`prompt`, `completion`}, `model_name` | LLM: throughput input |
| `time_to_first_token_seconds` | Histogram (TTFT buckets) | `model_name` | LLM: streaming UX |
| `kv_cache_slots_used` | Gauge | (no labels) | USE: KV saturation |
| `queue_depth` | Gauge | (no labels) | USE: batcher saturation |
Buckets:
LLM_LATENCY_BUCKETS = (0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 20, 30, 60, 120, float("inf"))
TTFT_BUCKETS = (0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, float("inf"))
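To make the bucket lists concrete, here is a minimal sketch in plain Python (no `prometheus_client` involved) of how observations land in them. Prometheus histogram buckets are cumulative: each `_bucket` series counts every observation less than or equal to its upper bound `le`. The helper name `cumulative_counts` is hypothetical and exists only for this illustration.

```python
from bisect import bisect_left

LLM_LATENCY_BUCKETS = (0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 20, 30, 60, 120, float("inf"))


def cumulative_counts(observations, buckets=LLM_LATENCY_BUCKETS):
    """Return {upper_bound: count} the way `_bucket` series are exposed."""
    per_bucket = [0] * len(buckets)
    for value in observations:
        # Index of the first bucket whose upper bound is >= value (le is inclusive).
        per_bucket[bisect_left(buckets, value)] += 1
    counts, running = {}, 0
    for bound, n in zip(buckets, per_bucket):
        running += n  # cumulative: each bucket includes all smaller buckets
        counts[bound] = running
    return counts


counts = cumulative_counts([0.3, 0.5, 45.0, 300.0])
print(counts[0.25], counts[0.5], counts[60], counts[float("inf")])  # prints: 0 2 3 4
```

The 300-second observation only shows up in the `+Inf` bucket, which is why the lists end with `float("inf")`: anything slower than the last finite bound is still counted.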
TODOs
Block A — src/observability/metrics.py
- Import `prometheus_client` (already in `pyproject.toml`'s `serve` group).
- Define the six metrics as module-level globals, using the labels and bucket lists above.
- Export a `metrics_app` (a `prometheus_client.make_asgi_app()`) so the Phase 33 FastAPI app can mount it at `/metrics`.
- Write a `record_request(endpoint, method, status, duration_s)` helper that does the two RED observations (counter + histogram).
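The Block A checklist could be sketched roughly as follows. This is one possible shape, not the reference solution: the help strings are placeholder wording, while the metric names, labels, and bucket tuples are taken from the table and Buckets section above. It assumes `prometheus_client` is installed.

```python
from prometheus_client import Counter, Gauge, Histogram, make_asgi_app

LLM_LATENCY_BUCKETS = (0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 20, 30, 60, 120, float("inf"))
TTFT_BUCKETS = (0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, float("inf"))

# Module-level globals, all registered on the default prometheus_client.REGISTRY.
request_total = Counter(
    "request_total", "HTTP requests handled.", ["endpoint", "method", "status"]
)
request_duration_seconds = Histogram(
    "request_duration_seconds", "End-to-end request latency.",
    ["endpoint", "method"], buckets=LLM_LATENCY_BUCKETS,
)
tokens_total = Counter(
    "tokens_total", "Tokens processed.", ["kind", "model_name"]
)
time_to_first_token_seconds = Histogram(
    "time_to_first_token_seconds", "Time from request start to first token.",
    ["model_name"], buckets=TTFT_BUCKETS,
)
kv_cache_slots_used = Gauge("kv_cache_slots_used", "KV cache slots in use.")
queue_depth = Gauge("queue_depth", "Requests waiting in the batcher queue.")

# ASGI sub-app that renders the default registry in Prometheus exposition format.
metrics_app = make_asgi_app()


def record_request(endpoint, method, status, duration_s):
    """Record the two RED observations for one finished request."""
    request_total.labels(endpoint=endpoint, method=method, status=status).inc()
    request_duration_seconds.labels(endpoint=endpoint, method=method).observe(duration_s)
```

Calling `record_request("/v1/completions", "POST", "200", 0.3)` increments one counter series and adds one observation to the matching histogram series.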
Block B — wire into the Phase 33 server
- In `src/miniserve/app.py`, add the metrics middleware:
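    import time

    # Assumed import path; adjust to your package layout.
    from observability.metrics import record_request

    @app.middleware("http")
    async def observe(request, call_next):
        t0 = time.perf_counter()
        status = 500  # reported if call_next raises before a response exists
        try:
            response = await call_next(request)
            status = response.status_code
            return response
        finally:
            # Runs on success and on error, so each request is recorded exactly once.
            # request.url.path is the raw path; see Pitfalls on label cardinality.
            record_request(request.url.path, request.method, str(status), time.perf_counter() - t0)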
- Mount the metrics endpoint: `app.mount("/metrics", metrics_app)`.
- In the batcher (Phase 33's `src/miniserve/batcher.py`), expose `kv_cache_slots_used` and `queue_depth` via the gauges' `.set_function(...)` (pull-on-scrape) or `.set(...)` after each state change (push-on-event). Pull-on-scrape is simpler; use it.
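A minimal sketch of the pull-on-scrape option, assuming `prometheus_client` is installed. The `Batcher` class and the `bind_gauges` helper are hypothetical stand-ins, since the real Phase 33 batcher's attributes are not shown here. The gauges are redefined only to keep the snippet self-contained; in the lab they would be imported from `src/observability/metrics.py`.

```python
from collections import deque

from prometheus_client import REGISTRY, Gauge

kv_cache_slots_used = Gauge("kv_cache_slots_used", "KV cache slots in use.")
queue_depth = Gauge("queue_depth", "Requests waiting in the batcher queue.")


class Batcher:
    """Hypothetical stand-in: only the state the gauges read is modeled."""

    def __init__(self):
        self.queue = deque()
        self.slots_in_use = set()


def bind_gauges(batcher):
    # The callables run on every scrape, so keep them cheap and non-blocking.
    kv_cache_slots_used.set_function(lambda: len(batcher.slots_in_use))
    queue_depth.set_function(lambda: len(batcher.queue))


batcher = Batcher()
bind_gauges(batcher)

# No .set() calls are needed when state changes; the next scrape reads it.
batcher.queue.extend(["req-1", "req-2"])
batcher.slots_in_use.add(0)
print(REGISTRY.get_sample_value("queue_depth"))  # prints: 2.0
```

The trade-off is that the gauge value is only as fresh as the scrape, which is fine for saturation signals sampled every few seconds.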
Block C — observe tokens
- In the prompt-tokenization step, after counting input tokens: `tokens_total.labels(kind="prompt", model_name=model).inc(n_tokens)`.
- In the decode loop, on every emitted token: `tokens_total.labels(kind="completion", model_name=model).inc(1)`.
- On the first emitted token of a request, record TTFT: `time_to_first_token_seconds.labels(model_name=model).observe(time.perf_counter() - request_start)`.
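The three observations above can be combined in one wrapper around the token stream. This is a sketch under assumptions: the generator `observe_generation` and its parameters are hypothetical, the real decode loop in Phase 33 may be structured differently, and the metrics are redefined here only so the snippet runs on its own.

```python
import time

from prometheus_client import REGISTRY, Counter, Histogram

TTFT_BUCKETS = (0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, float("inf"))
tokens_total = Counter("tokens_total", "Tokens processed.", ["kind", "model_name"])
time_to_first_token_seconds = Histogram(
    "time_to_first_token_seconds", "Time from request start to first token.",
    ["model_name"], buckets=TTFT_BUCKETS,
)


def observe_generation(prompt_token_ids, token_stream, model, request_start):
    """Yield tokens unchanged while recording prompt/completion counts and TTFT."""
    tokens_total.labels(kind="prompt", model_name=model).inc(len(prompt_token_ids))
    completion = tokens_total.labels(kind="completion", model_name=model)
    first = True
    for token in token_stream:
        if first:
            # TTFT is observed once per request, on the first emitted token.
            ttft = time.perf_counter() - request_start
            time_to_first_token_seconds.labels(model_name=model).observe(ttft)
            first = False
        completion.inc()
        yield token


start = time.perf_counter()
out = list(observe_generation([1, 2, 3], iter(["Hel", "lo", "!"]), "demo-model", start))
```

Resolving `.labels(...)` once outside the loop avoids a label lookup per token; the per-token cost is then a single counter increment.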
Block D — verify in Prometheus
- Start the Phase 33 server.
- `curl http://localhost:8000/metrics` should return ~30 lines of Prometheus exposition.
- In Prometheus UI: query `request_total`; it should be > 0 after a few `curl /v1/completions` calls.
- Query `histogram_quantile(0.95, sum by(le) (rate(request_duration_seconds_bucket[1m])))`; it should return a number.
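To build intuition for what the last query returns, here is a simplified plain-Python sketch of the estimate `histogram_quantile` makes for classic histograms: find the bucket containing the target rank, then interpolate linearly inside it. It assumes cumulative counts sorted by upper bound, at least one finite bucket, and a lower bound of 0 for the first bucket. The sample counts are made up for the example.

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) pairs.

    Simplified sketch; PromQL handles more edge cases than this.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if bound == float("inf"):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[i - 1][0]
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count


# Made-up cumulative counts: 100 requests, 90 of them finished within 1 second.
sample = [(0.25, 10), (0.5, 60), (1, 90), (2, 100), (float("inf"), 100)]
print(histogram_quantile(0.95, sample))  # prints: 1.5
```

The estimate can never be more precise than the bucket boundaries, which is why the bucket lists are chosen to be dense around the latencies that matter.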
Block E — Grafana dashboard skeleton
- Create a new dashboard in Grafana titled "lynx-cortex LLM serving".
- Add six panels, one per metric. Any layout. Lab 03 polishes.
- Save dashboard. Export JSON. Commit to `infra/grafana/dashboards/llm.json`.
Constraints
- One global registry. `prometheus_client.REGISTRY` (the default). Do not create a custom `CollectorRegistry`; multi-registry is for libraries, not application code.
- No per-request labels in metrics. No `user_id`, no `prompt_hash`. Cardinality rule from theory file 01.
- Counters are monotonic. If you find yourself wanting `counter.set(0)`, you want a gauge.
- No Summaries. Histograms only.
Stop conditions
Done when:
- `src/observability/metrics.py` exists and exports the six metrics + the `record_request` helper + the `metrics_app`.
- `/metrics` endpoint on the running Phase 33 server returns valid Prometheus exposition.
- Prometheus targets page shows `miniserve` as UP.
- PromQL `request_total{status="200"} > 0` returns a non-empty result after a single load test (`for i in $(seq 1 20); do curl ...; done`).
- Grafana dashboard saved + exported + committed.
Pitfalls (read before debugging)
- `/metrics` 404. Most likely the mount happened after the FastAPI app started serving, or you used `app.add_route` instead of `app.mount`. Mount the ASGI sub-app before `uvicorn.run`.
- Histogram observations not appearing. Check the metric name in PromQL: Prometheus auto-adds `_bucket`, `_sum`, `_count` suffixes. The base name (`request_duration_seconds`) without a suffix returns nothing; `request_duration_seconds_count` returns the observation count.
- High cardinality warning from Prometheus. Logs say "scrape took N seconds" or "discarding sample with reset value". Check label cardinality. Most common cause: forgetting that FastAPI's `request.url.path` is the concrete URL with path parameters filled in. For `/v1/users/42`, you get one series per user id. Either strip path parameters or use the route's pattern (`/v1/users/{id}`).
- Gauge not updating. If using `.set_function()`, the function is called on every scrape. If it raises, the error surfaces at scrape time rather than where the state changed, and the gauge (or the whole scrape) goes missing. Wrap in `try/except` and log.
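Two of the pitfalls above have small stdlib-only mitigations, sketched here. Both helpers (`normalize_path` and `safe_callback`) are hypothetical names introduced for illustration. The regex approach is a fallback for when the route pattern is not available, and reporting NaN on failure is one possible choice, not a requirement of the lab.

```python
import logging
import math
import re

logger = logging.getLogger("observability")

_UUID_RE = re.compile(r"[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")


def normalize_path(path):
    """Replace id-like path segments with a placeholder to bound label cardinality."""
    parts = [
        "{id}" if segment.isdigit() or _UUID_RE.fullmatch(segment) else segment
        for segment in path.split("/")
    ]
    return "/".join(parts)


def safe_callback(fn, fallback=math.nan):
    """Wrap a gauge callback so a failure is logged instead of raised during a scrape."""
    def wrapper():
        try:
            return float(fn())
        except Exception:
            logger.exception("gauge callback failed; reporting fallback value")
            return fallback
    return wrapper


print(normalize_path("/v1/users/42"))     # prints: /v1/users/{id}
print(normalize_path("/v1/completions"))  # prints: /v1/completions
```

In the middleware, `normalize_path(request.url.path)` would be passed as the `endpoint` label; for the gauges, the wrapped callable would be handed to `.set_function(...)` in place of the raw one.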
When to consult solutions/
After all five Stop conditions pass. Solution at solutions/01-instrument-server-ref.md.
Next lab: lab/02-tracing-end-to-end.md.