Lab 01 — Instrument the Phase 33 server with RED + USE + LLM metrics

Goal: the six core Prometheus metrics, wired into the Phase 33 server, scraped successfully.

Estimated time: 2-3 hours.

Prereq: Lab 00 stack up; Phase 33 server runnable locally.


What you produce

  • src/observability/__init__.py
  • src/observability/metrics.py — the metric definitions + a small middleware that records them.
  • Phase 33 server modified to register the metrics middleware and expose /metrics.
  • A Grafana dashboard "draft" (any layout — refine in lab 03) with at least one panel per metric.

The six metrics

Implement exactly these. No fewer, no more — additional metrics belong in lab 02 (LLM-specific, beyond the core six) or lab 03 (cost).

| Name | Type | Labels | What |
|---|---|---|---|
| request_total | Counter | endpoint, method, status | RED: rate + errors |
| request_duration_seconds | Histogram (LLM buckets) | endpoint, method | RED: duration |
| tokens_total | Counter | kind{prompt, completion}, model_name | LLM: throughput input |
| time_to_first_token_seconds | Histogram (TTFT buckets) | model_name | LLM: streaming UX |
| kv_cache_slots_used | Gauge | (no labels) | USE: KV saturation |
| queue_depth | Gauge | (no labels) | USE: batcher saturation |

Buckets:

LLM_LATENCY_BUCKETS = (0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 20, 30, 60, 120, float("inf"))
TTFT_BUCKETS = (0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, float("inf"))

TODOs

Block A — src/observability/metrics.py

  • Import prometheus_client (already in pyproject.toml's serve group).
  • Define the six metrics as module-level globals, using the labels and bucket lists above.
  • Export a metrics_app (a prometheus_client.make_asgi_app()) so the Phase 33 FastAPI app can mount it at /metrics.
  • Write a record_request(endpoint, method, status, duration_s) helper that does the two RED observations (counter + histogram).
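The Block A items above could be sketched roughly as follows. This is one possible layout, not the reference solution: the help strings are illustrative, and the only library calls used are prometheus_client's Counter, Gauge, Histogram, and make_asgi_app. All metrics register on the default registry, as the Constraints section requires.

```python
"""Sketch of src/observability/metrics.py (help strings are illustrative)."""
from prometheus_client import Counter, Gauge, Histogram, make_asgi_app

LLM_LATENCY_BUCKETS = (0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 20, 30, 60, 120, float("inf"))
TTFT_BUCKETS = (0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, float("inf"))

# Every metric below registers on the default prometheus_client.REGISTRY.
request_total = Counter(
    "request_total", "HTTP requests handled", ["endpoint", "method", "status"]
)
request_duration_seconds = Histogram(
    "request_duration_seconds", "End-to-end request latency in seconds",
    ["endpoint", "method"], buckets=LLM_LATENCY_BUCKETS,
)
tokens_total = Counter("tokens_total", "Tokens processed", ["kind", "model_name"])
time_to_first_token_seconds = Histogram(
    "time_to_first_token_seconds", "Seconds until the first streamed token",
    ["model_name"], buckets=TTFT_BUCKETS,
)
kv_cache_slots_used = Gauge("kv_cache_slots_used", "KV cache slots in use")
queue_depth = Gauge("queue_depth", "Requests waiting in the batcher queue")

# ASGI sub-app serving the default registry; the server mounts it at /metrics.
metrics_app = make_asgi_app()


def record_request(endpoint: str, method: str, status: str, duration_s: float) -> None:
    """Record the two RED observations for one finished request."""
    request_total.labels(endpoint=endpoint, method=method, status=status).inc()
    request_duration_seconds.labels(endpoint=endpoint, method=method).observe(duration_s)
```

Calling record_request("/v1/completions", "POST", "200", 0.3) increments one counter series and adds one observation to every latency bucket whose upper bound is at least 0.3.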

Block B — wire into the Phase 33 server

  • In src/miniserve/app.py, add the metrics middleware:
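import time

# Import path assumed from the src/ layout; adjust to your package structure.
from observability.metrics import metrics_app, record_request

@app.middleware("http")
async def observe(request, call_next):
    t0 = time.perf_counter()
    status = 500  # reported if call_next raises before a response exists
    try:
        response = await call_next(request)
        status = response.status_code
        return response
    finally:
        # Runs on success and on error, so each request is recorded exactly once.
        record_request(request.url.path, request.method, str(status), time.perf_counter() - t0)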
  • Mount the metrics endpoint:
app.mount("/metrics", metrics_app)
  • In the batcher (Phase 33's src/miniserve/batcher.py), expose kv_cache_slots_used and queue_depth via the gauges' .set_function(...) (pull-on-scrape) or .set(...) after each state change (push-on-event). Pull-on-scrape is simpler — use it.
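A minimal sketch of the pull-on-scrape option, under a few assumptions: _StubBatcher and its attributes pending and slots_in_use are hypothetical stand-ins for whatever state the Phase 33 batcher actually holds, and the throwaway CollectorRegistry exists only so the snippet runs on its own (in the lab, import the two gauges from src/observability/metrics.py instead). The safe_reader wrapper is also illustrative; it logs a failing callback and reports NaN rather than letting the exception escape into the scrape.

```python
import logging
from typing import Callable

from prometheus_client import CollectorRegistry, Gauge

log = logging.getLogger("observability")


def safe_reader(read: Callable[[], float], name: str) -> Callable[[], float]:
    """Wrap a gauge callback so a failure is logged and reported as NaN."""
    def _read() -> float:
        try:
            return float(read())
        except Exception:
            log.exception("gauge callback %s failed", name)
            return float("nan")
    return _read


class _StubBatcher:
    """Hypothetical stand-in for the Phase 33 batcher state."""
    def __init__(self) -> None:
        self.pending: list = []
        self.slots_in_use = 0


# Throwaway registry so this snippet is standalone; the lab uses the default one.
registry = CollectorRegistry()
queue_depth = Gauge("queue_depth", "Requests waiting in the batcher queue", registry=registry)
kv_cache_slots_used = Gauge("kv_cache_slots_used", "KV cache slots in use", registry=registry)

batcher = _StubBatcher()
# The callbacks run at scrape time, so the gauges always reflect current state.
queue_depth.set_function(safe_reader(lambda: len(batcher.pending), "queue_depth"))
kv_cache_slots_used.set_function(
    safe_reader(lambda: batcher.slots_in_use, "kv_cache_slots_used")
)
```

Because the callbacks are evaluated on each scrape, no code in the batcher's hot path needs to touch the gauges.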

Block C — observe tokens

  • In the prompt-tokenization step, after counting input tokens: tokens_total.labels(kind="prompt", model_name=model).inc(n_tokens).
  • In the decode loop, on every emitted token: tokens_total.labels(kind="completion", model_name=model).inc(1).
  • On the first emitted token of a request, record TTFT: time_to_first_token_seconds.labels(model_name=model).observe(time.perf_counter() - request_start).
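The three observations above can be combined in a small generator that wraps the decode loop. This is a sketch under assumptions: observe_stream and its parameters are hypothetical names, the token stream is any iterable of decoded tokens, and the throwaway CollectorRegistry is there only to keep the snippet standalone (the lab imports the metrics from src/observability/metrics.py).

```python
import time
from typing import Iterable, Iterator

from prometheus_client import CollectorRegistry, Counter, Histogram

# Throwaway registry so this snippet is standalone; the lab uses the default one.
registry = CollectorRegistry()
tokens_total = Counter(
    "tokens_total", "Tokens processed", ["kind", "model_name"], registry=registry
)
time_to_first_token_seconds = Histogram(
    "time_to_first_token_seconds", "Seconds until the first streamed token",
    ["model_name"],
    buckets=(0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, float("inf")),
    registry=registry,
)


def observe_stream(
    token_stream: Iterable[str], model: str, n_prompt_tokens: int, request_start: float
) -> Iterator[str]:
    """Yield tokens unchanged while recording prompt, completion, and TTFT metrics."""
    # Prompt tokens are counted once, when the consumer starts iterating.
    tokens_total.labels(kind="prompt", model_name=model).inc(n_prompt_tokens)
    completion = tokens_total.labels(kind="completion", model_name=model)
    first = True
    for token in token_stream:
        if first:
            # TTFT is measured from request start to the first emitted token.
            time_to_first_token_seconds.labels(model_name=model).observe(
                time.perf_counter() - request_start
            )
            first = False
        completion.inc()
        yield token
```

Resolving the completion child once outside the loop avoids a label lookup on every token, which matters in a hot decode loop.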

Block D — verify in Prometheus

  • Start the Phase 33 server.
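  • curl -L http://localhost:8000/metrics should return Prometheus text exposition: # HELP and # TYPE lines followed by samples. Expect a few dozen lines or more once traffic has created labelled series. (-L follows the redirect to /metrics/ that a mounted sub-app may issue.)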
  • In Prometheus UI: query request_total — should be > 0 after a few curl /v1/completions calls.
  • Query histogram_quantile(0.95, sum by(le) (rate(request_duration_seconds_bucket[1m]))) — should return a number.
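To see where the number from the histogram_quantile query comes from, the following sketch approximates the calculation for a single classic histogram: find the bucket containing the target rank, then interpolate linearly inside it. It is an illustration in plain Python, not Prometheus's implementation, and the bucket counts in the usage line are made up. In the real query the inputs are per-bucket rates, not raw counts, but the interpolation works the same way.

```python
from typing import Sequence, Tuple


def histogram_quantile(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Approximate PromQL histogram_quantile for one classic histogram.

    buckets: (upper_bound, cumulative_count) pairs sorted by bound,
    with the +Inf bucket last.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    if len(buckets) < 2 or buckets[-1][1] == 0:
        return float("nan")
    rank = q * buckets[-1][1]
    lower_bound, lower_count = 0.0, 0.0  # the lowest bucket is assumed to start at 0
    for i, (upper_bound, count) in enumerate(buckets):
        if count >= rank:
            if upper_bound == float("inf"):
                # The quantile falls in +Inf: report the highest finite bound.
                return buckets[i - 1][0]
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return float("nan")


# Made-up data: 20 requests, 18 of them at or under 1 s, all under 2 s.
example = [(0.5, 10), (1.0, 18), (2.0, 20), (float("inf"), 20)]
p95 = histogram_quantile(0.95, example)  # about 1.5, halfway into the (1, 2] bucket
```

The result can only be as precise as the bucket boundaries, which is why the lab fixes LLM-specific bucket lists instead of using the client defaults.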

Block E — Grafana dashboard skeleton

  • Create a new dashboard in Grafana titled "lynx-cortex LLM serving".
  • Add six panels, one per metric. Any layout. Lab 03 polishes.
  • Save dashboard. Export JSON. Commit to infra/grafana/dashboards/llm.json.

Constraints

  • One global registry. prometheus_client.REGISTRY (the default). Do not create a custom CollectorRegistry — multi-registry is for libraries, not application code.
  • No per-request labels in metrics. No user_id, no prompt_hash. Cardinality rule from theory file 01.
  • Counters are monotonic. If you find yourself wanting counter.set(0), you want a gauge.
  • No Summaries. Histograms only.

Stop conditions

Done when:

  1. src/observability/metrics.py exists and exports the six metrics + the record_request helper + the metrics_app.
  2. /metrics endpoint on the running Phase 33 server returns valid Prometheus exposition.
  3. Prometheus targets page shows miniserve as UP.
  4. PromQL request_total{status="200"} > 0 returns a non-empty result after a single load test (for i in $(seq 1 20); do curl ...; done).
  5. Grafana dashboard saved + exported + committed.

Pitfalls (read before debugging)

  • /metrics 404. Most likely the mount happened after the FastAPI app started serving, or you used app.add_route instead of app.mount. Mount the ASGI sub-app before uvicorn.run.
  • Histogram observations not appearing. Check the metric name in PromQL. A histogram is exposed as three series families with the suffixes _bucket, _sum, and _count, so the base name (request_duration_seconds) without a suffix returns nothing; request_duration_seconds_count returns the observation count.
  • High cardinality warning from Prometheus. Logs say "scrape took N seconds" or "discarding sample with reset value". Check label cardinality. Most common cause: forgetting that FastAPI's request.url.path includes path parameters — for /v1/users/42, you get one series per user id. Either strip path parameters or use the route's pattern (/v1/users/{id}).
  • Gauge not updating. If using .set_function(), the function is called on every scrape. If it raises, the exception surfaces during collection and can fail the scrape rather than update the gauge. Wrap the callback body in try/except and log the error.
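For the cardinality pitfall, one way to use the route's pattern as the endpoint label is sketched below. It rests on an assumption worth checking against your FastAPI version: that after routing, the matched route object is stored under scope["route"] and exposes the template as .path. The helper name endpoint_label and the "unmatched" fallback are hypothetical.

```python
from types import SimpleNamespace
from typing import Any, Mapping


def endpoint_label(scope: Mapping[str, Any], fallback: str = "unmatched") -> str:
    """Return the route template (e.g. /v1/users/{id}) instead of the concrete path.

    Assumes the framework stores the matched route under scope["route"].
    Requests that matched no route share one fixed label, so probes of
    random URLs cannot create one series per path.
    """
    route = scope.get("route")
    path = getattr(route, "path", None)
    return path if isinstance(path, str) else fallback


# Simulated scopes; in the middleware this would be request.scope,
# read after call_next has run so that routing has already happened.
matched = {"route": SimpleNamespace(path="/v1/users/{id}")}
label_for_user_42 = endpoint_label(matched)
label_for_unknown = endpoint_label({})
```

In the middleware, passing endpoint_label(request.scope) to record_request in place of request.url.path would keep the endpoint label bounded by the number of declared routes.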

When to consult solutions/

After all five Stop conditions pass. Solution at solutions/01-instrument-server-ref.md.


Next lab: lab/02-tracing-end-to-end.md.