Lab 01 — Instrument the Phase 33 server with RED + USE + LLM metrics

Goal: the six core Prometheus metrics, wired into the Phase 33 server, scraped successfully.

Estimated time: 2-3 hours.

Prereq: Lab 00 stack up; Phase 33 server runnable locally.

What you produce

src/observability/__init__.py
src/observability/metrics.py — the metric definitions + a small middleware that records them.
Phase 33 server modified to register the metrics middleware and expose /metrics.
A Grafana dashboard "draft" (any layout — refine in lab 03) with at least one panel per metric.

The six metrics

Implement exactly these. No fewer, no more — additional metrics belong in lab 02 (LLM-specific, beyond the core six) or lab 03 (cost).

Name	Type	Labels	What
`request_total`	Counter	`endpoint`, `method`, `status`	RED: rate + errors
`request_duration_seconds`	Histogram (LLM buckets)	`endpoint`, `method`	RED: duration
`tokens_total`	Counter	`kind` ∈ `{prompt, completion}`, `model_name`	LLM: token throughput
`time_to_first_token_seconds`	Histogram (TTFT buckets)	`model_name`	LLM: streaming UX
`kv_cache_slots_used`	Gauge	(no labels)	USE: KV saturation
`queue_depth`	Gauge	(no labels)	USE: batcher saturation

Buckets:

LLM_LATENCY_BUCKETS = (0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 20, 30, 60, 120, float("inf"))
TTFT_BUCKETS = (0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, float("inf"))

TODOs

Block A — `src/observability/metrics.py`

Import prometheus_client (already in pyproject.toml's serve group).
Define the six metrics as module-level globals, using the labels and bucket lists above.
Export a metrics_app (a prometheus_client.make_asgi_app()) so the Phase 33 FastAPI app can mount it at /metrics.
Write a record_request(endpoint, method, status, duration_s) helper that does the two RED observations (counter + histogram).
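
One possible shape for the module described in the four steps above, as a minimal sketch rather than the reference solution. The metric names, labels, and buckets come from the table; the help strings are placeholders, and everything registers on the default registry.

```python
from prometheus_client import Counter, Gauge, Histogram, make_asgi_app

# Bucket boundaries copied from the lab text; +Inf is the catch-all bucket.
LLM_LATENCY_BUCKETS = (0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 20, 30, 60, 120, float("inf"))
TTFT_BUCKETS = (0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, float("inf"))

# RED: rate + errors, and duration. Help strings are placeholders.
request_total = Counter(
    "request_total", "HTTP requests handled.", ["endpoint", "method", "status"]
)
request_duration_seconds = Histogram(
    "request_duration_seconds", "HTTP request latency in seconds.",
    ["endpoint", "method"], buckets=LLM_LATENCY_BUCKETS,
)

# LLM: token counts and time to first token.
tokens_total = Counter("tokens_total", "Tokens processed.", ["kind", "model_name"])
time_to_first_token_seconds = Histogram(
    "time_to_first_token_seconds", "Time to first streamed token in seconds.",
    ["model_name"], buckets=TTFT_BUCKETS,
)

# USE: saturation gauges, no labels.
kv_cache_slots_used = Gauge("kv_cache_slots_used", "KV cache slots in use.")
queue_depth = Gauge("queue_depth", "Requests waiting in the batcher queue.")

# ASGI app serving the default registry; mounted at /metrics in Block B.
metrics_app = make_asgi_app()


def record_request(endpoint: str, method: str, status: str, duration_s: float) -> None:
    """Record the two RED observations for one finished request."""
    request_total.labels(endpoint=endpoint, method=method, status=status).inc()
    request_duration_seconds.labels(endpoint=endpoint, method=method).observe(duration_s)
```

Labelled metrics emit no series until a label combination is first used, so `request_total` only appears in `/metrics` after the first request is recorded.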

Block B — wire into the Phase 33 server

In src/miniserve/app.py, add the metrics middleware:

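import time

@app.middleware("http")
async def observe(request, call_next):
    t0 = time.perf_counter()
    status = 500  # assume failure until a response is actually produced
    try:
        response = await call_next(request)
        status = response.status_code
        return response
    finally:
        # Runs on success, on exceptions, and on cancellation alike.
        # request.url.path is the concrete path; see Pitfalls on path parameters.
        record_request(request.url.path, request.method, str(status), time.perf_counter() - t0)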

Mount the metrics endpoint:

app.mount("/metrics", metrics_app)

In the batcher (Phase 33's src/miniserve/batcher.py), expose kv_cache_slots_used and queue_depth via the gauges' .set_function(...) (pull-on-scrape) or .set(...) after each state change (push-on-event). Pull-on-scrape is simpler — use it.
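
A minimal sketch of the pull-on-scrape option. The `Batcher` class, its `pending` and `kv_slots` attributes, and the `safe` wrapper are hypothetical stand-ins for whatever state the Phase 33 batcher actually keeps. The gauges are redefined here only so the sketch runs on its own; in the lab they are imported from `src/observability/metrics.py`. Falling back to NaN when a callback fails is an assumption, chosen so a broken reader shows up as a gap rather than a misleading zero.

```python
import logging

from prometheus_client import Gauge

log = logging.getLogger(__name__)

# Redefined for self-containment; import these from metrics.py in the lab.
kv_cache_slots_used = Gauge("kv_cache_slots_used", "KV cache slots in use.")
queue_depth = Gauge("queue_depth", "Requests waiting in the batcher queue.")


class Batcher:
    """Hypothetical stand-in for the Phase 33 batcher state."""

    def __init__(self):
        self.pending = []      # requests waiting to be batched
        self.kv_slots = set()  # KV cache slots currently occupied


def safe(read, fallback=float("nan")):
    """Wrap a reader so a failure is logged and replaced by a fallback value."""
    def wrapper():
        try:
            return float(read())
        except Exception:
            log.exception("gauge callback failed")
            return fallback
    return wrapper


batcher = Batcher()

# Each callback runs on every scrape and reads the live state at that moment.
queue_depth.set_function(safe(lambda: len(batcher.pending)))
kv_cache_slots_used.set_function(safe(lambda: len(batcher.kv_slots)))
```

With this wiring the batcher itself needs no metrics calls, which is why pull-on-scrape is the simpler of the two options.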

Block C — observe tokens

In the prompt-tokenization step, after counting input tokens: tokens_total.labels(kind="prompt", model_name=model).inc(n_tokens).
In the decode loop, on every emitted token: tokens_total.labels(kind="completion", model_name=model).inc(1).
On the first emitted token of a request, record TTFT: time_to_first_token_seconds.labels(model_name=model).observe(time.perf_counter() - request_start).
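
The two decode-loop steps above can be sketched as a generator wrapper around the token stream. `observe_stream` and its callback parameters are hypothetical names, not part of Phase 33; the callbacks are injected so the timing logic can be tested without a metrics registry.

```python
import time
from typing import Callable, Iterable, Iterator


def observe_stream(
    tokens: Iterable[str],
    request_start: float,
    on_token: Callable[[], None],
    on_first_token: Callable[[float], None],
    clock: Callable[[], float] = time.perf_counter,
) -> Iterator[str]:
    """Yield tokens unchanged while counting each one and timing the first."""
    first = True
    for token in tokens:
        if first:
            # TTFT is measured once per request, at the first emitted token.
            on_first_token(clock() - request_start)
            first = False
        on_token()
        yield token


# In the server, the callbacks would be the bound metric methods, e.g.:
#   on_token=tokens_total.labels(kind="completion", model_name=model).inc
#   on_first_token=time_to_first_token_seconds.labels(model_name=model).observe
```

`request_start` must come from the same clock (`time.perf_counter()` by default) at the moment the request arrives, otherwise the difference is meaningless.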

Block D — verify in Prometheus

Start the Phase 33 server.
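curl -L http://localhost:8000/metrics — should return ~30 lines of Prometheus exposition. (-L follows the redirect to /metrics/ that a mounted sub-app may issue; without it curl can print an empty body.)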
In Prometheus UI: query request_total — should be > 0 after a few curl /v1/completions calls.
Query histogram_quantile(0.95, sum by(le) (rate(request_duration_seconds_bucket[1m]))) — should return a number.
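
To sanity-check the number that query returns, here is a rough stand-alone version of the estimate, assuming the usual linear interpolation inside the bucket that contains the target rank. The function and the sample bucket counts are illustrative, not taken from a real scrape.

```python
import math
from typing import Sequence, Tuple


def histogram_quantile(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank lies beyond the last finite bucket: report that bucket's bound.
                return buckets[-2][0]
            # Interpolate linearly between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Illustrative cumulative counts for request_duration_seconds_bucket.
sample = [(0.05, 0), (0.1, 2), (0.25, 10), (0.5, 18), (1.0, 20), (math.inf, 20)]
p95 = histogram_quantile(0.95, sample)
print(f"{p95:.2f}")  # 0.75
```

The estimate can never be finer than the bucket layout: here p95 lands somewhere in the 0.5 to 1 second bucket, and the interpolation only guesses where.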

Block E — Grafana dashboard skeleton

Create a new dashboard in Grafana titled "lynx-cortex LLM serving".
Add six panels, one per metric. Any layout. Lab 03 polishes.
Save dashboard. Export JSON. Commit to infra/grafana/dashboards/llm.json.

Constraints

One global registry. prometheus_client.REGISTRY (the default). Do not create a custom CollectorRegistry — multi-registry is for libraries, not application code.
No per-request labels in metrics. No user_id, no prompt_hash. Cardinality rule from theory file 01.
Counters are monotonic. If you find yourself wanting counter.set(0), you want a gauge.
No Summaries. Histograms only.

Stop conditions

Done when:

src/observability/metrics.py exists and exports the six metrics + the record_request helper + the metrics_app.
/metrics endpoint on the running Phase 33 server returns valid Prometheus exposition.
Prometheus targets page shows miniserve as UP.
PromQL request_total{status="200"} > 0 returns a non-empty result after a single load test (for i in $(seq 1 20); do curl ...; done).
Grafana dashboard saved + exported + committed.

Pitfalls (read before debugging)

/metrics 404. Most likely the mount happened after the FastAPI app started serving, or you used app.add_route instead of app.mount. Mount the ASGI sub-app before uvicorn.run.
Histogram observations not appearing. Check the metric name in PromQL — Prometheus auto-adds _bucket, _sum, _count suffixes. The base name (request_duration_seconds) without a suffix returns nothing; request_duration_seconds_count returns the observation count.
High cardinality warning from Prometheus. Logs say "scrape took N seconds" or "discarding sample with reset value". Check label cardinality. Most common cause: forgetting that FastAPI's request.url.path includes path parameters — for /v1/users/42, you get one series per user id. Either strip path parameters or use the route's pattern (/v1/users/{id}).
Gauge not updating. If using .set_function(), the function is called on every scrape. If it raises, the metric is silently dropped. Wrap in try/except and log.
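
For the high-cardinality pitfall, one way to strip path parameters is to map each concrete path back to its route pattern before using it as the `endpoint` label. This is a framework-independent sketch; `ROUTES`, `compile_templates`, and `endpoint_label` are hypothetical names, and the route list is illustrative. Unknown paths collapse into a single fallback value so that scanners and typos cannot create new series either.

```python
import re


def compile_templates(templates):
    """Turn '/v1/users/{id}' style templates into (regex, template) pairs."""
    compiled = []
    for template in templates:
        # Split on {param} placeholders, keeping them as separate parts.
        parts = re.split(r"(\{[^/{}]+\})", template)
        pattern = "".join(
            "[^/]+" if part.startswith("{") else re.escape(part) for part in parts
        )
        compiled.append((re.compile(pattern), template))
    return compiled


def endpoint_label(path, compiled, fallback="other"):
    """Return the matching route template, or a single fallback label."""
    for regex, template in compiled:
        if regex.fullmatch(path):
            return template
    return fallback


# Illustrative route list; use the server's real route patterns.
ROUTES = compile_templates(["/v1/completions", "/v1/users/{id}"])

print(endpoint_label("/v1/users/42", ROUTES))   # /v1/users/{id}
print(endpoint_label("/wp-admin.php", ROUTES))  # other
```

In the middleware, the label passed to `record_request` would then be `endpoint_label(request.url.path, ROUTES)` instead of the raw path.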

When to consult `solutions/`

After all five Stop conditions pass. Solution at solutions/01-instrument-server-ref.md.

Next lab: lab/02-tracing-end-to-end.md.