English · Español
03 — Tracing and structured logging¶
🇪🇸 Métricas resumen, trazas explican. Una métrica te dice "el p95 subió"; una traza te dice "esta petición concreta esperó 4 segundos en la cola del batcher porque el slot del KV cache estaba lleno". Los logs estructurados son el pegamento —
trace_idune todo.
Metrics aggregate; they aren't pinpoint. When p95 latency rises, metrics tell you it rose; they don't tell you why this particular request was slow. That's tracing's job.
A trace is a causally-linked tree of spans, each span representing a unit of work with a start time, a duration, and key-value attributes. The trace tree shows how one logical request flowed through the stages of the service.
Anatomy of a span¶
A span is a 4-tuple:
{
"trace_id": "<128-bit hex>", # same across all spans in the trace
"span_id": "<64-bit hex>", # unique to this span
"parent_span_id": "<64-bit hex>", # the span that caused this one
"name": "decode_step",
"start_unix_nano": 1716_xxx_xxx_xxx,
"duration_nano": 234_000_000,
"attributes": {"llm.model": "minigpt-v1", "llm.tokens.in": 128, "llm.tokens.out": 7},
"events": [{...}], # discrete events within the span
"status": {"code": "OK"},
}
OpenTelemetry (OTel) defines this schema; the Python SDK provides tracer.start_as_current_span(...) to emit one.
The trace tree for an LLM request¶
A reasonable span hierarchy for a Phase 33 request:
http.handler (root)
├── auth (10 ms)
├── tokenize.input (3 ms; attrs: tokens.in=128)
├── rag.retrieve (45 ms; attrs: docs.retrieved=3)
│ ├── embed.query (15 ms)
│ ├── vector.search (8 ms; attrs: top_k=10)
│ └── rerank (20 ms)
├── llm.prefill (180 ms; attrs: tokens.in=128, cache_misses=2)
├── llm.decode (2400 ms)
│ ├── decode.step (10 ms; one span per step OR sampled)
│ ├── decode.step (10 ms)
│ └── ...
└── format.output (5 ms)
This is roughly the span tree Phase 34's tracing.py builds.
Practical caveat: per-step spans explode. A 240-token generation = 240 decode.step spans = bloat. Two options:
- Sample. Emit
decode.stepspans for 1% of decodes. The aggregatellm.decodeparent span still captures total duration. - Aggregate. Emit a single
llm.decodespan withattributes={"steps": 240, "mean_step_ms": 10.0, "max_step_ms": 18.2}.
We use (2) by default; (1) is the lab's optional extension.
Context propagation¶
The hard part of OTel in Python is making sure the trace_id and parent span_id propagate correctly across:
- async boundaries (
await), - task creation (
asyncio.create_task), - thread boundaries (
run_in_executor), - subprocess invocations (not in Phase 34 — that's Phase 31 forward),
- HTTP calls (the OTel HTTP instrumentation handles this for clients/servers).
The default start_as_current_span uses Python's contextvars, which propagates across await correctly but not across asyncio.create_task unless you wrap it. The Phase 33 server uses asyncio; the lab forces Borja to think through one explicit case (the streaming EventSourceResponse for SSE) where context propagation breaks if uncorrected.
Diagnostic: orphan spans (spans whose parent_span_id is unset, but whose name implies they should be children) are the symptom. Tempo/Jaeger renders these as floating roots; they should be zero in a correct setup.
Cardinality and sampling for traces¶
Unlike metrics, traces can afford per-request labels — trace_id itself is the per-request label, and the storage backend (Tempo) indexes them efficiently. But you still don't want infinite trace volume.
Sampling strategy: for the curriculum's volume (Phase 34 load tests ≤100 RPS for 5 minutes = ~30k traces), keep 100%. In production, "head-based" sampling (decide at the root span whether to keep) is the default; "tail-based" (decide after seeing the whole trace) is the smart-but-expensive variant.
For Phase 37 forward reference: tail-based sampling lets you keep 100% of failed or anomalous traces while throwing away most successful ones. That's how you preserve the prompt-injection attack traces for forensics without blowing up the trace store.
Structured logging¶
Logs are not strings; they are events with key-value payloads, serialized as JSON, one line per event.
structlog is the library. The wiring:
import structlog, logging
structlog.configure(
processors=[
structlog.contextvars.merge_contextvars,
add_trace_context, # custom: pulls trace_id/span_id from OTel
structlog.processors.TimeStamper(fmt="iso"),
structlog.processors.add_log_level,
structlog.processors.JSONRenderer(),
],
)
log = structlog.get_logger()
log.info("request.received", endpoint="/v1/completions", model="minigpt-v1")
# → {"timestamp": "2026-05-22T...", "level": "info", "trace_id": "...", "span_id": "...", "event": "request.received", "endpoint": "/v1/completions", "model": "minigpt-v1"}
Every log line carries trace_id and span_id. This is the magic that lets you go from a Grafana dashboard (a metric spike at 14:32) to the affected trace (filter Tempo by service.name="miniserve" in the time range) to the underlying log lines for that trace.
What to log, what not to log¶
Log:
- Every request received (event=
request.received, with method, endpoint, model, prompt-token-count). - Every error (event=
request.failed, with error class, stage, prompt-token-count). - Significant state changes (event=
batcher.slot.evicted, event=kv_cache.full).
Do NOT log:
- The prompt body (PII; sensitive; large). Goes in the trace span attribute under a feature flag — never in the metric, never in the log.
- The completion body. Same.
- Per-token events. Way too noisy.
- Health-check pings. Filter at the access-log layer.
The lab makes Borja write a log_redactor step that scrubs anything matching prompt, completion, messages, text from log payloads. Default-deny.
The triad: metrics + traces + logs joined¶
The three signals are joined by trace_id (and only by trace_id). Once you have it:
- Click a metric spike in Grafana → query Tempo for traces in that time window → pick a trace → see the span tree → click a span → query Loki/structured logs for log lines with that
trace_id.
The chain is one tool (Grafana) with three data sources. Phase 34's infra/compose/observability.yml brings up all three locally.
Langfuse: when and why¶
Langfuse is a self-hostable LLM-specific tracing UI. It overlaps with Tempo + Loki but adds:
- Native rendering of prompts and completions side-by-side.
- Token-level metadata as a first-class concept.
- Built-in eval scoring (LLM-as-judge) attached to traces.
For Phase 34, Langfuse is optional. The OTel + Tempo + Loki stack is sufficient for the DoD. Langfuse is the right tool if Phase 37 (prompt injection) wants to use the trace store as a regression-test fixture — and Phase 37's lab does plan to use it that way. Decision point: add Langfuse to the compose file in Phase 34 or in Phase 37. Default: defer to Phase 37.
A worked example: tracing a real bug¶
Suppose users complain that one in fifty completions is empty (zero tokens). Metrics-only investigation:
rate(http_requests_total{status="200"}[5m])confirms most are 200.histogram_quantile(0.5, generation_length_tokens_bucket)is normal.- But
histogram_quantile(0.01, generation_length_tokens_bucket)shows a spike at 0 tokens. So the bug is real, and it's in the left tail.
Without traces, you have no way to find a representative bad request. With traces:
- In Tempo, search for spans where
llm.tokens.out = 0. Get 30 traces. - Pick one. See:
llm.decodespan finished after 5 ms — way too fast. - Look at span attributes:
stop_reason="immediate_eos". - Filter logs by
trace_idof that trace. See:kv_cache.full evictionevent in the middle of prefill, which corrupted the cache state for that batch slot. - Root cause found in 5 minutes. Without tracing, this is a multi-day investigation.
This kind of investigation is the entire reason tracing exists. Make sure Borja experiences at least one of them in the lab.
One-paragraph recap¶
Tracing builds a causal span tree per request via OpenTelemetry; structured logging emits JSON events keyed by trace_id/span_id; the two join via that id, so a metric spike → relevant traces → relevant log lines is a single workflow. Context propagation in Python requires care across asyncio.create_task and thread boundaries; orphan spans are the diagnostic. Don't log prompts or completions (PII + bloat); keep those in trace span attributes under a feature flag. Langfuse is optional — defer to Phase 37 unless the LLM-specific UI is needed sooner.
Done with theory. On to lab/00-prom-grafana-up.md.