English · Español

03 — Tracing and structured logging¶

🇪🇸 Métricas resumen, trazas explican. Una métrica te dice "el p95 subió"; una traza te dice "esta petición concreta esperó 4 segundos en la cola del batcher porque el slot del KV cache estaba lleno". Los logs estructurados son el pegamento — trace_id une todo.

Metrics aggregate; they aren't pinpoint. When p95 latency rises, metrics tell you it rose; they don't tell you why this particular request was slow. That's tracing's job.

A trace is a causally-linked tree of spans, each span representing a unit of work with a start time, a duration, and key-value attributes. The trace tree shows how one logical request flowed through the stages of the service.

Anatomy of a span¶

A span is a 4-tuple:

{
  "trace_id": "<128-bit hex>",   # same across all spans in the trace
  "span_id":  "<64-bit hex>",     # unique to this span
  "parent_span_id": "<64-bit hex>",  # the span that caused this one
  "name": "decode_step",
  "start_unix_nano": 1716_xxx_xxx_xxx,
  "duration_nano": 234_000_000,
  "attributes": {"llm.model": "minigpt-v1", "llm.tokens.in": 128, "llm.tokens.out": 7},
  "events":  [{...}],  # discrete events within the span
  "status":  {"code": "OK"},
}

OpenTelemetry (OTel) defines this schema; the Python SDK provides tracer.start_as_current_span(...) to emit one.

The trace tree for an LLM request¶

A reasonable span hierarchy for a Phase 33 request:

http.handler              (root)
├── auth                  (10 ms)
├── tokenize.input        (3 ms; attrs: tokens.in=128)
├── rag.retrieve          (45 ms; attrs: docs.retrieved=3)
│   ├── embed.query       (15 ms)
│   ├── vector.search     (8 ms; attrs: top_k=10)
│   └── rerank            (20 ms)
├── llm.prefill           (180 ms; attrs: tokens.in=128, cache_misses=2)
├── llm.decode            (2400 ms)
│   ├── decode.step       (10 ms; one span per step OR sampled)
│   ├── decode.step       (10 ms)
│   └── ...
└── format.output         (5 ms)

This is roughly the span tree Phase 34's tracing.py builds.

Practical caveat: per-step spans explode. A 240-token generation = 240 decode.step spans = bloat. Two options:

Sample. Emit decode.step spans for 1% of decodes. The aggregate llm.decode parent span still captures total duration.
Aggregate. Emit a single llm.decode span with attributes={"steps": 240, "mean_step_ms": 10.0, "max_step_ms": 18.2}.

We use (2) by default; (1) is the lab's optional extension.

Context propagation¶

The hard part of OTel in Python is making sure the trace_id and parent span_id propagate correctly across:

async boundaries (await),
task creation (asyncio.create_task),
thread boundaries (run_in_executor),
subprocess invocations (not in Phase 34 — that's Phase 31 forward),
HTTP calls (the OTel HTTP instrumentation handles this for clients/servers).

The default start_as_current_span uses Python's contextvars, which propagates across await correctly but not across asyncio.create_task unless you wrap it. The Phase 33 server uses asyncio; the lab forces Borja to think through one explicit case (the streaming EventSourceResponse for SSE) where context propagation breaks if uncorrected.

Diagnostic: orphan spans (spans whose parent_span_id is unset, but whose name implies they should be children) are the symptom. Tempo/Jaeger renders these as floating roots; they should be zero in a correct setup.

Cardinality and sampling for traces¶

Unlike metrics, traces can afford per-request labels — trace_id itself is the per-request label, and the storage backend (Tempo) indexes them efficiently. But you still don't want infinite trace volume.

Sampling strategy: for the curriculum's volume (Phase 34 load tests ≤100 RPS for 5 minutes = ~30k traces), keep 100%. In production, "head-based" sampling (decide at the root span whether to keep) is the default; "tail-based" (decide after seeing the whole trace) is the smart-but-expensive variant.

For Phase 37 forward reference: tail-based sampling lets you keep 100% of failed or anomalous traces while throwing away most successful ones. That's how you preserve the prompt-injection attack traces for forensics without blowing up the trace store.

Structured logging¶

Logs are not strings; they are events with key-value payloads, serialized as JSON, one line per event.

structlog is the library. The wiring:

import structlog, logging
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        add_trace_context,                       # custom: pulls trace_id/span_id from OTel
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer(),
    ],
)
log = structlog.get_logger()
log.info("request.received", endpoint="/v1/completions", model="minigpt-v1")
# → {"timestamp": "2026-05-22T...", "level": "info", "trace_id": "...", "span_id": "...", "event": "request.received", "endpoint": "/v1/completions", "model": "minigpt-v1"}

Every log line carries trace_id and span_id. This is the magic that lets you go from a Grafana dashboard (a metric spike at 14:32) to the affected trace (filter Tempo by service.name="miniserve" in the time range) to the underlying log lines for that trace.

What to log, what not to log¶

Log:

Every request received (event=request.received, with method, endpoint, model, prompt-token-count).
Every error (event=request.failed, with error class, stage, prompt-token-count).
Significant state changes (event=batcher.slot.evicted, event=kv_cache.full).

Do NOT log:

The prompt body (PII; sensitive; large). Goes in the trace span attribute under a feature flag — never in the metric, never in the log.
The completion body. Same.
Per-token events. Way too noisy.
Health-check pings. Filter at the access-log layer.

The lab makes Borja write a log_redactor step that scrubs anything matching prompt, completion, messages, text from log payloads. Default-deny.

The triad: metrics + traces + logs joined¶

The three signals are joined by trace_id (and only by trace_id). Once you have it:

Click a metric spike in Grafana → query Tempo for traces in that time window → pick a trace → see the span tree → click a span → query Loki/structured logs for log lines with that trace_id.

The chain is one tool (Grafana) with three data sources. Phase 34's infra/compose/observability.yml brings up all three locally.

Langfuse: when and why¶

Langfuse is a self-hostable LLM-specific tracing UI. It overlaps with Tempo + Loki but adds:

Native rendering of prompts and completions side-by-side.
Token-level metadata as a first-class concept.
Built-in eval scoring (LLM-as-judge) attached to traces.

For Phase 34, Langfuse is optional. The OTel + Tempo + Loki stack is sufficient for the DoD. Langfuse is the right tool if Phase 37 (prompt injection) wants to use the trace store as a regression-test fixture — and Phase 37's lab does plan to use it that way. Decision point: add Langfuse to the compose file in Phase 34 or in Phase 37. Default: defer to Phase 37.

A worked example: tracing a real bug¶

Suppose users complain that one in fifty completions is empty (zero tokens). Metrics-only investigation:

rate(http_requests_total{status="200"}[5m]) confirms most are 200.
histogram_quantile(0.5, generation_length_tokens_bucket) is normal.
But histogram_quantile(0.01, generation_length_tokens_bucket) shows a spike at 0 tokens. So the bug is real, and it's in the left tail.

Without traces, you have no way to find a representative bad request. With traces:

In Tempo, search for spans where llm.tokens.out = 0. Get 30 traces.
Pick one. See: llm.decode span finished after 5 ms — way too fast.
Look at span attributes: stop_reason="immediate_eos".
Filter logs by trace_id of that trace. See: kv_cache.full eviction event in the middle of prefill, which corrupted the cache state for that batch slot.
Root cause found in 5 minutes. Without tracing, this is a multi-day investigation.

This kind of investigation is the entire reason tracing exists. Make sure Borja experiences at least one of them in the lab.

One-paragraph recap¶

Tracing builds a causal span tree per request via OpenTelemetry; structured logging emits JSON events keyed by trace_id/span_id; the two join via that id, so a metric spike → relevant traces → relevant log lines is a single workflow. Context propagation in Python requires care across asyncio.create_task and thread boundaries; orphan spans are the diagnostic. Don't log prompts or completions (PII + bloat); keep those in trace span attributes under a feature flag. Langfuse is optional — defer to Phase 37 unless the LLM-specific UI is needed sooner.

Done with theory. On to lab/00-prom-grafana-up.md.