Theory 02 — End-to-end data flow: one request, every layer

A single HTTP request to the grammar tutor crosses nine layers, from the kernel's accept(2) to the JSON on the response socket. Here we follow every byte: how many, in what form, at what stage of the latency budget, and at which percentile. Summing per-stage p95s does not give the total p95; that fallacy is demonstrated at the end of the chapter.

Why a byte-level walkthrough matters

Phases 11–34 each built one stage of the pipeline in isolation. The grammar tutor's request path is the first time a learner sees them composed. Three things break under composition that no single-stage view exposes:

  1. Byte format mismatches. Phase 11's BPE emits list[int]; Phase 17's Mini-GPT consumes a torch.LongTensor of shape (batch, seq_len). The list[int] → tensor cast has to happen somewhere; if it happens in the wrong place, you allocate twice.
  2. Latency budget collisions. Each stage thinks it has 500 ms. The user has 5 s total. The math doesn't add up unless someone audits the sum, which is Phase 39's job.
  3. Percentile arithmetic. Engineers fluent in percentiles still get this wrong: in general, \(p_{95}(A + B) \ne p_{95}(A) + p_{95}(B)\). The capstone dashboard's per-stage p95 panel is informational; only the end-to-end p95 (computed on full traces) is load-bearing.

This chapter walks one request — POST /v1/grammar/correct with body {"sentence": "Yesterday I goed to the store"} — through every layer with concrete numbers.

The nine layers

| # | Layer | Phase | Input | Output | Typical bytes | Typical ms |
|---|-------|-------|-------|--------|---------------|------------|
| 1 | HTTP ingress (FastAPI + uvicorn) | 33 | TCP bytes | Request object | ~600 B in, ~400 B out | 0.5 |
| 2 | Schema validation (Pydantic + Phase 30 output schema) | 30 | Request body | CorrectRequest (Pydantic model) | n/a | 0.3 |
| 3 | Security guards (rate-limit, body-size, injection filter) | 33, 37 | CorrectRequest | CorrectRequest (or 4xx) | n/a | 0.2 |
| 4 | BPE tokenization | 11 | str ("Yesterday I goed...") | list[int] of length ~10 | ~30 B → 80 B (int64) | 1.0 |
| 5 | RAG retrieval (bi-encoder + BM25 hybrid) | 29 | list[int] or raw text | top-5 chunks (~2 KB) | ~80 B → ~2 KB | 25 |
| 6 | Mini-GPT prefill (encode prompt + context) | 17, 22 | (1, ~80) int64 | (1, 80, \|V\|) float32 | ~640 B (80 × int64) → ~5 MB logits + ~3 MB KV cache | 800 |
| 7 | Mini-GPT decode (generate ~30 tokens) | 17, 21, 22 | KV cache | (1, ~30) int64 + 30 × KV-cache appends | ~30 × 100 KB = 3 MB | 2,500 |
| 8 | Output formatting (structured JSON via Phase 30 schema) | 30 | (1, ~30) int64 + BPE decode | CorrectResponse JSON | ~240 B (30 × int64) → ~400 B | 2.0 |
| 9 | Cost emission + trace flush + response write | 34 | CorrectResponse | TCP bytes | 400 B out + ~2 KB metrics | 1.5 |

Total wall time (typical, single request, warm cache): ≈ 3,330 ms.

Against the target: the DoD requires p95 < 5,000 ms under 10 concurrent requests (Plan §7). The 3.3 s typical figure leaves roughly 1.7 s of headroom for tail latency and queueing. Phase 33's continuous-batching theory says queue wait dominates the tail; the capstone dashboard's queue depth panel is the leading indicator.

Where the bytes are

Two patterns to internalize:

  • Inputs are tiny, internal state is large. Request body: 600 B. KV cache during decode: ~3 MB. That ratio (5000×) explains why Phase 22 (KV cache) matters: the cheap thing to optimize would be the bytes the user sends; the valuable thing to optimize is the bytes the model keeps in RAM.
  • The dominant cost is decode, not prefill. Prefill is 800 ms for the whole prompt at once; decode is 2,500 ms for 30 tokens one at a time. The decode-dominance is intrinsic to autoregressive generation and is the headline finding of Phase 21's cost model.
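The byte figures above follow from simple shape-times-itemsize arithmetic. The sketch below reproduces them under one loud assumption: the chapter never states the vocabulary size, so `VOCAB_SIZE = 16_384` is a hypothetical value chosen only because it lands the prefill logits near the table's ~5 MB. All other constants are taken from the nine-layer table.

```python
# Shape-times-itemsize arithmetic for the byte figures in this chapter.
# VOCAB_SIZE is an ASSUMPTION (not stated in the chapter); it is picked so the
# prefill logits come out near the ~5 MB quoted in the nine-layer table.
INT64_BYTES = 8
FLOAT32_BYTES = 4

VOCAB_SIZE = 16_384           # assumed, see note above
SENTENCE_TOKENS = 10          # BPE output length (layer 4)
PROMPT_TOKENS = 80            # prompt + retrieved context (layer 6)
DECODE_TOKENS = 30            # generated tokens (layer 7)
KV_BYTES_PER_TOKEN = 100_000  # ~100 KB appended per decoded token (layer 7)
REQUEST_BYTES = 600           # HTTP request size (layer 1)


def tensor_bytes(shape, itemsize):
    """Bytes needed for a dense tensor of the given shape and element size."""
    count = 1
    for dim in shape:
        count *= dim
    return count * itemsize


token_id_bytes = tensor_bytes((SENTENCE_TOKENS,), INT64_BYTES)
logits_bytes = tensor_bytes((1, PROMPT_TOKENS, VOCAB_SIZE), FLOAT32_BYTES)
kv_decode_bytes = DECODE_TOKENS * KV_BYTES_PER_TOKEN
amplification = kv_decode_bytes / REQUEST_BYTES

print(f"token IDs:       {token_id_bytes} B")
print(f"prefill logits:  {logits_bytes / 2**20:.1f} MiB")
print(f"decode KV cache: {kv_decode_bytes / 1e6:.1f} MB")
print(f"KV cache / request body: {amplification:.0f}x")
```

Changing `VOCAB_SIZE` or `PROMPT_TOKENS` scales the logits linearly, which is one way to see why internal state, not the request body, is where the memory goes.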

The latency budget

Allocate 5,000 ms across stages. Phase 39's Plan §2 says: allocate proportional to \(\sigma_i\) (each stage's standard deviation). In practice for a CPU-only single-node demo:

| Stage | Allocation | Justification |
|-------|------------|---------------|
| HTTP + schema + guards | 50 ms | Low variance; near-constant on warm process |
| BPE tokenize | 20 ms | Low variance; pure-Python loop, but bounded |
| RAG retrieve | 200 ms | Moderate variance; index size affects tail |
| Mini-GPT prefill | 1,200 ms | High variance with sequence length |
| Mini-GPT decode | 3,300 ms | Dominant. Variance with output length |
| Format + cost + flush | 30 ms | Low variance |
| Buffer (queueing, GC pauses) | 200 ms | Slack |
| Total budget | 5,000 ms | Matches DoD target |

The Grafana dashboard's latency budget panel renders this as a stacked bar; the actual timing per request is rendered below it. Visual diffs (budget vs actual) catch budget violations within seconds of regression.
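Read literally, the proportional rule attributed to Plan §2 at the start of this section could be written as below. The symbols \(b_i\) (allocation for stage \(i\)) and \(B\) (total budget, 5,000 ms here) are introduced only for illustration; the Plan may state the rule differently.

\[b_i = B \cdot \frac{\sigma_i}{\sum_{j} \sigma_j}, \qquad \sum_{i} b_i = B\]

The chapter does not list the per-stage \(\sigma_i\), so the allocations in the table should be read as the practical values for the CPU-only demo rather than as a direct evaluation of this formula.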

What "budget" actually means

A latency budget does not mean "no stage will ever exceed its allocation." It means "if a stage exceeds its allocation repeatedly, that's a debt to pay down." The dashboard monitors a budget burn rate metric (stage time / allocation); a sustained burn > 1.5× for any stage triggers a Phase-40 carry-over.
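A minimal sketch of the burn-rate check, assuming the allocations from the budget table. The stage keys are hypothetical names, and "sustained" is interpreted here as "above the threshold on every request in the window", which is an assumption; the dashboard may define it differently.

```python
# Budget burn rate = stage time / allocation.
# Allocations come from the budget table; stage keys are hypothetical names.
ALLOCATION_MS = {
    "http_schema_guards": 50,
    "bpe_tokenize": 20,
    "rag_retrieve": 200,
    "prefill": 1200,
    "decode": 3300,
    "format_cost_flush": 30,
}
BURN_THRESHOLD = 1.5


def burn_rates(stage_ms):
    """Return stage -> (observed ms / allocated ms) for one request."""
    return {stage: stage_ms[stage] / ALLOCATION_MS[stage] for stage in ALLOCATION_MS}


def sustained_overruns(history, threshold=BURN_THRESHOLD):
    """Stages whose burn rate exceeds the threshold on every request in history.

    ASSUMPTION: 'sustained' means 'on every sample in the window'.
    """
    if not history:
        return []
    rates = [burn_rates(sample) for sample in history]
    return [
        stage
        for stage in ALLOCATION_MS
        if all(rate[stage] > threshold for rate in rates)
    ]


# Typical timings, grouped from the nine-layer table.
typical = {
    "http_schema_guards": 1.0,
    "bpe_tokenize": 1.0,
    "rag_retrieve": 25.0,
    "prefill": 800.0,
    "decode": 2500.0,
    "format_cost_flush": 3.5,
}
print(burn_rates(typical))
print(sustained_overruns([typical] * 3))
```

A single slow request does not trip the check; only a window in which the same stage overruns every time does, which matches the "debt to pay down" reading of a budget.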

The percentile-addition fallacy

A common bug in performance dashboards:

"Each stage's p95 is shown. The sum of p95s is reported as 'end-to-end p95'."

This is wrong. In general:

\[p_{95}\left(\sum_{i} X_i\right) \ne \sum_{i} p_{95}(X_i)\]

The intuition: the p95 of stage A and the p95 of stage B are typically observed on different requests. The request that's slow at stage A is usually not the request that's slow at stage B. Summing per-stage p95s implicitly assumes the worst case, in which every stage is slow on the same requests (perfect positive correlation); when stage latencies are closer to independent, the actual end-to-end p95 is lower than that sum.

A concrete demonstration

Take two stages, each with latency uniformly distributed in \([100, 1000]\) ms:

  • \(p_{95}(A) = 955\) ms
  • \(p_{95}(B) = 955\) ms
  • Naïve sum: \(1910\) ms
  • Actual \(p_{95}(A + B)\) for independent \(A\) and \(B\) (via simulation of \(10^6\) samples): ≈ \(1715\) ms, matching the closed form \(2000 - 900\sqrt{0.1}\) for the sum of two independent uniforms

The naïve estimate overstates the real p95 by ~11%. In a dashboard with 9 stages, the overstatement compounds. Engineers see "tail latency" that isn't actually there and chase the wrong optimization.
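The two-stage comparison can be reproduced with the standard library alone. This is a sketch, assuming the two stages are independent; it uses a nearest-rank percentile and a smaller default sample count than \(10^6\) to keep the run short. The function names are illustrative and do not come from the course code.

```python
import math
import random


def percentile(values, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def simulate(n=200_000, lo=100.0, hi=1000.0, seed=0):
    """Draw n independent (A, B) pairs, each uniform on [lo, hi]; return p95(A), p95(B), p95(A + B)."""
    rng = random.Random(seed)
    a = [rng.uniform(lo, hi) for _ in range(n)]
    b = [rng.uniform(lo, hi) for _ in range(n)]
    total = [x + y for x, y in zip(a, b)]
    return percentile(a, 95), percentile(b, 95), percentile(total, 95)


def exact_p95_of_sum(lo=100.0, hi=1000.0):
    """Closed form for the sum of two independent uniforms on [lo, hi].

    In the upper half of the triangular distribution,
    P(A + B > s) = (2*hi - s)**2 / (2 * (hi - lo)**2); setting this to 0.05
    and solving for s gives the expression below.
    """
    return 2 * hi - (hi - lo) * math.sqrt(2 * 0.05)


p95_a, p95_b, p95_total = simulate()
naive = p95_a + p95_b
print(f"sum of per-stage p95s: {naive:.0f} ms")
print(f"simulated p95(A + B):  {p95_total:.0f} ms")
print(f"closed-form p95(A + B): {exact_p95_of_sum():.0f} ms")
print(f"overstatement: {100 * (naive / p95_total - 1):.1f}%")
```

Printing the simulated and closed-form values side by side lets the comparison be checked directly rather than taken on trust.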

Mitigation: the dashboard computes end-to-end percentiles from full request traces (Tempo / Jaeger), not from per-stage summary statistics. Per-stage p95 is informational; it tells you which stage is consistently slowest, not what the end-to-end tail is.

Lab 01 includes a one-shot script that fetches recent traces, computes both numbers, and prints their delta. The first run typically shows a >10% delta — the capstone makes this visible.
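The core of such a comparison might look like the sketch below. It is not the Lab 01 script: fetching traces from Tempo or Jaeger is omitted, and the input shape (one dict of stage name to milliseconds per request) is a hypothetical simplification. The sample data is made up to make the effect easy to follow.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def p95_delta(traces):
    """Compare the sum of per-stage p95s with the end-to-end p95.

    traces: list of dicts mapping stage name -> duration in ms for one request
    (HYPOTHETICAL shape; a real script would build this from exported spans).
    Returns (sum_of_stage_p95, end_to_end_p95, relative_overstatement).
    """
    stages = sorted({stage for trace in traces for stage in trace})
    sum_of_stage_p95 = sum(
        percentile([trace.get(stage, 0.0) for trace in traces], 95)
        for stage in stages
    )
    end_to_end_p95 = percentile([sum(trace.values()) for trace in traces], 95)
    return sum_of_stage_p95, end_to_end_p95, sum_of_stage_p95 / end_to_end_p95 - 1.0


# Made-up data: 20 requests; two are slow in prefill, two others are slow in decode.
traces = [{"prefill": 800.0, "decode": 2500.0} for _ in range(20)]
traces[0]["prefill"] = traces[1]["prefill"] = 2000.0
traces[2]["decode"] = traces[3]["decode"] = 4000.0

naive, actual, delta = p95_delta(traces)
print(f"sum of stage p95s: {naive:.0f} ms")
print(f"end-to-end p95:    {actual:.0f} ms")
print(f"delta:             {100 * delta:.0f}%")
```

Because the slow prefill requests and the slow decode requests are different requests, the per-stage p95s never occur together, and the summed figure lands above the end-to-end one.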

Trace propagation: making the picture work

The dashboard's end-to-end latency view depends on every stage emitting a span with the same trace_id and a parent-child relationship. The contract from Theory 01:

| Stage | Emits span | Parent |
|-------|------------|--------|
| 1: HTTP ingress | http.request | (root) |
| 2: Validation | validate.body | http.request |
| 3: Guards | security.check | http.request |
| 4: BPE | tokenize.bpe | http.request |
| 5: RAG | retrieve.hybrid | http.request |
| 6: Prefill | model.prefill | http.request |
| 7: Decode | model.decode | http.request |
| 8: Format | format.json | http.request |
| 9: Cost | cost.emit | http.request |

Every span carries request_id as an attribute. The Phase 34 cost emitter writes the cost as a span attribute (not a separate metric line) so the trace and the cost are joined on a single ID without a downstream JOIN.

Cross-process boundary: the MCP tool

If the request triggers an MCP tool call (rare in the grammar tutor path; common in the security run-through of Lab 03), the child process must inherit the trace context. Mechanism:

  1. Parent serializes traceparent + tracestate, formatted as in the W3C Trace Context spec, into env vars.
  2. Subprocess reads them on startup and re-establishes the span context.
  3. Subprocess emits its spans as children of the parent.
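The three steps above can be sketched with the standard library. The `traceparent` layout (`version-traceid-parentid-flags`, lowercase hex) follows the W3C Trace Context format; the `TRACESTATE` variable name and the helper names are assumptions for illustration, and Phase 31's actual blueprint may differ. A real service would normally delegate this to its tracing SDK's propagator.

```python
import os
import re

# W3C traceparent, version 00: 2 hex version, 32 hex trace-id, 16 hex parent-id, 2 hex flags.
_TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def format_traceparent(trace_id, span_id, sampled=True):
    """Serialize the current span identity as a version-00 traceparent value."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(value):
    """Return the parsed context, or None if the value is malformed or all-zero."""
    match = _TRACEPARENT_RE.match(value.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_id,
        "sampled": bool(int(flags, 16) & 1),
    }


def child_env(trace_id, span_id, tracestate="", base_env=None):
    """Step 1 (parent side): build the environment to hand to the subprocess."""
    env = dict(os.environ if base_env is None else base_env)
    env["TRACEPARENT"] = format_traceparent(trace_id, span_id)
    if tracestate:
        env["TRACESTATE"] = tracestate  # ASSUMED variable name
    return env


def context_from_env(env=None):
    """Step 2 (child side): recover the parent context on startup, if present."""
    env = os.environ if env is None else env
    raw = env.get("TRACEPARENT")
    return parse_traceparent(raw) if raw else None


env = child_env("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", base_env={})
print(env["TRACEPARENT"])
print(context_from_env(env))
```

The parent would pass `env` to something like `subprocess.Popen(..., env=env)`; the child then starts its first span with the recovered `trace_id` and `parent_span_id`, which is step 3.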

Phase 31's tool blueprint encodes this; Phase 39 verifies it in Lab 03 by asserting the malicious-payload span shows as a child of the demo's request span.

Failure mode (Pitfall 5 from the Plan): if TRACEPARENT isn't propagated, the subprocess emits orphan spans. The dashboard's "Orphan span count" panel fires an alert. Lab 01 has a one-line audit.
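One way such an audit could work is sketched below. It is not the Lab 01 one-liner. It assumes spans are available as plain dicts and treats a span as an orphan if its parent is missing from its trace, or if it is a root span that is not `http.request`; both the data shape and that definition are assumptions, and the `mcp.tool.call` span name is hypothetical.

```python
EXPECTED_ROOT = "http.request"  # root span name from the span contract


def orphan_spans(spans):
    """Return spans that do not hang off an http.request root.

    spans: list of dicts with keys name, trace_id, span_id, parent_id
    (parent_id is None for a root span). HYPOTHETICAL shape.
    """
    known = {(span["trace_id"], span["span_id"]) for span in spans}
    orphans = []
    for span in spans:
        if span["parent_id"] is None:
            # A root span is only legitimate if it is the HTTP ingress span.
            if span["name"] != EXPECTED_ROOT:
                orphans.append(span)
        elif (span["trace_id"], span["parent_id"]) not in known:
            # The parent was never recorded in this trace.
            orphans.append(span)
    return orphans


spans = [
    {"name": "http.request", "trace_id": "t1", "span_id": "s1", "parent_id": None},
    {"name": "model.decode", "trace_id": "t1", "span_id": "s2", "parent_id": "s1"},
    # Subprocess that never received TRACEPARENT: it starts its own trace.
    {"name": "mcp.tool.call", "trace_id": "t2", "span_id": "s9", "parent_id": None},
]
print([span["name"] for span in orphan_spans(spans)])
```

A subprocess that loses its trace context typically starts a fresh trace, so checking only for missing parents would miss it; the root-name check covers that case.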

The byte journey, condensed

A 600-byte request becomes:

  • 30 B of UTF-8 sentence text (layer 4 input).
  • 80 B of int64 token IDs (layer 4 output), growing to about 640 B once the retrieved context brings the prompt to ~80 tokens (layer 6 input).
  • 5 MB of float32 logits (layer 6 internal).
  • 3 MB of KV cache (layer 6 / 7 internal).
  • 400 B of JSON response (layer 8 output, layer 9 input).
  • 600 B of HTTP response + 2 KB of metrics + ~5 KB of trace data (layer 9 output).

The 5,000× amplification from input to peak internal state is the raison d'être of every memory-aware optimization the curriculum touched: tokenizer compactness (Phase 11), quantization (Phase 26), KV cache sharing (Phase 22), continuous batching (Phase 33). The capstone makes this amplification visible.

What this theory does NOT cover

  • The internals of any single stage. Each is covered in its originating phase's theory.
  • GPU memory layout. CPU-only demo; GPU memory analysis is a Phase 35 / 36 concern.
  • Streaming responses. The demo uses a single JSON response, not server-sent events. SSE is a Phase 33 extension flagged as Phase 40 reading.
  • Multi-step agent loops. The grammar tutor is single-turn. Multi-turn agent loops with planning live in Phase 32 territory and are out of scope here.

Next: theory/03-cost-and-observability-stitching.md — how Phase 34 (cost) + Phase 38 (CpQU) + Prometheus + Grafana + Tempo all become one dashboard.