Theory 02 — End-to-end data flow: one request, every layer¶
A single HTTP request to the grammar tutor crosses nine layers, from the kernel's `accept(2)` to the JSON on the response socket. Here we follow every byte: how many, in what form, at which stage of the latency budget, and at which percentile. Summing per-stage p95s does not give the total p95; that fallacy is demonstrated later in the chapter.
Why a byte-level walkthrough matters¶
Phases 11–34 each built one stage of the pipeline in isolation. The grammar tutor's request path is the first time a learner sees them composed. Three things break under composition that no single-stage view exposes:
- Byte format mismatches. Phase 11's BPE emits `list[int]`; Phase 17's Mini-GPT consumes `torch.LongTensor` of shape `(batch, seq_len)`. The `list[int]` → tensor cast is somewhere; if it's in the wrong place, you allocate twice.
- Latency budget collisions. Each stage thinks it has 500 ms. The user has 5 s total. The math doesn't add up unless someone audits the sum, which is Phase 39's job.
- Percentile arithmetic. Engineers fluent in percentiles still get this wrong: \(p_{95}(A + B) \ne p_{95}(A) + p_{95}(B)\). The capstone dashboard's per-stage p95 panel is informational; only the end-to-end p95 (computed on full traces) is load-bearing.
This chapter walks one request — POST /v1/grammar/correct with body {"sentence": "Yesterday I goed to the store"} — through every layer with concrete numbers.
The nine layers¶
| # | Layer | Phase | Input | Output | Typical bytes | Typical ms |
|---|---|---|---|---|---|---|
| 1 | HTTP ingress (FastAPI + uvicorn) | 33 | TCP bytes | `Request` object | ~600 B in, ~400 B out | 0.5 |
| 2 | Schema validation (Pydantic + Phase 30 output schema) | 30 | Request body | `CorrectRequest` pydantic | n/a | 0.3 |
| 3 | Security guards (rate-limit, body-size, injection filter) | 33, 37 | `CorrectRequest` | `CorrectRequest` (or 4xx) | n/a | 0.2 |
| 4 | BPE tokenization | 11 | `str` ("Yesterday I goed...") | `list[int]` of length ~10 | ~30 B → 80 B (int64) | 1.0 |
| 5 | RAG retrieval (bi-encoder + BM25 hybrid) | 29 | `list[int]` or raw text | top-5 chunks (~2 KB) | ~80 B → ~2 KB | 25 |
| 6 | Mini-GPT prefill (encode prompt + context) | 17, 22 | `(1, ~80)` int64 | (1, 80, \(\lvert V \rvert\)) float32 | ~80 B → ~5 MB logits + ~3 MB KV cache | 800 |
| 7 | Mini-GPT decode (generate ~30 tokens) | 17, 21, 22 | KV cache | `(1, ~30)` int64 + 30 × KV-cache appends | ~30 × 100 KB = 3 MB | 2,500 |
| 8 | Output formatting (structured JSON via Phase 30 schema) | 30 | `(1, ~30)` int64 + BPE decode | `CorrectResponse` JSON | ~80 B → ~400 B | 2.0 |
| 9 | Cost emission + trace flush + response write | 34 | `CorrectResponse` | TCP bytes | 400 B out + ~2 KB metrics | 1.5 |
Total wall time (typical, single request, warm cache): ≈ 3,330 ms.
Per-target: the DoD requires p95 < 5,000 ms under 10 concurrent requests (Plan §7). The 3.3 s typical leaves a 1.7 s buffer for tail latency and queueing. Phase 33's continuous-batching theory says queue wait dominates the tail; the capstone dashboard's queue depth panel is the leading indicator.
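As a quick arithmetic check, the typical-ms column can be summed and compared against the 5,000 ms target. The sketch below copies the per-layer values from the table; the dictionary keys and variable names are illustrative and are not taken from the capstone code.

```python
# Typical per-layer latency in ms, copied from the nine-layer table.
# Key names are illustrative labels, not identifiers from the codebase.
TYPICAL_MS = {
    "http_ingress": 0.5,
    "schema_validation": 0.3,
    "security_guards": 0.2,
    "bpe_tokenize": 1.0,
    "rag_retrieve": 25.0,
    "prefill": 800.0,
    "decode": 2500.0,
    "format_json": 2.0,
    "cost_and_flush": 1.5,
}
P95_TARGET_MS = 5000.0  # DoD target under 10 concurrent requests

total_ms = sum(TYPICAL_MS.values())
headroom_ms = P95_TARGET_MS - total_ms
# Share of the typical wall time spent inside the model (prefill + decode).
model_share = (TYPICAL_MS["prefill"] + TYPICAL_MS["decode"]) / total_ms

print(f"typical total: {total_ms:.1f} ms")
print(f"headroom to target: {headroom_ms:.1f} ms")
print(f"model share: {model_share:.1%}")
```

With these numbers, prefill and decode together account for roughly 99% of the typical wall time, which is why the remaining layers barely register in the budget.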
Where the bytes are¶
Two patterns to internalize:
- Inputs are tiny, internal state is large. Request body: 600 B. KV cache during decode: ~3 MB. That ratio (5000×) explains why Phase 22 (KV cache) matters: the cheap thing to optimize would be the bytes the user sends; the valuable thing to optimize is the bytes the model keeps in RAM.
- The dominant cost is decode, not prefill. Prefill is 800 ms for the whole prompt at once; decode is 2,500 ms for 30 tokens one at a time. The decode-dominance is intrinsic to autoregressive generation and is the headline finding of Phase 21's cost model.
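The byte figures behind these two patterns can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: the token IDs are placeholders, and `VOCAB_SIZE` is a hypothetical value chosen so that the logits tensor lands near the ~5 MB quoted in the table; the real vocabulary size is not given in this chapter.

```python
from array import array

# Token IDs for a ~10-token sentence stored as 64-bit integers.
# The ID values are placeholders, not real BPE output.
token_ids = array("q", range(10))
token_bytes = token_ids.itemsize * len(token_ids)  # 8 B per ID

# Prefill logits of shape (batch, seq_len, vocab) in float32.
# VOCAB_SIZE is an assumption picked to match the ~5 MB figure.
BATCH, SEQ_LEN, VOCAB_SIZE, FLOAT32_BYTES = 1, 80, 16_384, 4
logits_bytes = BATCH * SEQ_LEN * VOCAB_SIZE * FLOAT32_BYTES

# Decode-time KV cache growth: ~30 tokens at ~100 KB each (from the table).
kv_bytes = 30 * 100_000

# Request body size from the table.
request_bytes = 600
amplification = kv_bytes / request_bytes

print(f"token IDs: {token_bytes} B")
print(f"logits:    {logits_bytes / 1e6:.1f} MB")
print(f"KV cache:  {kv_bytes / 1e6:.1f} MB")
print(f"KV cache / request body: {amplification:.0f}x")
```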
The latency budget¶
Allocate 5,000 ms across stages. Phase 39's Plan §2 says: allocate proportional to \(\sigma_i\) (each stage's standard deviation). In practice for a CPU-only single-node demo:
| Stage | Allocation | Justification |
|---|---|---|
| HTTP + schema + guards | 50 ms | Low variance; near-constant on warm process |
| BPE tokenize | 20 ms | Low variance; pure-Python loop, but bounded |
| RAG retrieve | 200 ms | Moderate variance; index size affects tail |
| Mini-GPT prefill | 1,200 ms | High variance with sequence length |
| Mini-GPT decode | 3,300 ms | Dominant. Variance with output length |
| Format + cost + flush | 30 ms | Low variance |
| Buffer (queueing, GC pauses) | 200 ms | Slack |
| Total budget | 5,000 ms | matches DoD target |
The Grafana dashboard's latency budget panel renders this as a stacked bar; the actual timing per request is rendered below it. Visual diffs (budget vs actual) catch budget violations within seconds of regression.
What "budget" actually means¶
A latency budget is not "no stage will ever exceed its allocation." It's "if a stage exceeds repeatedly, that's a debt to pay down." The dashboard's budget burn rate metric (stage time / allocation) is monitored; sustained burn > 1.5× for any stage triggers a Phase-40 carry-over.
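A minimal sketch of the burn-rate check follows. The allocations come from the budget table and the 1.5× threshold from the paragraph above; the meaning of "sustained" is not defined in this chapter, so the five-request window, the stage keys, and the `BurnMonitor` class are assumptions made for illustration.

```python
from collections import deque

# Allocations in ms, copied from the latency budget table.
# Stage keys are illustrative labels.
ALLOCATION_MS = {
    "http_schema_guards": 50,
    "bpe_tokenize": 20,
    "rag_retrieve": 200,
    "prefill": 1200,
    "decode": 3300,
    "format_cost_flush": 30,
}
BURN_THRESHOLD = 1.5  # from the text: sustained burn > 1.5x
WINDOW = 5            # assumption: "sustained" = last 5 requests all over threshold


def burn_rate(stage: str, observed_ms: float) -> float:
    """Stage time divided by its allocation."""
    return observed_ms / ALLOCATION_MS[stage]


class BurnMonitor:
    """Tracks recent burn rates per stage and flags sustained overruns."""

    def __init__(self, window: int = WINDOW) -> None:
        self._window = window
        self._recent = {stage: deque(maxlen=window) for stage in ALLOCATION_MS}

    def record(self, stage: str, observed_ms: float) -> bool:
        """Record one observation; return True if the overrun is sustained."""
        history = self._recent[stage]
        history.append(burn_rate(stage, observed_ms))
        return len(history) == self._window and all(
            rate > BURN_THRESHOLD for rate in history
        )


monitor = BurnMonitor()
# One normal decode followed by five slow ones (made-up timings).
decode_samples_ms = [2500, 5200, 5300, 5100, 5400, 5600]
flags = [monitor.record("decode", ms) for ms in decode_samples_ms]
print(flags)
```

A single slow request does not trip the flag; only a full window of overruns does, which matches the "debt to pay down" reading of a budget rather than a hard per-request limit.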
The percentile-addition fallacy¶
A common bug in performance dashboards:
"Each stage's p95 is shown. The sum of p95s is reported as 'end-to-end p95'."
This is wrong. Mathematically:

\[ p_{95}(A + B) \ne p_{95}(A) + p_{95}(B) \]

The intuition: the p95 of stage A and the p95 of stage B are typically observed on different requests. The request that's slow at stage A is not the same request that's slow at stage B. Summing per-stage p95s implicitly assumes the stages are perfectly correlated, so that the same request is slow everywhere; when the slow stages fall on different requests, the actual end-to-end p95 is lower than the sum.
A concrete demonstration¶
Take two stages, each with latency uniformly distributed in \([100, 1000]\) ms:
- \(p_{95}(A) = 955\) ms
- \(p_{95}(B) = 955\) ms
- Naïve sum: \(1910\) ms
- Actual \(p_{95}(A + B)\), assuming A and B are independent (via simulation of \(10^6\) samples): ≈ \(1715\) ms
The naïve estimate overstates the real p95 by ~11%. In a dashboard with 9 stages, the overstatement compounds. Engineers see "tail latency" that isn't actually there and chase the wrong optimization.
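The two-stage demonstration can be reproduced with the standard library alone. The sketch below assumes the two stages are independent, uses a nearest-rank percentile, and draws fewer samples than \(10^6\) to keep the run short; it prints the naive sum, the simulated end-to-end p95, and a closed-form cross-check derived from the triangular distribution of a sum of two uniforms.

```python
import math
import random

random.seed(0)            # fixed seed so the run is repeatable
N = 200_000               # smaller than 10^6 to keep the run short
LOW, HIGH = 100.0, 1000.0


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


# Two independent stages, each uniform on [LOW, HIGH] ms.
a = [random.uniform(LOW, HIGH) for _ in range(N)]
b = [random.uniform(LOW, HIGH) for _ in range(N)]
end_to_end = [x + y for x, y in zip(a, b)]

naive_sum = p95(a) + p95(b)
actual = p95(end_to_end)

# Closed form: for U1, U2 ~ Uniform(0, 1) independent,
# P(U1 + U2 <= s) = 1 - (2 - s)^2 / 2 on 1 <= s <= 2; solve for 0.95.
width = HIGH - LOW
analytic = 2 * LOW + width * (2 - math.sqrt(2 * 0.05))

print(f"naive sum of p95s:  {naive_sum:.0f} ms")
print(f"simulated p95(A+B): {actual:.0f} ms")
print(f"closed-form p95:    {analytic:.0f} ms")
print(f"overstatement:      {naive_sum / actual - 1:.1%}")
```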
Mitigation: the dashboard computes end-to-end percentiles from full request traces (Tempo / Jaeger), not from per-stage summary statistics. Per-stage p95 is informational; it tells you which stage is consistently slowest, not what the end-to-end tail is.
Lab 01 includes a one-shot script that fetches recent traces, computes both numbers, and prints their delta. The first run typically shows a >10% delta — the capstone makes this visible.
Trace propagation: making the picture work¶
The dashboard's end-to-end latency view depends on every stage emitting a span with the same trace_id and a parent-child relationship. The contract from Theory 01:
| Stage | Emits span | Parent |
|---|---|---|
| 1 — HTTP ingress | `http.request` | (root) |
| 2 — Validation | `validate.body` | `http.request` |
| 3 — Guards | `security.check` | `http.request` |
| 4 — BPE | `tokenize.bpe` | `http.request` |
| 5 — RAG | `retrieve.hybrid` | `http.request` |
| 6 — Prefill | `model.prefill` | `http.request` |
| 7 — Decode | `model.decode` | `http.request` |
| 8 — Format | `format.json` | `http.request` |
| 9 — Cost | `cost.emit` | `http.request` |
Every span carries request_id as an attribute. The Phase 34 cost emitter writes the cost as a span attribute (not a separate metric line) so the trace and the cost are joined on a single ID without a downstream JOIN.
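To make the contract concrete, here is a minimal sketch of the span tree as plain data, together with a check for spans whose parent cannot be found in the same trace. The `Span` record, the ID values, and the `find_orphans` helper are hypothetical; a real deployment would use its tracing SDK's span objects instead.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span


# Span names from the contract table; IDs below are made up for illustration.
STAGE_SPANS = [
    "validate.body", "security.check", "tokenize.bpe", "retrieve.hybrid",
    "model.prefill", "model.decode", "format.json", "cost.emit",
]


def build_trace(trace_id: str) -> list[Span]:
    """Root http.request span plus one child span per remaining stage."""
    root = Span("http.request", trace_id, "s0", None)
    children = [
        Span(name, trace_id, f"s{i}", root.span_id)
        for i, name in enumerate(STAGE_SPANS, start=1)
    ]
    return [root, *children]


def find_orphans(spans: list[Span]) -> list[Span]:
    """Non-root spans whose parent is not present in the same trace."""
    known = {(s.trace_id, s.span_id) for s in spans}
    return [
        s for s in spans
        if s.parent_id is not None and (s.trace_id, s.parent_id) not in known
    ]


trace = build_trace("t1")
# A span emitted under a different trace_id: its parent is unreachable.
stray = Span("tool.call", "t2", "s99", "s0")

print(len(find_orphans(trace)))
print([s.name for s in find_orphans(trace + [stray])])
```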
Cross-process boundary: the MCP tool¶
If the request triggers an MCP tool call (rare in the grammar tutor path; common in the security run-through of Lab 03), the child process must inherit the trace context. Mechanism:
- Parent serializes `traceparent` + `tracestate` into env vars (W3C trace-context spec).
- Subprocess reads them on startup and re-establishes the span context.
- Subprocess emits its spans as children of the parent.
Phase 31's tool blueprint encodes this; Phase 39 verifies it in Lab 03 by asserting the malicious-payload span shows as a child of the demo's request span.
Failure mode (Pitfall 5 from the Plan): if TRACEPARENT isn't propagated, the subprocess emits orphan spans. The dashboard's "Orphan span count" panel fires an alert. Lab 01 has a one-line audit.
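The propagation mechanism in this section can be sketched without any tracing library. The example below formats a W3C `traceparent` value, passes it to a child process through the `TRACEPARENT` environment variable, and parses it back. The `TRACESTATE` variable name, the helper functions, and the echo-only child are assumptions for illustration; they are not the Phase 31 blueprint.

```python
import os
import re
import secrets
import subprocess
import sys

# W3C traceparent layout: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def format_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Build a version-00 traceparent string."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(value: str) -> tuple[str, str]:
    """Return (trace_id, parent_span_id) or raise ValueError."""
    match = _TRACEPARENT_RE.match(value)
    if match is None:
        raise ValueError(f"malformed traceparent: {value!r}")
    _version, trace_id, span_id, _flags = match.groups()
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        raise ValueError("all-zero trace or span id is invalid")
    return trace_id, span_id


def child_env(trace_id: str, span_id: str, tracestate: str = "") -> dict[str, str]:
    """Copy of the current environment with the trace context added."""
    env = dict(os.environ)
    env["TRACEPARENT"] = format_traceparent(trace_id, span_id)
    if tracestate:
        env["TRACESTATE"] = tracestate  # assumed variable name
    return env


# Parent side: random IDs standing in for the current request span.
trace_id = secrets.token_hex(16)  # 32 hex characters
span_id = secrets.token_hex(8)    # 16 hex characters

# Child side: a stand-in subprocess that only echoes what it inherited.
result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ.get('TRACEPARENT', ''))"],
    env=child_env(trace_id, span_id),
    capture_output=True, text=True, check=True,
)
inherited_trace_id, parent_span_id = parse_traceparent(result.stdout.strip())
print(inherited_trace_id == trace_id, parent_span_id == span_id)
```

If the child is started without the variable, `parse_traceparent` raises on the empty string, which is the point at which a real tool would otherwise start a fresh trace and produce the orphan spans described above.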
The byte journey, condensed¶
A 600-byte request becomes:
- 30 B of UTF-8 sentence text (layer 4 input).
- 80 B of int64 token IDs (layer 4 output, layer 6 input).
- 5 MB of float32 logits (layer 6 internal).
- 3 MB of KV cache (layer 6 / 7 internal).
- 400 B of JSON response (layer 8 output, layer 9 input).
- 600 B of HTTP response + 2 KB of metrics + ~5 KB of trace data (layer 9 output).
The 5,000× amplification from input to peak internal state is the raison d'être of every memory-aware optimization the curriculum touched: tokenizer compactness (Phase 11), quantization (Phase 26), KV cache sharing (Phase 22), continuous batching (Phase 33). The capstone makes this amplification visible.
What this theory does NOT cover¶
- The internals of any single stage. Each is covered in its originating phase's theory.
- GPU memory layout. CPU-only demo; GPU memory analysis is a Phase 35 / 36 concern.
- Streaming responses. The demo uses a single JSON response, not server-sent events. SSE is a Phase 33 extension flagged as Phase 40 reading.
- Multi-step agent loops. The grammar tutor is single-turn. Multi-turn agent loops with planning live in Phase 32 territory and are out of scope here.
Next: theory/03-cost-and-observability-stitching.md — how Phase 34 (cost) + Phase 38 (CpQU) + Prometheus + Grafana + Tempo all become one dashboard.