
Theory 02 — End-to-end data flow: one request, every layer

A single HTTP request to the grammar tutor crosses nine layers, from the kernel's accept(2) to the JSON on the response socket. Here we follow every byte: how many, in what form, at which stage of the latency budget, and at what percentile. Adding per-stage p95s does not give the total p95; that fallacy is demonstrated at the end of the chapter.

Why a byte-level walkthrough matters

Phases 11–34 each built one stage of the pipeline in isolation. The grammar tutor's request path is the first time a learner sees them composed. Three things break under composition that no single-stage view exposes:

Byte format mismatches. Phase 11's BPE emits list[int]; Phase 17's Mini-GPT consumes a torch.LongTensor of shape (batch, seq_len). The list[int] → tensor cast has to happen somewhere; put it in the wrong place and you allocate twice.
Latency budget collisions. Each stage was built as if it had 500 ms to itself. The user has 5 s total for all nine. The math doesn't add up unless someone audits the sum, which is Phase 39's job.
Percentile arithmetic. Engineers fluent in percentiles still get this wrong: \(p_{95}(A + B) \ne p_{95}(A) + p_{95}(B)\). The capstone dashboard's per-stage p95 panel is informational; only the end-to-end p95 (computed on full traces) is load-bearing.

This chapter walks one request — POST /v1/grammar/correct with body {"sentence": "Yesterday I goed to the store"} — through every layer with concrete numbers.

The nine layers

#	Layer	Phase	Input	Output	Typical bytes	Typical ms
1	HTTP ingress (FastAPI + uvicorn)	33	TCP bytes	`Request` object	~600 B in, ~400 B out	0.5
2	Schema validation (Pydantic + Phase 30 output schema)	30	`Request` body	`CorrectRequest` pydantic	n/a	0.3
3	Security guards (rate-limit, body-size, injection filter)	33, 37	`CorrectRequest`	`CorrectRequest` (or 4xx)	n/a	0.2
4	BPE tokenization	11	`str` (`"Yesterday I goed..."`)	`list[int]` of length ~10	~30 B → 80 B (int64)	1.0
5	RAG retrieval (bi-encoder + BM25 hybrid)	29	`list[int]` or raw text	top-5 chunks (~2 KB)	~80 B → ~2 KB	25
6	Mini-GPT prefill (encode prompt + context)	17, 22	`(1, ~80)` int64	`(1, 80, \|V\|)` float32	~640 B → ~5 MB logits + ~3 MB KV cache	800
7	Mini-GPT decode (generate ~30 tokens)	17, 21, 22	KV cache	`(1, ~30)` int64 + 30 × KV-cache appends	~30 × 100 KB = 3 MB	2,500
8	Output formatting (structured JSON via Phase 30 schema)	30	`(1, ~30)` int64 + BPE decode	`CorrectResponse` JSON	~80 B → ~400 B	2.0
9	Cost emission + trace flush + response write	34	`CorrectResponse`	TCP bytes	400 B out + ~2 KB metrics	1.5

Total wall time (typical, single request, warm cache): ≈ 3,330 ms.

Against the target: the DoD requires p95 < 5,000 ms under 10 concurrent requests (Plan §7). The 3.3 s typical leaves roughly 1.7 s of headroom for tail latency and queueing. Phase 33's continuous-batching theory says queue wait dominates the tail; the capstone dashboard's queue depth panel is the leading indicator.
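
The total and the headroom are plain sums over the table's "Typical ms" column. A minimal sketch of that arithmetic follows; the dictionary keys are illustrative labels, not identifiers from the codebase.

```python
# Typical per-layer latency in ms, copied from the nine-layer table.
# The key names are illustrative labels only.
TYPICAL_MS = {
    "http_ingress": 0.5,
    "schema_validation": 0.3,
    "security_guards": 0.2,
    "bpe_tokenize": 1.0,
    "rag_retrieve": 25.0,
    "prefill": 800.0,
    "decode": 2500.0,
    "format_output": 2.0,
    "cost_trace_write": 1.5,
}

DOD_P95_TARGET_MS = 5000.0  # DoD target quoted in the text

total_ms = sum(TYPICAL_MS.values())
headroom_ms = DOD_P95_TARGET_MS - total_ms
# Share of the typical request spent inside the model (prefill + decode).
model_share = (TYPICAL_MS["prefill"] + TYPICAL_MS["decode"]) / total_ms

print(f"typical total:      {total_ms:.1f} ms")
print(f"headroom to target: {headroom_ms:.1f} ms")
print(f"prefill + decode:   {model_share:.1%} of the total")
```

Almost all of the typical time sits in the two model stages, which is why the headroom is best read as slack for decode variance and queueing rather than for the seven cheap layers.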

Where the bytes are

Two patterns to internalize:

Inputs are tiny, internal state is large. Request on the wire: ~600 B. KV cache during decode: ~3 MB. That ratio (5000×) explains why Phase 22 (KV cache) matters: the cheap thing to optimize would be the bytes the user sends; the valuable thing to optimize is the bytes the model keeps in RAM.
The dominant cost is decode, not prefill. Prefill is 800 ms for the whole ~80-token prompt at once (about 10 ms per token); decode is 2,500 ms for 30 tokens one at a time (about 83 ms per token). The decode-dominance is intrinsic to autoregressive generation and is the headline finding of Phase 21's cost model.
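
Both patterns fall out of a few lines of arithmetic. The sketch below uses the table's figures. The vocabulary size is an assumption (the text only writes |V|), picked so that the prefill logits land near the quoted 5 MB.

```python
FLOAT32_BYTES = 4

# Figures taken from the nine-layer table.
REQUEST_BYTES = 600          # HTTP request on the wire
KV_CACHE_BYTES = 3_000_000   # ~3 MB KV cache
PROMPT_TOKENS = 80           # prefill input shape (1, ~80)
NEW_TOKENS = 30              # tokens generated during decode
PREFILL_MS = 800.0
DECODE_MS = 2500.0

# ASSUMPTION: the text never states |V|; 16,384 is chosen only because it
# makes the logits tensor come out close to the ~5 MB in the table.
VOCAB_SIZE = 16_384

# Logits tensor of shape (1, PROMPT_TOKENS, VOCAB_SIZE) in float32.
logits_bytes = PROMPT_TOKENS * VOCAB_SIZE * FLOAT32_BYTES

# Pattern 1: internal state dwarfs the request.
amplification = KV_CACHE_BYTES / REQUEST_BYTES

# Pattern 2: per-token cost of prefill (parallel) vs decode (sequential).
prefill_ms_per_token = PREFILL_MS / PROMPT_TOKENS
decode_ms_per_token = DECODE_MS / NEW_TOKENS

print(f"logits: {logits_bytes / 1e6:.2f} MB")
print(f"KV cache / request: {amplification:.0f}x")
print(f"prefill: {prefill_ms_per_token:.1f} ms/token")
print(f"decode: {decode_ms_per_token:.1f} ms/token")
```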

The latency budget

Allocate 5,000 ms across stages. Phase 39's Plan §2 says: allocate proportional to \(\sigma_i\) (each stage's standard deviation). In practice for a CPU-only single-node demo:

Stage	Allocation	Justification
HTTP + schema + guards	50 ms	Low variance; near-constant on warm process
BPE tokenize	20 ms	Low variance; pure-Python loop, but bounded
RAG retrieve	200 ms	Moderate variance; index size affects tail
Mini-GPT prefill	1,200 ms	High variance with sequence length
Mini-GPT decode	3,300 ms	Dominant. Variance with output length
Format + cost + flush	30 ms	Low variance
Buffer (queueing, GC pauses)	200 ms	Slack
Total budget	5,000 ms	matches DoD target

The Grafana dashboard's latency budget panel renders this as a stacked bar; the actual timing per request is rendered below it. Visual diffs (budget vs actual) catch budget violations within seconds of regression.
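
The proportional rule from Plan §2 can be sketched as follows. The standard deviations are made-up illustrative values, not measurements, and the result is not meant to reproduce the hand-tuned table, which also reserves an explicit buffer.

```python
def allocate_budget(total_ms: float, sigma_ms: dict) -> dict:
    """Split total_ms across stages in proportion to each stage's std dev."""
    sigma_sum = sum(sigma_ms.values())
    if sigma_sum <= 0:
        raise ValueError("at least one stage needs a positive standard deviation")
    return {stage: total_ms * sigma / sigma_sum for stage, sigma in sigma_ms.items()}


# HYPOTHETICAL per-stage standard deviations in ms, for illustration only.
EXAMPLE_SIGMA_MS = {
    "http_schema_guards": 5.0,
    "bpe_tokenize": 2.0,
    "rag_retrieve": 40.0,
    "prefill": 250.0,
    "decode": 700.0,
    "format_cost_flush": 3.0,
}

allocation = allocate_budget(5000.0, EXAMPLE_SIGMA_MS)
for stage, ms in allocation.items():
    print(f"{stage:20s} {ms:8.1f} ms")
```

The design intuition is that a stage with little variance needs little slack above its typical time, while a high-variance stage such as decode needs most of it.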

What "budget" actually means

A latency budget is not a promise that no stage will ever exceed its allocation. It means that a stage which exceeds its allocation repeatedly has created a debt to pay down. The dashboard monitors a budget burn rate per stage (stage time / allocation); sustained burn > 1.5× for any stage triggers a Phase-40 carry-over.
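
A minimal sketch of the burn-rate check, using the allocations from the budget table. The text does not define "sustained", so the three-consecutive-observations rule below is an assumption, as are the function names.

```python
# Allocations in ms, copied from the budget table (buffer kept separate).
ALLOCATION_MS = {
    "http_schema_guards": 50,
    "bpe_tokenize": 20,
    "rag_retrieve": 200,
    "prefill": 1200,
    "decode": 3300,
    "format_cost_flush": 30,
}
BUFFER_MS = 200
assert sum(ALLOCATION_MS.values()) + BUFFER_MS == 5000  # matches the DoD target

BURN_THRESHOLD = 1.5   # from the text
MIN_CONSECUTIVE = 3    # ASSUMPTION: what counts as "sustained"


def burn_rate(stage: str, observed_ms: float) -> float:
    """Stage time divided by its allocation; 1.0 means exactly on budget."""
    return observed_ms / ALLOCATION_MS[stage]


def is_sustained_overburn(rates, threshold=BURN_THRESHOLD,
                          min_consecutive=MIN_CONSECUTIVE) -> bool:
    """True if `rates` has min_consecutive values in a row above threshold."""
    run = 0
    for rate in rates:
        run = run + 1 if rate > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# One on-budget decode followed by three slow ones.
decode_rates = [burn_rate("decode", ms) for ms in (2500, 5200, 5400, 5100)]
print([round(r, 2) for r in decode_rates], is_sustained_overburn(decode_rates))
```

A single slow request resets nothing and triggers nothing; only a run of over-budget observations counts as debt.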

The percentile-addition fallacy

A common bug in performance dashboards:

"Each stage's p95 is shown. The sum of p95s is reported as 'end-to-end p95'."

This is wrong. Mathematically:

\[p_{95}\left(\sum_{i} X_i\right) \ne \sum_{i} p_{95}(X_i)\]

The intuition: the p95 of stage A and the p95 of stage B are typically observed on different requests. The request that's slow at stage A is usually not the same request that's slow at stage B. Summing per-stage p95s implicitly assumes the stages are perfectly correlated, so that every stage has its bad moment on the same request; when stages are independent or only weakly correlated, the actual end-to-end p95 is lower than the sum.

A concrete demonstration

Take two independent stages, each with latency uniformly distributed in \([100, 1000]\) ms:

\(p_{95}(A) = 955\) ms
\(p_{95}(B) = 955\) ms
Naïve sum: \(1910\) ms
Actual \(p_{95}(A + B)\): ≈ \(1715\) ms (closed form \(2000 - 900\sqrt{0.1}\); a simulation of \(10^6\) samples agrees)

The naïve estimate overstates the real p95 by ~11%. In a dashboard with 9 stages, the overstatement compounds. Engineers see "tail latency" that isn't actually there and chase the wrong optimization.
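
The demonstration can be reproduced with the standard library alone. This sketch assumes A and B are independent, uses a nearest-rank percentile, and checks the simulated value against the closed-form quantile of the sum of two uniforms (a triangular distribution). The function names are illustrative.

```python
import math
import random


def p95(samples):
    """Nearest-rank 95th percentile."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def simulate(n=200_000, lo=100.0, hi=1000.0, seed=0):
    """Draw n independent (A, B) pairs and return p95(A), p95(B), p95(A+B)."""
    rng = random.Random(seed)
    a = [rng.uniform(lo, hi) for _ in range(n)]
    b = [rng.uniform(lo, hi) for _ in range(n)]
    total = [x + y for x, y in zip(a, b)]
    return p95(a), p95(b), p95(total)


def closed_form_quantile_of_sum(lo=100.0, hi=1000.0, q=0.95):
    """Quantile of A+B for independent U(lo, hi); valid for q >= 0.5.

    In the upper half, P(A+B > s) = (2*hi - s)**2 / (2 * (hi - lo)**2).
    """
    return 2 * hi - (hi - lo) * math.sqrt(2 * (1 - q))


p95_a, p95_b, p95_total = simulate()
naive = p95_a + p95_b
exact = closed_form_quantile_of_sum()
print(f"naive sum of p95s: {naive:.0f} ms")
print(f"simulated p95(A+B): {p95_total:.0f} ms (closed form {exact:.0f} ms)")
print(f"overstatement: {100 * (naive - p95_total) / p95_total:.1f}%")
```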

Mitigation: the dashboard computes end-to-end percentiles from full request traces (Tempo / Jaeger), not from per-stage summary statistics. Per-stage p95 is informational; it tells you which stage is consistently slowest, not what the end-to-end tail is.

Lab 01 includes a one-shot script that fetches recent traces, computes both numbers, and prints their delta. The first run typically shows a >10% delta — the capstone makes this visible.
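
This is not the Lab 01 script, only a sketch of the comparison it describes. It assumes traces have already been fetched and flattened into one `{span_name: duration_ms}` dict per request; that record shape, the function names, and the synthetic data are all hypothetical.

```python
import random
import statistics


def p95(values):
    """95th percentile: the last of the 19 cut points for n=20."""
    return statistics.quantiles(values, n=20)[-1]


def compare_p95(traces):
    """Compare the sum of per-stage p95s with the p95 of full-trace totals."""
    stages = list(traces[0])
    sum_of_stage_p95 = sum(p95([t[s] for t in traces]) for s in stages)
    end_to_end_p95 = p95([sum(t.values()) for t in traces])
    delta_pct = 100.0 * (sum_of_stage_p95 - end_to_end_p95) / end_to_end_p95
    return {
        "sum_of_stage_p95": sum_of_stage_p95,
        "end_to_end_p95": end_to_end_p95,
        "delta_pct": delta_pct,
    }


def synthetic_traces(n=5000, seed=0):
    """Stand-in for fetched traces: three independent stages, made-up ranges."""
    rng = random.Random(seed)
    return [
        {
            "retrieve.hybrid": rng.uniform(10, 60),
            "model.prefill": rng.uniform(600, 1200),
            "model.decode": rng.uniform(1800, 3600),
        }
        for _ in range(n)
    ]


result = compare_p95(synthetic_traces())
print({k: round(v, 1) for k, v in result.items()})
```

The key point is that `end_to_end_p95` is computed per request first and aggregated second; the per-stage sum aggregates first and adds second.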

Trace propagation: making the picture work

The dashboard's end-to-end latency view depends on every stage emitting a span with the same trace_id and a parent-child relationship. The contract from Theory 01:

Stage	Emits span	Parent
1 — HTTP ingress	`http.request`	(root)
2 — Validation	`validate.body`	`http.request`
3 — Guards	`security.check`	`http.request`
4 — BPE	`tokenize.bpe`	`http.request`
5 — RAG	`retrieve.hybrid`	`http.request`
6 — Prefill	`model.prefill`	`http.request`
7 — Decode	`model.decode`	`http.request`
8 — Format	`format.json`	`http.request`
9 — Cost	`cost.emit`	`http.request`

Every span carries request_id as an attribute. The Phase 34 cost emitter writes the cost as a span attribute (not a separate metric line) so the trace and the cost are joined on a single ID without a downstream JOIN.
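
The contract in the table is mechanical enough to check in code. The sketch below uses a plain dataclass as a stand-in for whatever span type the tracing library exports; the `Span` class and `check_trace` function are hypothetical, and only the span names and the request_id attribute come from the text.

```python
from dataclasses import dataclass, field
from typing import Optional

ROOT_NAME = "http.request"
CHILD_NAMES = [
    "validate.body", "security.check", "tokenize.bpe", "retrieve.hybrid",
    "model.prefill", "model.decode", "format.json", "cost.emit",
]


@dataclass
class Span:
    """Hypothetical minimal span record, not a real tracing-library type."""
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    attributes: dict = field(default_factory=dict)


def check_trace(spans):
    """Return a list of contract violations; an empty list means the trace is valid."""
    roots = [s for s in spans if s.parent_id is None]
    if len(roots) != 1 or roots[0].name != ROOT_NAME:
        return [f"expected exactly one root span named {ROOT_NAME!r}"]
    root = roots[0]
    problems = []
    for s in spans:
        if s.trace_id != root.trace_id:
            problems.append(f"{s.name}: trace_id differs from root")
        if s is not root and s.parent_id != root.span_id:
            problems.append(f"{s.name}: parent is not {ROOT_NAME}")
        if "request_id" not in s.attributes:
            problems.append(f"{s.name}: missing request_id attribute")
    missing = set(CHILD_NAMES) - {s.name for s in spans}
    problems.extend(f"missing span: {name}" for name in sorted(missing))
    return problems


def example_trace():
    """Build one trace that satisfies the contract."""
    root = Span(ROOT_NAME, "t1", "s0", None, {"request_id": "req-1"})
    children = [
        Span(name, "t1", f"s{i}", "s0", {"request_id": "req-1"})
        for i, name in enumerate(CHILD_NAMES, start=1)
    ]
    return [root, *children]


print(check_trace(example_trace()))
```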

Cross-process boundary: the MCP tool

If the request triggers an MCP tool call (rare in the grammar tutor path; common in the security run-through of Lab 03), the child process must inherit the trace context. Mechanism:

Parent serializes traceparent + tracestate into env vars (TRACEPARENT, TRACESTATE), using the header value formats defined by the W3C Trace Context spec.
Subprocess reads them on startup and re-establishes the span context.
Subprocess emits its spans as children of the parent.

Phase 31's tool blueprint encodes this; Phase 39 verifies it in Lab 03 by asserting the malicious-payload span shows as a child of the demo's request span.
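
The three steps above can be sketched with the standard library. The traceparent layout (`00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>`) is the W3C format; everything else is an assumption for illustration: the helper names, the inline child script, and the omission of tracestate. A real tool would hand the parsed IDs to its tracing SDK rather than print them.

```python
import os
import re
import secrets
import subprocess
import sys

# Version 00 traceparent: lowercase hex trace-id, parent-id, and flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Serialize the parent's span context in traceparent format."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(value):
    """Return the parsed fields, or None if the value is missing or invalid."""
    match = TRACEPARENT_RE.match(value or "")
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None  # all-zero IDs are invalid per the spec
    return {"trace_id": trace_id, "parent_id": parent_id, "flags": flags}


# Hypothetical child: reads TRACEPARENT on startup and reports its trace id.
CHILD_CODE = """
import os
parts = os.environ.get("TRACEPARENT", "").split("-")
print(parts[1] if len(parts) == 4 else "orphan")
"""


def run_child(traceparent=None) -> str:
    """Start the child with (or without) the trace context in its environment."""
    env = dict(os.environ)
    env.pop("TRACEPARENT", None)
    if traceparent is not None:
        env["TRACEPARENT"] = traceparent
    done = subprocess.run([sys.executable, "-c", CHILD_CODE], env=env,
                          capture_output=True, text=True, check=True)
    return done.stdout.strip()


trace_id = secrets.token_hex(16)  # 32 hex characters
span_id = secrets.token_hex(8)    # 16 hex characters
header = make_traceparent(trace_id, span_id)
print(run_child(header) == trace_id, run_child(None))
```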

Failure mode (Pitfall 5 from the Plan): if TRACEPARENT isn't propagated, the subprocess emits orphan spans. The dashboard's "Orphan span count" panel fires an alert. Lab 01 has a one-line audit.

The byte journey, condensed

A 600-byte request becomes:

30 B of UTF-8 sentence text (layer 4 input).
80 B of int64 token IDs for the sentence (layer 4 output), growing to ~640 B once the retrieved context joins the prompt (layer 6 input, shape (1, ~80)).
5 MB of float32 logits (layer 6 internal).
3 MB of KV cache (layer 6 / 7 internal).
400 B of JSON response (layer 8 output, layer 9 input).
600 B of HTTP response + 2 KB of metrics + ~5 KB of trace data (layer 9 output).

The 5,000× amplification from a 600 B request to a 3 MB KV cache (more if the transient 5 MB of prefill logits is counted) is the raison d'être of every memory-aware optimization the curriculum touched: tokenizer compactness (Phase 11), quantization (Phase 26), KV cache sharing (Phase 22), continuous batching (Phase 33). The capstone makes this amplification visible.

What this theory does NOT cover

The internals of any single stage. Each is covered in its originating phase's theory.
GPU memory layout. CPU-only demo; GPU memory analysis is a Phase 35 / 36 concern.
Streaming responses. The demo uses a single JSON response, not server-sent events. SSE is a Phase 33 extension flagged as Phase 40 reading.
Multi-step agent loops. The grammar tutor is single-turn. Multi-turn agent loops with planning live in Phase 32 territory and are out of scope here.

Next: theory/03-cost-and-observability-stitching.md — how Phase 34 (cost) + Phase 38 (CpQU) + Prometheus + Grafana + Tempo all become one dashboard.