05 - A worked latency budget for the mini-GPT on i5-8250U

A latency budget is a spreadsheet, not an intuition. To serve the §A13 mini-GPT on an i5-8250U via FastAPI, we add up every component and see which one dominates. The answer for this model at this scale: KV-cache management is not the bottleneck; the lack of batching when several requests arrive at once is. Cross-ref: Phase 22 (KV-cache), Phase 41 (the portal is the live example).

The budget

Borja's machine: Intel i5-8250U, 4C/8T Kaby Lake R, 62 GiB RAM, no CUDA. The §A13 grammar-tutor model is microscopic: ~500k params, vocab 600-ish forms, context window 64. A single tutor request encodes one short English sentence, runs the forward pass over the prefix, then auto-regressively decodes the correction (typically ≤ 16 tokens) plus a Spanish gloss (typically ≤ 8 tokens).

Naming, all per-request:

t_parse — JSON parse + Pydantic validation of the incoming request.
t_tokenize — BPE encode of the input sentence (Phase 11 tokenizer).
t_prefill — first forward pass over the full prefix (N input tokens, single shot).
t_decode_step — one auto-regressive decode step (KV-cache hit on every prior token).
t_detok — detokenize the output ids back to a UTF-8 string.
t_serialize — JSON-serialize the response.
K — number of decode tokens we generate.
t_total = t_parse + t_tokenize + t_prefill + K · t_decode_step + t_detok + t_serialize.
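
One way to collect these per-request terms is to wrap each stage in a timer. The sketch below is a minimal illustration, not the lab's instrumentation: `StageTimer` is a hypothetical helper, the stage bodies are stand-ins for the real tokenizer and decode loop, and only the standard library's `time.perf_counter` is assumed.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock milliseconds per named stage (hypothetical helper)."""

    def __init__(self):
        self.ms = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record elapsed time even if the stage raises.
            self.ms[name] += (time.perf_counter() - start) * 1000.0

    def total(self):
        return sum(self.ms.values())


timer = StageTimer()
with timer.stage("t_tokenize"):
    ids = [ord(c) for c in "She go to school."]  # stand-in for the BPE encode
for _ in range(3):                                # stand-in for K decode steps
    with timer.stage("t_decode_step"):
        sum(ids)                                  # stand-in for one forward step
```

Because `t_decode_step` is accumulated across iterations, `timer.ms["t_decode_step"]` holds the whole `K · t_decode_step` term; divide by `K` to recover the per-step cost.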

Measured numbers on i5-8250U (NumPy + hand-built attention, single thread)

These are the order-of-magnitude figures Phase 22's KV-cache lab produced. They are deliberately conservative — your wall-clock will land within 2× either way depending on BLAS vendor and thermal state.

| Component | Cost | Notes |
| --- | --- | --- |
| `t_parse` | 0.2 ms | FastAPI + Pydantic, single dict, no nested models. |
| `t_tokenize` | 0.8 ms | 600-merge BPE, regex pre-tokenize, ≤ 32 input tokens. |
| `t_prefill` (24 input toks) | 18 ms | One matmul per layer × 4 layers, attention is O(N²) but N=24. |
| `t_decode_step` (KV hit) | 6.5 ms | Linear in cached length but the cache is small (≤ 64). |
| `K` (decode tokens) | 20 tokens | Correction + " / " + Spanish gloss, p50. |
| `t_detok` | 0.3 ms | Reverse merge. |
| `t_serialize` | 0.2 ms | JSON dump. |
| `t_total` (single request) | ~150 ms | `0.2 + 0.8 + 18 + 20·6.5 + 0.3 + 0.2 ≈ 149.5`. |

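p50 ≈ 150 ms. At p95, with longer sentences and Spanish glosses creeping to K=28, the same sum gives about 200 ms (19.5 + 28 · 6.5 = 201.5 ms), plus whatever extra prefill the longer input costs.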
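
The table's arithmetic can be reproduced in a few lines, which makes it easy to re-run the budget with your own measurements. This is a sketch: `total_ms` and `COSTS` are illustrative names, the figures are copied from the table above, and the K = 28 case holds prefill at the 24-token cost.

```python
def total_ms(costs, k):
    """t_total = fixed per-request terms + K decode steps, all in milliseconds."""
    fixed = (costs["t_parse"] + costs["t_tokenize"] + costs["t_prefill"]
             + costs["t_detok"] + costs["t_serialize"])
    return fixed + k * costs["t_decode_step"]


COSTS = {
    "t_parse": 0.2, "t_tokenize": 0.8, "t_prefill": 18.0,
    "t_decode_step": 6.5, "t_detok": 0.3, "t_serialize": 0.2,
}

p50_ms = total_ms(COSTS, k=20)  # about 149.5
k28_ms = total_ms(COSTS, k=28)  # about 201.5
decode_share = 20 * COSTS["t_decode_step"] / p50_ms  # about 0.87
```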

Where is the time?

Decode dominates: 20 · 6.5 = 130 ms of the 150 ms total, so 87% of wall-clock is in the decode loop. Prefill is a single 18 ms blip; everything around it (parse, tokenize, detokenize, serialize) is 1.5 ms combined.

This is the right picture for a microscopic model on CPU. It is not the picture for a 7B model on a GPU, where prefill grows to seconds and decode is amortized per output token. The shape of the budget changes with the model, but the discipline — measure, attribute, then optimize the dominant term — does not.

So is the KV-cache the bottleneck?

For a single request: no. The KV-cache (Phase 22) is doing its job. Without it, each decode step would re-compute attention over the full prefix and t_decode_step would grow from 6.5 ms to ~18 ms (roughly the prefill cost). The cache saves us roughly \(20 \cdot 11.5 \approx 230\) ms per request: an uncached request would take about 380 ms instead of 150 ms, so the cache cuts p50 by about 60%. The cache is the difference between a tutor that feels instant and one that feels stuck.
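
The saving can be checked with the same per-step figures. This sketch takes the paragraph's simplification at face value, namely that an uncached step costs about as much as the 18 ms prefill. An uncached step would likely get more expensive as the sequence grows, so treat the result as a rough lower bound on what the cache buys.

```python
K = 20                   # decode tokens at p50
FIXED_MS = 19.5          # parse + tokenize + prefill + detok + serialize
STEP_CACHED_MS = 6.5     # decode step with a KV-cache hit
STEP_UNCACHED_MS = 18.0  # assumption from the text: roughly one prefill per step

with_cache = FIXED_MS + K * STEP_CACHED_MS       # 149.5 ms
without_cache = FIXED_MS + K * STEP_UNCACHED_MS  # 379.5 ms
saved_ms = without_cache - with_cache            # 230.0 ms
reduction = saved_ms / without_cache             # about 0.61
```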

What the cache does not help with is throughput under concurrent load. Each connected client has its own KV-cache; the bytes are cheap (the model is tiny) but the CPU is shared. Two clients hitting /correct at the same time without batching just serialize on the GIL-bound NumPy matmul. p50 doubles. p95 quadruples.

The actual bottleneck: lack of batching

Run the lab 02 + lab 03 throughput sweep. The shape:

| Concurrent clients | No batching p50 | Static batch p50 | Continuous batch p50 |
| --- | --- | --- | --- |
| 1 | 150 ms | 150 ms | 150 ms |
| 2 | 295 ms | 175 ms | 165 ms |
| 4 | 590 ms | 220 ms | 195 ms |
| 8 | 1180 ms | 320 ms | 240 ms |

Without batching, p50 grows linearly with concurrency, at close to one full request time (about 147 ms) per extra client, because each request waits for the previous to finish a token. With static batching (collect 4 requests, run them as one matmul), the matmul cost rises sub-linearly with batch size because BLAS amortizes the FMA overhead; in the sweep p50 still grows roughly linearly, but at only about 24 ms per extra client. With continuous batching (Phase 33 lab 03), short requests leave early and don't get stuck behind long ones, so p95 collapses too.
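
The static-batching idea can be sketched with nothing but `asyncio`. This is a minimal illustration under stated assumptions, not the lab 02 code: `batch_worker`, `submit`, and `run_batch` are hypothetical names, and `run_batch` stands in for one batched forward pass over all collected inputs.

```python
import asyncio


async def batch_worker(queue, run_batch, max_batch=4, max_wait_s=0.05):
    """Collect up to max_batch requests (or wait max_wait_s), then run them together."""
    loop = asyncio.get_running_loop()
    while True:
        batch = [await queue.get()]  # block until at least one request arrives
        deadline = loop.time() + max_wait_s
        while len(batch) < max_batch:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        outputs = run_batch([text for text, _ in batch])  # one call for the whole batch
        for (_, fut), out in zip(batch, outputs):
            fut.set_result(out)  # wake the waiting request handler


async def submit(queue, text):
    """What a request handler would call: enqueue the input and await its result."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((text, fut))
    return await fut


async def demo():
    queue = asyncio.Queue()
    batch_sizes = []

    def run_batch(inputs):  # stand-in for the batched model call
        batch_sizes.append(len(inputs))
        return [s.upper() for s in inputs]

    worker = asyncio.create_task(batch_worker(queue, run_batch))
    results = await asyncio.gather(*(submit(queue, s) for s in ["a", "b", "c", "d"]))
    worker.cancel()
    return results, batch_sizes
```

Calling `asyncio.run(demo())` returns the four results in submission order. The sketch calls `run_batch` directly on the event loop and does not handle a failing batch; a real server would probably offload the model call to a worker thread and propagate exceptions to the waiting futures.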

The §A13 grammar tutor at 8-client load is batching-bound, not cache-bound. The KV-cache is necessary but not sufficient. Phase 22 + Phase 33 are designed to teach exactly this lesson: the cache makes single-request latency tractable; batching makes multi-request throughput tractable.
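
To check the batching-bound claim against the sweep table, a least-squares slope over the four measured points gives the added p50 per extra client for each strategy. The helper name `slope` is illustrative and the data is copied from the table above.

```python
def slope(xs, ys):
    """Least-squares slope of ys against xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


clients = [1, 2, 4, 8]
p50_ms = {
    "none": [150, 295, 590, 1180],
    "static": [150, 175, 220, 320],
    "continuous": [150, 165, 195, 240],
}
ms_per_client = {name: slope(clients, ys) for name, ys in p50_ms.items()}
# roughly 147, 24 and 13 ms of added p50 per extra client
```

The no-batching slope is close to the full 150 ms single-request time, which is what serialized requests would produce; both batched variants add a small fraction of that per client.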

Cross-reference to Phase 41 (the portal)

The Phase 41 learner portal is the working example of the served system. It calls the Phase 32 grammar tutor as one of several endpoints (the "quiz me" and "exam" surfaces). Re-read docs/phase-41-learner-portal/theory/01-architecture.md (do not modify it) — the portal owns the lifespan / Depends / middleware story; Phase 33 owns the forward-pass latency story. They compose:

The portal's /quiz/submit handler is request-scoped (per-student session, CSRF, audit log) and adds ~3 ms of overhead on top of the 150 ms model call. Negligible.
The portal's load model is bounded: even with 30 concurrent learners, RPS at the tutor is bounded by think time between submissions (humans read explanations for seconds). Expected steady-state ≈ 1-2 RPS. Comfortable inside the batching budget.

The portal does not change the latency budget; it consumes it.
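
The 1-2 RPS estimate follows from a closed-loop model in which each learner submits once per think-plus-response cycle. The think times below (15 s and 30 s) are assumptions chosen for illustration; the text only says learners read for seconds. `steady_state_rps` is a hypothetical helper.

```python
def steady_state_rps(learners, think_s, latency_s=0.150):
    """Closed-loop arrival rate: one submission per think + response cycle per learner."""
    return learners / (think_s + latency_s)


fast = steady_state_rps(30, think_s=15.0)  # just under 2 RPS
slow = steady_state_rps(30, think_s=30.0)  # just under 1 RPS
in_flight = fast * 0.150                   # Little's law L = lambda * W, about 0.3
```

With well under one request in flight on average, even the faster assumption rarely puts two requests at the tutor at the same time.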

Engineering rule of thumb (CPU, microscopic model)

| Symptom | Likely cause | Phase to revisit |
| --- | --- | --- |
| p50 too high even at C=1 | KV-cache disabled or mis-keyed | 22 |
| p50 grows linearly with C | No batching | 33 lab 02/03 |
| p95 ≫ p50 | Static batch tail; or long-tail input sizes | 33 lab 03 |
| OOM under sustained load | Per-request KV-cache buffers never freed | 22 + 33 admission |
| p50 fine, throughput stalls | Single-threaded NumPy BLAS contention | 23 (X4) |

What this chapter does NOT cover

GPU latency budgets — Phase 23+ territory.
Speculative decoding to compress K — Phase 36 survey only.
Cross-region latency, CDN, TLS handshake. Beyond §A13 scope.

References

Patel et al., "Splitwise: Efficient Generative LLM Inference Using Phase Splitting" (ISCA 2024). The prefill-vs-decode separation our budget makes explicit.
Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention" (SOSP 2023). The KV-cache memory model Phase 22 borrows from.

Next: ../lab/00-minimal-fastapi.md or revisit 03-littles-law-and-capacity.md for the throughput side.