05 - A worked latency budget for the mini-GPT on i5-8250U
A latency budget is a spreadsheet, not a hunch. To serve the §A13 mini-GPT on an i5-8250U via FastAPI, we add up every component and see which one dominates. The answer for this model at this scale: KV-cache management is not the bottleneck; the lack of batching when several requests arrive at once is. Cross-ref: Phase 22 (KV-cache), Phase 41 (the portal is the live example).
The budget
Borja's machine: Intel i5-8250U, 4C/8T Kaby Lake R, 62 GiB RAM, no CUDA. The §A13 grammar-tutor model is microscopic: ~500k params, vocab 600-ish forms, context window 64. A single tutor request encodes one short English sentence, runs the forward pass over the prefix, then auto-regressively decodes the correction (typically ≤ 16 tokens) plus a Spanish gloss (typically ≤ 8 tokens).
Naming, all per-request:
- t_parse: JSON parse + Pydantic validation of the incoming request.
- t_tokenize: BPE encode of the input sentence (Phase 11 tokenizer).
- t_prefill: first forward pass over the full prefix (N input tokens, single shot).
- t_decode_step: one auto-regressive decode step (KV-cache hit on every prior token).
- t_detok: detokenize the output ids back to a UTF-8 string.
- t_serialize: JSON-serialize the response.
- K: number of decode tokens we generate.

t_total = t_parse + t_tokenize + t_prefill + K · t_decode_step + t_detok + t_serialize.
Measured numbers on i5-8250U (NumPy + hand-built attention, single thread)
These are the order-of-magnitude figures Phase 22's KV-cache lab produced. They are deliberately conservative — your wall-clock will land within 2× either way depending on BLAS vendor and thermal state.
| Component | Cost (ms) | Notes |
|---|---|---|
| t_parse | 0.2 | FastAPI + Pydantic, single dict, no nested models. |
| t_tokenize | 0.8 | 600-merge BPE, regex pre-tokenize, ≤ 32 input tokens. |
| t_prefill (24 input toks) | 18 | One matmul per layer × 4 layers, attention is O(N²) but N=24. |
| t_decode_step (KV hit) | 6.5 | Linear in cached length but the cache is small (≤ 64). |
| K (decode tokens) | 20 (a token count, not ms) | Correction + " / " + Spanish gloss, p50. |
| t_detok | 0.3 | Reverse merge. |
| t_serialize | 0.2 | JSON dump. |
| t_total (single request) | ~150 | 0.2 + 0.8 + 18 + 20·6.5 + 0.3 + 0.2 ≈ 149.5. |
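p50 ≈ 150 ms. For p95, longer sentences and Spanish glosses push K to 28; with the per-component costs above that gives 19.5 + 28 · 6.5 ≈ 201.5 ms, so p95 ≈ 200 ms before any extra prefill cost from the longer input.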
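The arithmetic in the table can be reproduced in a few lines. This is a minimal sketch: the dictionary simply copies the per-component costs listed above, and `total_latency_ms` and `decode_share` are hypothetical helper names, not part of the lab code. Prefill is held at its 24-token figure, so longer inputs would add to the result.

```python
# Per-request costs in milliseconds, copied from the budget table.
COSTS_MS = {
    "t_parse": 0.2,
    "t_tokenize": 0.8,
    "t_prefill": 18.0,
    "t_decode_step": 6.5,
    "t_detok": 0.3,
    "t_serialize": 0.2,
}


def total_latency_ms(k: int, costs: dict[str, float] = COSTS_MS) -> float:
    """t_total = fixed per-request costs + K * t_decode_step."""
    fixed = sum(ms for name, ms in costs.items() if name != "t_decode_step")
    return fixed + k * costs["t_decode_step"]


def decode_share(k: int, costs: dict[str, float] = COSTS_MS) -> float:
    """Fraction of t_total spent inside the decode loop."""
    return k * costs["t_decode_step"] / total_latency_ms(k, costs)


if __name__ == "__main__":
    for k in (20, 28):
        print(f"K={k}: {total_latency_ms(k):.1f} ms, "
              f"decode share {decode_share(k):.0%}")
```

Changing a single entry in `COSTS_MS` (for example after re-measuring on a different BLAS build) shows immediately whether the dominant term has moved.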
Where is the time?
Decode dominates. 20 · 6.5 = 130 ms of the 150 ms total — 87% of wall-clock is in the decode loop. Prefill is a single 18 ms blip; everything around it (parse, tokenize, serialize) is ≤ 2 ms combined.
This is the right picture for a microscopic model on CPU. It is not the picture for a 7B model on a GPU, where prefill grows to seconds and decode is amortized per output token. The shape of the budget changes with the model, but the discipline — measure, attribute, then optimize the dominant term — does not.
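The "measure, attribute" step can be sketched with nothing beyond the standard library. The `StageTimer` class below is a hypothetical helper, not something the labs are known to ship; the `time.sleep` calls stand in for the real prefill and decode calls, so the printed numbers say nothing about the model itself.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock milliseconds per named stage (illustrative)."""

    def __init__(self) -> None:
        self.ms = defaultdict(float)

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Accumulate, so a stage entered K times reports its summed cost.
            self.ms[name] += (time.perf_counter() - start) * 1000.0

    def report(self) -> list[tuple[str, float, float]]:
        """Return (stage, ms, share of total) rows, largest first."""
        total = sum(self.ms.values()) or 1.0
        rows = [(name, ms, ms / total) for name, ms in self.ms.items()]
        return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    timer = StageTimer()
    with timer.stage("t_prefill"):
        time.sleep(0.002)      # stand-in for the real prefill call
    for _ in range(5):
        with timer.stage("t_decode"):
            time.sleep(0.001)  # stand-in for one decode step
    for name, ms, share in timer.report():
        print(f"{name:10s} {ms:7.2f} ms  {share:5.1%}")
```

Wrapping each component of the request handler in `timer.stage(...)` yields the same kind of table as the budget above, sorted so the dominant term is the first row.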
So is the KV-cache the bottleneck?
For a single request: no. The KV-cache (Phase 22) is doing its job. Without it, each decode step would recompute attention over the full prefix, and t_decode_step would grow from 6.5 ms to roughly the prefill cost, ~18 ms. The cache therefore saves about \(20 \cdot (18 - 6.5) = 20 \cdot 11.5 = 230\) ms per request: p50 would be about 380 ms without it, against 150 ms with it, a cut of roughly 60%. The cache is the difference between a tutor that feels instant and one that feels stuck.
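What the cache buys can be seen in a toy version of one attention head. This is a minimal sketch under loud assumptions: one head, one layer, no positional encoding, random weights, and names (`attend_full`, `KVCache`) invented for illustration. It is not the Phase 22 lab code. The point is only that the cached step projects one new token per call, while the uncached path re-projects every token in the prefix.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def attend_full(x: np.ndarray, wq, wk, wv) -> np.ndarray:
    """No cache: recompute K and V for all T tokens, return the last output."""
    q = x[-1] @ wq                     # query for the newest token
    k = x @ wk                         # (T, d): redone on every step
    v = x @ wv
    scores = (k @ q) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


class KVCache:
    """Per-request cache of key/value rows (illustrative)."""

    def __init__(self, d: int) -> None:
        self.k = np.empty((0, d))
        self.v = np.empty((0, d))

    def step(self, x_new: np.ndarray, wq, wk, wv) -> np.ndarray:
        """Project only the new token, append it, attend over cached rows."""
        # vstack copies the cache each step; a real implementation would
        # preallocate up to the context window instead.
        self.k = np.vstack([self.k, x_new @ wk])
        self.v = np.vstack([self.v, x_new @ wv])
        q = x_new @ wq
        scores = (self.k @ q) / np.sqrt(q.shape[-1])
        return softmax(scores) @ self.v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 16
    x = rng.normal(size=(8, d))        # 8 token embeddings (made up)
    wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
    cache = KVCache(d)
    for t in range(len(x)):
        cached = cache.step(x[t], wq, wk, wv)
        assert np.allclose(cached, attend_full(x[: t + 1], wq, wk, wv))
    print("cached decode matches full recompute")
```

Both paths produce the same output for the newest token; they differ only in how much work is repeated per step.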
What the cache does not help with is throughput under concurrent load. Each connected client has its own KV-cache; the bytes are cheap (the model is tiny) but the CPU is shared. Without batching, two clients hitting /correct at the same time simply take turns: the forward pass is single-threaded, and the Python-level loop around the small NumPy matmuls holds the GIL. p50 roughly doubles. p95 quadruples.
The actual bottleneck: lack of batching
Run the lab 02 + lab 03 throughput sweep. The shape:
| Concurrent clients | No batching p50 | Static batch p50 | Continuous batch p50 |
|---|---|---|---|
| 1 | 150 ms | 150 ms | 150 ms |
| 2 | 295 ms | 175 ms | 165 ms |
| 4 | 590 ms | 220 ms | 195 ms |
| 8 | 1180 ms | 320 ms | 240 ms |
Without batching, p50 grows linearly with concurrency: each request queues behind the ones already running. In the table that is about 147 ms per extra client. With static batching (collect 4 requests, run them as one matmul), the cost per step rises sub-linearly because a single batched matmul amortizes per-call overhead and weight reads across requests; p50 still grows roughly linearly, but at only about 24 ms per extra client. With continuous batching (Phase 33 lab 03), short requests leave early and don't get stuck behind long ones, so the slope drops further (about 13 ms per extra client) and p95 collapses too.
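The "run them as one matmul" idea can be illustrated on a single projection. This is a sketch with made-up shapes and hypothetical function names, not the lab 02 batcher: it shows only that stacking B per-request hidden vectors into one (B, d) matrix gives the same numbers as B separate products while making one BLAS call instead of B. Attention over per-request caches of different lengths needs padding or masking, which is left out here.

```python
import numpy as np


def project_unbatched(hidden: list[np.ndarray], w: np.ndarray) -> list[np.ndarray]:
    """One (d,) @ (d, d_out) product per request: B separate calls."""
    return [h @ w for h in hidden]


def project_batched(hidden: list[np.ndarray], w: np.ndarray) -> np.ndarray:
    """Stack B requests into a (B, d) matrix and do a single matmul."""
    return np.stack(hidden) @ w


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, d_out, batch = 64, 64, 4        # illustrative sizes only
    w = rng.normal(size=(d, d_out))
    hidden = [rng.normal(size=d) for _ in range(batch)]

    separate = project_unbatched(hidden, w)
    together = project_batched(hidden, w)

    # Row i of the batched result is request i's output.
    assert all(np.allclose(together[i], separate[i]) for i in range(batch))
    print("batched output shape:", together.shape)
```

Timing the two functions on the target machine is the quickest way to see how much of the per-step cost is call overhead rather than arithmetic.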
The §A13 grammar tutor at 8-client load is batching-bound, not cache-bound. The KV-cache is necessary but not sufficient. Phase 22 + Phase 33 are designed to teach exactly this lesson: the cache makes single-request latency tractable; batching makes multi-request throughput tractable.
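The batching-bound conclusion can be checked against the sweep table with a straight-line fit. The numbers below are copied from that table; `slope_ms_per_client` is a hypothetical helper, and a least-squares line through four points is a rough summary, not a model of the scheduler.

```python
import numpy as np

CLIENTS = np.array([1, 2, 4, 8])

# p50 in milliseconds per strategy, copied from the throughput sweep table.
P50_MS = {
    "no batching": np.array([150, 295, 590, 1180]),
    "static batch": np.array([150, 175, 220, 320]),
    "continuous batch": np.array([150, 165, 195, 240]),
}


def slope_ms_per_client(clients: np.ndarray, p50_ms: np.ndarray) -> float:
    """Least-squares slope of p50 against concurrency, in ms per client."""
    slope, _intercept = np.polyfit(clients, p50_ms, deg=1)
    return float(slope)


if __name__ == "__main__":
    for name, p50 in P50_MS.items():
        print(f"{name:17s} {slope_ms_per_client(CLIENTS, p50):6.1f} ms per extra client")
```

The gap between the first slope and the other two is the quantitative form of "batching-bound, not cache-bound": every strategy starts from the same 150 ms single-request latency, and only the cost of each additional client differs.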
Cross-reference to Phase 41 (the portal)
The Phase 41 learner portal is the working example of the served system. It calls the Phase 32 grammar tutor as one of several endpoints (the "quiz me" and "exam" surfaces). Re-read docs/phase-41-learner-portal/theory/01-architecture.md (do not modify it) — the portal owns the lifespan / Depends / middleware story; Phase 33 owns the forward-pass latency story. They compose:
- The portal's /quiz/submit handler is request-scoped (per-student session, CSRF, audit log) and adds ~3 ms of overhead on top of the 150 ms model call. Negligible.
- The portal's load model is bounded: even with 30 concurrent learners, RPS at the tutor is bounded by think time between submissions (humans read explanations for seconds). Expected steady-state ≈ 1-2 RPS. Comfortable inside the batching budget.
The portal does not change the latency budget; it consumes it.
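The bounded-load claim follows from the closed-loop (interactive) form of Little's law: with N learners who each wait for a reply and then think for Z seconds, throughput is N / (R + Z), where R is the response time. The sketch below plugs in the 30 learners and the ~153 ms response time from the text; the think times of 15 s and 30 s are assumptions chosen for illustration, not measured values.

```python
def closed_loop_rps(learners: int, think_s: float, service_s: float) -> float:
    """Steady-state requests/second when each learner waits, then thinks."""
    return learners / (think_s + service_s)


def utilization(rps: float, service_s: float) -> float:
    """Fraction of one worker's time spent serving at that request rate."""
    return rps * service_s


if __name__ == "__main__":
    service_s = 0.153                  # 150 ms model call + ~3 ms portal overhead
    for think_s in (15.0, 30.0):       # assumed think times, not measured
        rps = closed_loop_rps(30, think_s, service_s)
        print(f"think={think_s:4.0f} s -> {rps:.2f} RPS, "
              f"utilization {utilization(rps, service_s):.0%}")
```

Under these assumed think times the request rate stays between roughly 1 and 2 RPS, which is the range the portal's load model quotes.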
Engineering rule of thumb (CPU, microscopic model)
| Symptom | Likely cause | Phase to revisit |
|---|---|---|
| p50 too high even at C=1 | KV-cache disabled or mis-keyed | 22 |
| p50 grows linearly with C | No batching | 33 lab 02/03 |
| p95 ≫ p50 | Static batch tail; or long-tail input sizes | 33 lab 03 |
| OOM under sustained load | Per-request KV-cache buffers never freed | 22 + 33 admission |
| p50 fine, throughput stalls | Single-threaded NumPy BLAS contention | 23 (X4) |
What this chapter does NOT cover
- GPU latency budgets — Phase 23+ territory.
- Speculative decoding to compress K: Phase 36 survey only.
- Cross-region latency, CDN, TLS handshake. Beyond §A13 scope.
Reference
- Patel et al., "Splitwise: Efficient Generative LLM Inference Using Phase Splitting" (ISCA 2024). The prefill-vs-decode separation our budget makes explicit.
- Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention" (SOSP 2023). The KV-cache memory model Phase 22 borrows from.
Next: ../lab/00-minimal-fastapi.md or revisit 03-littles-law-and-capacity.md for the throughput side.