Phase 33 — Inference Serving: From FastAPI to Continuous Batching
Requires: 22 — KV Cache: From Math to Memory · 32 — Agents: Planning, Memory, Sandboxing (Grammar Tutor) Teaches:
serving · continuous-batching · scheduling · littles-law · load-testing
Jump to any chapter from the phase reference index.
Chapter map
Phase 32 built the tutor agent. Phase 33 puts it behind an HTTP endpoint that holds up under concurrent load by using continuous batching. CPU-only, no vLLM, no Ray. The idea is to feel the cost of scheduling before delegating it.
Anchors: LYNX_CORTEX.md §4 / PHASE 33, PHASE_33_PLAN.md, LYNX_CORTEX_ADDENDUM.md §A13.
Goal
Take the Phase 32 grammar-tutor agent and put it behind a production-shaped HTTP service. Cover the path from uvicorn + FastAPI synchronous serving → async handlers → static batching → continuous (in-flight) batching.
By the end Borja can: (a) write a minimal FastAPI service that wraps the agent; (b) explain why static batching has tail-latency problems and what continuous batching changes; (c) implement a tiny continuous-batching scheduler in Python (single-process, no vLLM dependency); (d) benchmark p50/p95/p99 latency on the verb-correction workload; (e) apply Little's law to size the queue.
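Goal (e) rests on Little's law, L = λW: the mean number of requests in the system equals the arrival rate times the mean time each request spends in it. The sketch below turns that into a queue bound. The function names, the example numbers, and the `headroom` safety factor are assumptions for illustration, not values from the phase plan.

```python
import math


def in_flight_requests(arrival_rate_rps: float, mean_latency_s: float) -> float:
    """Little's law: L = lambda * W (mean number of requests in the system)."""
    return arrival_rate_rps * mean_latency_s


def queue_capacity(arrival_rate_rps: float, mean_latency_s: float,
                   headroom: float = 2.0) -> int:
    """Queue bound: Little's-law occupancy times an assumed safety factor."""
    occupancy = in_flight_requests(arrival_rate_rps, mean_latency_s)
    return math.ceil(occupancy * headroom)


# Hypothetical load: 8 requests/s arriving, 1.5 s mean time in the system.
print(in_flight_requests(8.0, 1.5))  # 12.0
print(queue_capacity(8.0, 1.5))      # 24
```

Little's law gives a steady-state mean, so the headroom factor is what absorbs bursts. A bound taken straight from L would reject requests whenever occupancy rises above its average.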
What you'll build
Extensions to src/miniserve/:
src/miniserve/
├── BLUEPRINT.md # written at Phase 33 pre-flight, before any code
├── app.py # FastAPI app + route handlers
├── schemas.py # Pydantic request / response types
├── scheduler.py # Static + continuous batching schedulers
└── healthz.py # Liveness + readiness probes
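As a rough picture of what schemas.py and the `POST /correct` handler in app.py are responsible for, here is a framework-free sketch of the request/response contract. It uses stdlib dataclasses as a stand-in for the Pydantic types, and `parse_request`, `stub_agent`, and `handle_correct` are hypothetical names. The stub only handles one hard-coded verb and is not the Phase 32 agent.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CorrectRequest:
    sentence: str


@dataclass(frozen=True)
class CorrectResponse:
    corrected: str
    explanation: str


def parse_request(body: bytes) -> CorrectRequest:
    """Validate the JSON body; raise ValueError on a malformed payload."""
    payload = json.loads(body)  # JSONDecodeError is a ValueError subclass
    if not isinstance(payload, dict):
        raise ValueError("body must be a JSON object")
    sentence = payload.get("sentence")
    if not isinstance(sentence, str) or not sentence.strip():
        raise ValueError("'sentence' must be a non-empty string")
    return CorrectRequest(sentence=sentence)


def stub_agent(req: CorrectRequest) -> CorrectResponse:
    """Placeholder for the real agent: fixes only the verb 'goed'."""
    return CorrectResponse(
        corrected=req.sentence.replace("goed", "went"),
        explanation="The past tense of 'go' is the irregular form 'went'.",
    )


def handle_correct(body: bytes) -> bytes:
    """What a POST /correct handler does, minus the HTTP framework."""
    response = stub_agent(parse_request(body))
    return json.dumps(asdict(response)).encode("utf-8")


print(handle_correct(b'{"sentence": "He goed to school"}').decode("utf-8"))
```

In the real service, Pydantic models would do the validation step and FastAPI would map a validation failure to an HTTP error response. The split between parsing, agent call, and serialization stays the same.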
The HTTP API:
POST /correct
{"sentence": "He goed to school"}
↓
{"corrected": "He went to school",
"explanation": "The past tense of 'go' is the irregular form 'went'."}
What this phase does NOT cover
- GPU serving / multi-node. Phase 35.
- vLLM, TGI, Ray Serve as dependencies. Surveyed in lab 04; not used.
- Streaming responses (text/event-stream). Deferred to Phase 39 polish.
- Authentication / authz. Phase 37.
- PagedAttention. Phase 27 (covered there); references only here.
- Cost & capacity dashboards. Phase 34.
Read order
- theory/00-motivation.md: why HTTP serving needs more than a for loop.
- theory/01-async-and-the-event-loop.md: sync vs async handlers, the GIL.
- theory/02-static-vs-continuous-batching.md: the scheduling change.
- theory/03-littles-law-and-capacity.md: sizing the queue.
- lab/00-minimal-fastapi.md: POST /correct, curl roundtrip.
- lab/01-sync-vs-async.md: load test, fix the blocking handler.
- lab/02-static-batching.md: collect-N-then-run.
- lab/03-continuous-batching.md: iteration-level scheduler.
- lab/04-vllm-and-tgi-survey.md: what production systems add on top.
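The "blocking handler" problem that the async chapter and lab 01 deal with can be reproduced without any web framework. In the sketch below, `slow_agent_call` is a hypothetical stand-in for the agent, simulated with `time.sleep`. One handler calls it directly on the event loop; the other offloads it with `asyncio.to_thread`.

```python
import asyncio
import time


def slow_agent_call(sentence: str) -> str:
    """Hypothetical blocking agent call, simulated with a 100 ms sleep."""
    time.sleep(0.1)
    return sentence.upper()


async def blocking_handler(sentence: str) -> str:
    # Anti-pattern: the blocking call runs on the event-loop thread,
    # so no other coroutine makes progress until it returns.
    return slow_agent_call(sentence)


async def offloaded_handler(sentence: str) -> str:
    # The blocking call runs in a worker thread; the event loop stays free.
    return await asyncio.to_thread(slow_agent_call, sentence)


async def timed(handler, n: int = 4) -> float:
    """Send n concurrent requests to a handler and return the wall time."""
    start = time.perf_counter()
    await asyncio.gather(*(handler(f"req {i}") for i in range(n)))
    return time.perf_counter() - start


def compare() -> tuple[float, float]:
    """Return (blocking wall time, offloaded wall time) in seconds."""
    return asyncio.run(timed(blocking_handler)), asyncio.run(timed(offloaded_handler))


blocking_s, offloaded_s = compare()
print(f"blocking: {blocking_s:.2f}s  offloaded: {offloaded_s:.2f}s")
```

The four blocking calls run back to back, while the offloaded ones overlap. This works here because `time.sleep` releases the GIL; a pure-Python CPU-bound call would still be serialized by the GIL even in worker threads.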
solutions/ populated at phase open.
Definition of Done
See PHASE_33_PLAN.md §6 (at repo root). Briefly:
- POST /correct works end-to-end on the verb-corpus prompts.
- Continuous batching beats static batching on p95 by ≥ 30% on the standard load test.
- /healthz and /readyz implemented.
- Latency CDFs (sync vs async vs static-batched vs continuous-batched) saved.
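To see why the p95 criterion is plausible, here is a toy step-based simulation of both schedulers. It is a sketch under strong assumptions: every decode step costs one time unit regardless of batch occupancy, there is no prefill cost, a statically batched request returns only when its whole batch finishes, and the workload is made up. None of the names below come from scheduler.py.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    arrival: int  # step at which the request enters the queue
    length: int   # decode steps it needs (>= 1)


def simulate_static(requests, batch_size):
    """Collect up to batch_size waiting requests, run until the slowest ends."""
    pending = sorted(requests, key=lambda r: r.arrival)
    latencies, clock, i = [], 0, 0
    while i < len(pending):
        clock = max(clock, pending[i].arrival)  # idle until work arrives
        batch = []
        while (i < len(pending) and len(batch) < batch_size
               and pending[i].arrival <= clock):
            batch.append(pending[i])
            i += 1
        clock += max(r.length for r in batch)  # short requests wait for the longest
        latencies.extend(clock - r.arrival for r in batch)
    return latencies


def simulate_continuous(requests, batch_size):
    """Re-form the batch every step: finished requests leave, waiting ones join."""
    pending = sorted(requests, key=lambda r: r.arrival)
    latencies, clock, i = [], 0, 0
    running = []  # [request, remaining_steps] pairs
    while i < len(pending) or running:
        while (i < len(pending) and len(running) < batch_size
               and pending[i].arrival <= clock):
            running.append([pending[i], pending[i].length])
            i += 1
        if not running:
            clock = pending[i].arrival  # nothing in flight: jump to next arrival
            continue
        clock += 1  # one decode step for every request in the batch
        for slot in running:
            slot[1] -= 1
        latencies.extend(clock - req.arrival for req, left in running if left == 0)
        running = [slot for slot in running if slot[1] > 0]
    return latencies


def percentile(values, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Toy workload (made up): 20 requests at step 0, every fourth one is long.
workload = [Request(arrival=0, length=n) for n in [2, 2, 2, 20] * 5]
static = simulate_static(workload, batch_size=4)
continuous = simulate_continuous(workload, batch_size=4)
for q in (50, 95, 99):
    print(f"p{q}: static={percentile(static, q)} continuous={percentile(continuous, q)}")
```

On this toy workload p95 drops from 100 steps to 34, because short requests no longer wait behind the long request in their batch. The real gap depends on the length distribution and arrival pattern of the standard load test, so treat the numbers as illustrative only.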
Building blocks the portal will reuse (Phase 41)
Phase 41's Learner Portal (docs/phase-41-learner-portal/) is a second FastAPI app — multi-student, server-rendered, no batching — that reuses Phase 33's framework patterns. Theory chapter theory/04-portal-building-blocks.md teaches each pattern in isolation; the portal architecture (docs/phase-41-learner-portal/theory/01-architecture.md) composes them. Four concrete reuse points:
- Lifespan-managed sqlite + vault. The portal opens its SQLite engine and minivault handle inside a FastAPI lifespan async context manager, exactly as Phase 33's miniserve opens the agent. One process-wide handle, torn down in reverse order on shutdown. (§1 of theory 04.)
- Depends()-injected auth and DB session. Depends(require_student) and Depends(require_admin) gate every protected route; the request-scoped DB session comes from Depends(get_db_session). Same dependency-injection pattern Phase 33 uses for the agent handle. (§2 of theory 04.)
- ASGI middleware order with CSRF-after-session. The portal mounts the same miniserve middlewares (rate-limit, body-size, injection-filter) and adds session decoding + CSRF validation, with CSRF after session decoding, never before. (§3 of theory 04.)
- OpenAPI for admin endpoints. Phase 33 establishes the OpenAPI / /health / structured-log conventions; the portal's /admin/* routes inherit them so a future external admin client can introspect the schema. (Cross-link: theory/04-portal-building-blocks.md and the portal architecture chapter.)
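The first reuse point, process-wide handles opened at startup and torn down in reverse order, can be sketched with the standard library alone. Here `resource` is a hypothetical stand-in for the SQLite engine and vault handle, and `app` is a plain dict rather than a FastAPI instance. FastAPI accepts an async context manager of this shape through its `lifespan` parameter.

```python
import asyncio
from contextlib import AsyncExitStack, asynccontextmanager

events: list[str] = []  # records open/close order for illustration


@asynccontextmanager
async def resource(name: str):
    """Hypothetical async resource (stand-in for a DB engine or vault handle)."""
    events.append(f"open {name}")
    try:
        yield name
    finally:
        events.append(f"close {name}")


@asynccontextmanager
async def lifespan(app: dict):
    """Open process-wide handles at startup; close them in reverse at shutdown."""
    async with AsyncExitStack() as stack:
        app["db"] = await stack.enter_async_context(resource("sqlite"))
        app["vault"] = await stack.enter_async_context(resource("vault"))
        yield  # the application serves requests while suspended here


async def main() -> dict:
    app: dict = {}
    async with lifespan(app):
        events.append("serving")
    return app


asyncio.run(main())
print(events)
# ['open sqlite', 'open vault', 'serving', 'close vault', 'close sqlite']
```

`AsyncExitStack` unwinds in last-in, first-out order, which is what gives the reverse-order teardown without hand-written cleanup code. It also closes the handles that were already opened if a later one fails during startup.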
Next: theory/00-motivation.md
Further reading
Optional — enrichment, not required to pass the phase.
- 📄 Orca: A Distributed Serving System for Transformer-Based Generative Models (Yu et al., 2022). The paper that introduced continuous batching.