Phase 33 — Inference Serving: From FastAPI to Continuous Batching

Requires: 22 — KV Cache: From Math to Memory · 32 — Agents: Planning, Memory, Sandboxing (Grammar Tutor)

Teaches: serving · continuous-batching · scheduling · littles-law · load-testing

Jump to any chapter from the phase reference index.

Chapter map

Phase 32 built the tutor agent. Phase 33 puts it behind an HTTP endpoint that holds up under concurrent load using continuous batching. CPU-only, no vLLM, no Ray. The idea is to feel the cost of scheduling before delegating it.

Anchors: LYNX_CORTEX.md §4 / PHASE 33, PHASE_33_PLAN.md, LYNX_CORTEX_ADDENDUM.md §A13.

Goal

Take the Phase 32 grammar-tutor agent and put it behind a production-shaped HTTP service. Cover the path from uvicorn + FastAPI synchronous serving → async handlers → static batching → continuous (in-flight) batching.

By the end, Borja can: (a) write a minimal FastAPI service that wraps the agent; (b) explain why static batching has tail-latency problems and what continuous batching changes; (c) implement a tiny continuous-batching scheduler in Python (single-process, no vLLM dependency); (d) benchmark p50/p95/p99 latency on the verb-correction workload; (e) apply Little's law to size the queue.
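To make goal (b) concrete, here is a toy simulation of the two scheduling policies. It is a sketch under loud assumptions, not the phase's scheduler.py: every request is known at time 0, each request needs a fixed number of decode iterations, one iteration costs one time unit regardless of how many slots are busy, and a static batch releases all its responses together when its longest member finishes. The function names are hypothetical.

```python
from collections import deque


def static_batching(lengths, batch_size):
    """Finish time per request when each batch runs until its longest member ends."""
    finish = []
    clock = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        clock += max(batch)                  # short requests wait for the longest one
        finish.extend([clock] * len(batch))  # assumption: responses released together
    return finish


def continuous_batching(lengths, batch_size):
    """Finish time per request when freed slots are refilled after every iteration."""
    if any(n < 1 for n in lengths):
        raise ValueError("every request needs at least one iteration")
    waiting = deque(enumerate(lengths))      # (request id, iterations needed)
    running = {}                             # request id -> iterations remaining
    finish = [0] * len(lengths)
    clock = 0
    while waiting or running:
        # Admit waiting requests into any free slots before the next iteration.
        while waiting and len(running) < batch_size:
            rid, steps = waiting.popleft()
            running[rid] = steps
        clock += 1                           # one iteration for the whole batch
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:            # finished requests leave immediately
                finish[rid] = clock
                del running[rid]
    return finish


lengths = [2, 8, 2, 2, 2, 2]                 # one long request among short ones
static_finish = static_batching(lengths, batch_size=2)          # [8, 8, 10, 10, 12, 12]
continuous_finish = continuous_batching(lengths, batch_size=2)  # [2, 8, 4, 6, 8, 10]
```

With this workload the single long request holds its batch partner hostage under static batching, and every later batch inherits the delay. Under the continuous policy the mean finish time drops from 10 to about 6.3 time units, while the long request itself still finishes at time 8.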

What you'll build

Extensions to src/miniserve/:

src/miniserve/
├── BLUEPRINT.md       # written at Phase 33 pre-flight, before any code
├── app.py             # FastAPI app + route handlers
├── schemas.py         # Pydantic request / response types
├── scheduler.py       # Static + continuous batching schedulers
└── healthz.py         # Liveness + readiness probes

The HTTP API:

POST /correct
{"sentence": "He goed to school"}

{"corrected": "He went to school",
 "explanation": "The past tense of 'go' is the irregular form 'went'."}

What this phase does NOT cover

  • GPU serving / multi-node. Phase 35.
  • vLLM, TGI, Ray Serve as dependencies. Surveyed in lab 04; not used.
  • Streaming responses (text/event-stream). Deferred to Phase 39 polish.
  • Authentication / authz. Phase 37.
  • PagedAttention. Phase 27 (covered there); references only here.
  • Cost & capacity dashboards. Phase 34.

Read order

  1. theory/00-motivation.md — why HTTP serving needs more than a for loop.
  2. theory/01-async-and-the-event-loop.md — sync vs async handlers, the GIL.
  3. theory/02-static-vs-continuous-batching.md — the scheduling change.
  4. theory/03-littles-law-and-capacity.md — sizing the queue.
  5. lab/00-minimal-fastapi.md — POST /correct, curl roundtrip.
  6. lab/01-sync-vs-async.md — load test, fix the blocking handler.
  7. lab/02-static-batching.md — collect-N-then-run.
  8. lab/03-continuous-batching.md — iteration-level scheduler.
  9. lab/04-vllm-and-tgi-survey.md — what production systems add on top.
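As a preview of the capacity chapter, Little's law says that the mean number of requests in the system equals the arrival rate times the mean time each request spends in the system (L = λW). The sketch below turns that into a rough queue size. The function names, the example numbers, and the headroom factor are illustrative assumptions; the law gives long-run averages only, so the headroom is a safety margin for bursts, not something the law itself provides.

```python
import math


def mean_in_flight(arrival_rate_per_s: float, mean_latency_s: float) -> float:
    """Little's law: L = lambda * W (long-run average, stable system)."""
    return arrival_rate_per_s * mean_latency_s


def queue_slots(arrival_rate_per_s: float, mean_latency_s: float,
                batch_slots: int, headroom: float = 2.0) -> int:
    """Rough queue size: average requests not in a batch slot, times a margin.

    `headroom` is a hypothetical safety factor, not part of Little's law.
    """
    in_flight = mean_in_flight(arrival_rate_per_s, mean_latency_s)
    waiting = max(0.0, in_flight - batch_slots)
    return math.ceil(waiting * headroom)


# Made-up example: 20 requests/s, 0.5 s mean time in system, 4 batch slots.
average_load = mean_in_flight(20, 0.5)          # 10.0 requests in the system
suggested_queue = queue_slots(20, 0.5, 4)       # ceil((10 - 4) * 2.0) = 12
```

The useful habit is the direction of reasoning: measure the arrival rate and the mean latency under load, then derive the queue size, rather than picking a queue size and hoping.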

solutions/ populated at phase open.

Definition of Done

See PHASE_33_PLAN.md §6 (at repo root). Briefly:

  • POST /correct works end-to-end on the verb-corpus prompts.
  • Continuous batching beats static batching on p95 by ≥ 30% on the standard load test.
  • /healthz and /readyz implemented.
  • Latency CDFs (sync vs async vs static-batched vs continuous-batched) saved.
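The p95 criterion can be checked mechanically once the load test has produced per-request latencies. The sketch below uses the nearest-rank percentile definition and reads "beats by ≥ 30%" as a relative reduction in p95; both choices are assumptions, and PHASE_33_PLAN.md §6 remains the authority on the exact definition. The latency lists are synthetic placeholders, not measurements.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of values <= it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)   # 1-based rank
    return ordered[max(rank, 1) - 1]


def p95_improvement(static_ms, continuous_ms):
    """Relative p95 reduction of continuous over static batching (0.30 means 30%)."""
    static_p95 = percentile(static_ms, 95)
    continuous_p95 = percentile(continuous_ms, 95)
    return (static_p95 - continuous_p95) / static_p95


# Synthetic latencies in milliseconds, for illustration only.
static_ms = [10 * i for i in range(1, 101)]       # p95 = 950
continuous_ms = [6 * i for i in range(1, 101)]    # p95 = 570

summary = {p: percentile(continuous_ms, p) for p in (50, 95, 99)}
meets_target = p95_improvement(static_ms, continuous_ms) >= 0.30
```

Sorting the same samples and plotting rank against latency gives the CDFs the last bullet asks for.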

Building blocks the portal will reuse (Phase 41)

Phase 41's Learner Portal (docs/phase-41-learner-portal/) is a second FastAPI app — multi-student, server-rendered, no batching — that reuses Phase 33's framework patterns. Theory chapter theory/04-portal-building-blocks.md teaches each pattern in isolation; the portal architecture (docs/phase-41-learner-portal/theory/01-architecture.md) composes them. Four concrete reuse points:

  1. Lifespan-managed sqlite + vault. The portal opens its SQLite engine and minivault handle inside a FastAPI lifespan async context manager, exactly as Phase 33's miniserve opens the agent. One process-wide handle, torn down in reverse order on shutdown. (§1 of theory 04.)
  2. Depends()-injected auth and DB session. Depends(require_student) and Depends(require_admin) gate every protected route; the request-scoped DB session comes from Depends(get_db_session). Same dependency-injection pattern Phase 33 uses for the agent handle. (§2 of theory 04.)
  3. ASGI middleware order with CSRF-after-session. The portal mounts the same miniserve middlewares (rate-limit, body-size, injection-filter) and adds session decoding + CSRF validation — with CSRF after session decoding, never before. (§3 of theory 04.)
  4. OpenAPI for admin endpoints. Phase 33 establishes the OpenAPI / /health / structured-log conventions; the portal's /admin/* routes inherit them so a future external admin client can introspect the schema. (See theory/04-portal-building-blocks.md and the portal architecture chapter.)

Next: theory/00-motivation.md

Further reading

Optional — enrichment, not required to pass the phase.