Lab 04 — Survey: vLLM, TGI, and what production adds
You have implemented a mini continuous batcher. Production systems (vLLM, TGI, Triton, Ray Serve) do the same thing at a different scale. This lab is reading, not code: identify what they add, why, and what you leave on the table by using them.
Objective
A no-code lab. Read the documentation and key papers of three production inference servers, identify what they add beyond the lab-03 scheduler, and write a 2-page comparison. This builds vocabulary you'll need in Phases 34 (observability), 35 (distributed), and 39 (capstone).
Required reading
- vLLM: PagedAttention paper. Kwon et al. 2023, "Efficient Memory Management for Large Language Model Serving with PagedAttention." Read the abstract, §1 (motivation), §3 (PagedAttention), §4 (scheduling). Skip the kernel internals on first pass.
- TGI (Hugging Face Text Generation Inference): README and architecture doc. Read the GitHub README and the docs page on "How TGI works internally" (or equivalent at time of reading).
- NVIDIA Triton Inference Server: Concepts. Read the "Inference Request Lifecycle" and "Dynamic Batching" pages. Triton supports many model types; focus on LLM serving.
- (Optional) Ray Serve / Anyscale Endpoints documentation for the orchestration layer.
Tasks
- Build a feature matrix. Make a markdown table with columns for each system (Your lab-03 / vLLM / TGI / Triton). Rows:
| Feature | Your lab-03 | vLLM | TGI | Triton |
|---|---|---|---|---|
| Continuous batching | ✅ | ✅ | ✅ | partial (dynamic batching; see Pitfalls) |
| PagedAttention KV cache | ❌ | ✅ | partial | varies |
| Speculative decoding | ❌ | ✅ | ✅ | varies |
| Tensor parallel | ❌ | ✅ | ✅ | ✅ |
| Prefix caching | ❌ | ✅ | partial | ❌ |
| Streaming output | ❌ | ✅ | ✅ | ✅ |
| Multi-model hosting | ❌ | ❌ | ❌ | ✅ |
| Native HTTP API | ✅ (FastAPI) | ✅ (OpenAI-compatible) | ✅ | gRPC/HTTP |
| Adapter (LoRA) hot-swap | ❌ | ✅ | ✅ | partial |
| Per-request log probs | ✅ (trivial) | ✅ | ✅ | varies |
| Open source | n/a | Apache 2.0 | Apache 2.0 | BSD-3 |
Add 3-5 more rows of your own based on the reading.
- Identify the one thing each system uniquely adds.
  - vLLM: PagedAttention, which solves the KV-cache memory fragmentation problem at scale.
  - TGI: tight HuggingFace integration + Rust performance for the scheduler.
  - Triton: backend-agnostic; the same server can host ONNX, TensorRT, PyTorch, custom Python.

  For each, write 3-5 sentences in plain English.
- Identify what you'd lose by switching. If you replaced your lab-03 scheduler with vLLM tomorrow, what would you stop being able to do? Examples:
  - Custom agent loop (vLLM is built for completion-style APIs, not agent-with-tools).
  - Custom sampling strategies (vLLM has its own; integrating yours is non-trivial).
  - Visibility into the scheduler (vLLM's internals are abstracted away; that is fine for production, but a wall for learning).
- Identify what you'd gain. Same exercise from the other direction:
  - PagedAttention → 2-4× more concurrent requests at the same memory.
  - Prefix caching → ~1.5-3× speedup for chat workloads with shared prefixes.
  - Production hardening (rate limiting, metrics, multi-replica) without writing it yourself.
- Decision criterion. Write 3-5 sentences: "Lynx Cortex's grammar tutor would use [X] in production because [reasons]." There's no single right answer. The point is to articulate the trade-off.
- Forward references. List the phases that build on this:
  - Phase 34 (observability): vLLM emits Prometheus metrics; you'll consume them.
  - Phase 35 (distributed): tensor parallel arrives in vLLM/TGI.
  - Phase 39 (capstone): the production grammar tutor; you'll likely run vLLM or TGI under the hood.
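To make the PagedAttention gain from the "what you'd gain" task concrete, here is a back-of-the-envelope sketch comparing contiguous per-request KV reservation with block-granular allocation. The model dimensions, the 8 GiB budget, the 16-token block size, and the request lengths are all illustrative assumptions, not measurements from vLLM or any real deployment.

```python
import math

def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    # K and V each hold n_kv_heads * head_dim values per layer.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

def contiguous_slots(budget_tokens, max_len):
    # Contiguous scheme: reserve max_len tokens for every request up front.
    return budget_tokens // max_len

def paged_slots(budget_tokens, actual_lens, block_size=16):
    # Block scheme: hand out fixed-size blocks on demand, so the waste
    # per request is at most block_size - 1 tokens.
    free_blocks = budget_tokens // block_size
    admitted = 0
    for n in actual_lens:
        need = math.ceil(n / block_size)
        if need > free_blocks:
            break
        free_blocks -= need
        admitted += 1
    return admitted

# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, fp16.
per_token = kv_bytes_per_token(32, 8, 128)
budget_tokens = (8 * 2**30) // per_token           # assumed 8 GiB KV budget

contiguous = contiguous_slots(budget_tokens, max_len=2048)
paged = paged_slots(budget_tokens, actual_lens=[500] * 200)
print(f"contiguous={contiguous} paged={paged}")
```

With these made-up numbers the contiguous scheme admits 32 requests and the block scheme admits 128. The ratio depends entirely on how far actual sequence lengths fall below the reserved maximum, so treat it as a way to reason about the claim in the paper, not as a reproduction of it.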
Deliverables
Save to experiments/<date>-phase-33-lab-04/:
- `survey.md`: the 2-page write-up containing the feature matrix and your discussion.
- `decision.md`: your decision criterion with reasoning.
Acceptance
- Survey is factually correct (no hallucinated features). Cite sources by URL or paper title.
- Feature matrix has at least 12 rows (the 11 above + at least one of your own).
- Decision criterion is specific to the §A13 verb-tutor workload, not generic.
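The row-count criterion above can be checked mechanically. The following is a minimal sketch, assuming the feature matrix is the first pipe table in `survey.md`; the helper name `count_table_rows` is hypothetical and not part of any lab tooling.

```python
def count_table_rows(markdown):
    """Count data rows (excluding header and separator) in the first pipe table."""
    rows = 0
    in_table = False
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("|"):
            cells = [c.strip() for c in stripped.strip("|").split("|")]
            # A separator row looks like |---|:--:|---| and carries no data.
            is_separator = all(c and set(c) <= set("-:") for c in cells)
            if not in_table:
                in_table = True          # first table line is the header
            elif not is_separator:
                rows += 1
        elif in_table:
            break                        # first non-table line ends the table
    return rows

sample = """\
| Feature | Your lab-03 | vLLM |
|---|---|---|
| Continuous batching | yes | yes |
| Prefix caching | no | yes |

Some prose after the table.
"""
n_rows = count_table_rows(sample)
print(n_rows)
```

To use it on the deliverable, read the file with `open("survey.md", encoding="utf-8").read()` and compare the result against the threshold of 12.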
Pitfalls
- Confusing dynamic batching (Triton's term) with continuous batching. Dynamic batching is what you built in lab 02 (static batching with a flexible deadline). Continuous batching is what you built in lab 03 (iteration-level scheduling). Triton's "dynamic batching" is closer to lab 02; vLLM and TGI implement true continuous batching.
- Reading too deep. This is a 2-3 hour reading lab. Don't get sucked into the PagedAttention kernel internals — that's Phase 27.
- Conflating frameworks. vLLM, TGI, Triton, Ray Serve, KServe, BentoML, Modal — they're not all the same layer. vLLM/TGI are engines; Triton/Ray Serve are orchestrators; KServe is Kubernetes glue. Be precise.
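The distinction in the first pitfall is easier to hold onto with a toy model. The sketch below counts decode steps for the same set of requests under request-level batching (a batch runs until its longest member finishes) and under iteration-level scheduling (slots are refilled after every step). It is a deliberate simplification: all requests arrive at time zero, prefill is ignored, and the queue deadline that Triton's dynamic batcher uses is not modeled.

```python
from collections import deque

def batch_level_steps(lengths, capacity):
    # Request-level batching: each batch is formed once and occupies the
    # device until its longest request is done; short requests leave idle slots.
    steps = 0
    for i in range(0, len(lengths), capacity):
        steps += max(lengths[i:i + capacity])
    return steps

def iteration_level_steps(lengths, capacity):
    # Iteration-level scheduling: after every decode step, finished requests
    # release their slot and waiting requests are admitted immediately.
    waiting = deque(lengths)
    running = []
    steps = 0
    while waiting or running:
        while waiting and len(running) < capacity:
            running.append(waiting.popleft())
        steps += 1
        # Each running request emits one token; drop those that just finished.
        running = [r - 1 for r in running if r > 1]
    return steps

lengths = [1, 4, 1, 4]   # tokens to generate per request (illustrative)
static_steps = batch_level_steps(lengths, capacity=2)
continuous_steps = iteration_level_steps(lengths, capacity=2)
print(f"batch-level={static_steps} iteration-level={continuous_steps}")
```

The gap widens as output lengths become more uneven, which is the typical situation for LLM generation and the reason the two terms should not be treated as synonyms in your survey.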
Stretch
- Run vLLM locally (CPU-only, on a small model from HF Hub if compatible) and serve a single request. Compare its `curl` interface to your lab-03 service.
- Compare your lab-03 latency on the verb workload to vLLM's (if compatible). Don't expect to win; vLLM is a multi-year engineering investment. The exercise is to see the magnitude of the gap.
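For the first stretch item, a Python equivalent of the `curl` call can make the interface comparison easier to script. This is a sketch under assumptions: it targets the OpenAI-compatible `/v1/completions` route, assumes the server listens on `http://localhost:8000`, and uses a placeholder model id (`your-model-id`) that you must replace with whatever model you actually launched. The network call is left commented out so the block runs without a server.

```python
import json
import urllib.request

def build_completion_request(prompt, model,
                             base_url="http://localhost:8000",
                             max_tokens=32):
    # Build (but do not send) a POST request for an OpenAI-style completion.
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_completion_request(
    "Conjugate 'to be' in the past tense:",
    model="your-model-id",   # placeholder, not a real model name
)
print(req.get_method(), req.full_url)

# To send it against a running server:
# with urllib.request.urlopen(req) as resp:
#     body = json.load(resp)
#     print(body["choices"][0]["text"])
```

Pointing `base_url` at your lab-03 FastAPI service (with its own route and payload shape) gives you a like-for-like harness for the latency comparison in the second stretch item.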
End of Phase 33 labs. Time to write PHASE_33_REPORT.md and reflect.