02 — Systems Design for LLMs: 5 Whiteboard Prompts¶
Five systems-design prompts for LLMs in production. Each comes with a whiteboard, capacity math (Little's law, GPU budget), failure modes, and cost discipline (CpQU from Phase 38).
How to attack a systems-design prompt¶
- Clarify before drawing. "10M DAU" — what is the latency target? P50 or P99? Streaming or batch? What is the budget? "Don't know" is a valid answer from the interviewer; pin assumptions and proceed.
- Capacity first. Tokens/sec needed → GPU count → memory needed → cost per request. Most candidates dive into K8s diagrams without this math; that is the failure signal.
- Failure modes second. Enumerate at least 4: model OOM, queue saturation, hot-shard, network partition, quota exhaustion. Lab interviewers love this.
- Cost discipline last. Pick the cheapest design that meets the SLA. "Cost per quality unit" (CpQU, Phase 38) is the framing.
Prompt 1 — Design a chatbot serving 10M DAU¶
Clarifying questions (ask these out loud)¶
- DAU vs MAU? Assume 10M DAU.
- Avg conversations per user per day? Assume 5.
- Avg turns per conversation? Assume 8.
- Avg tokens per turn? Input 200, output 300.
- Latency SLA? P50 < 1s to first token, P99 < 3s; full response < 15s.
- Model size? Assume 70B dense.
Capacity math¶
- Total turns/day: 10M * 5 * 8 = 400M turns/day.
- Peak QPS (assume 3x average for daily peak): 400M / 86400 * 3 ≈ 14k QPS.
- Tokens per turn: 200 in + 300 out = 500 total. Output tokens dominate cost (decode is sequential, prefill is parallel).
- Total output tokens/sec at peak: 14k * 300 = 4.2M out tokens/sec.
- A 70B model in fp16 (~140 GB of weights) does not fit on one 80 GB H100, so each replica runs tensor-parallel. Single-stream decode is roughly 30-50 tokens/sec; with continuous batching (vLLM, batch ~32 effective) you get 1000-2000 tokens/sec/GPU.
- GPUs needed for decode: 4.2M / 1500 ≈ 2800 H100s. Round up for headroom → 3500 GPUs.
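The napkin math above can be written as a small calculator. This is a minimal sketch: the function name and parameters are hypothetical, and the inputs are the assumptions stated in the clarifying questions, not measured values.

```python
import math

SECONDS_PER_DAY = 86_400

def decode_gpus_needed(dau, convs_per_user, turns_per_conv,
                       out_tokens_per_turn, peak_factor, tokens_per_sec_per_gpu):
    """Return (peak_qps, peak_output_tokens_per_sec, gpu_count) for the decode fleet."""
    turns_per_day = dau * convs_per_user * turns_per_conv
    peak_qps = turns_per_day / SECONDS_PER_DAY * peak_factor
    peak_out_tps = peak_qps * out_tokens_per_turn
    gpus = math.ceil(peak_out_tps / tokens_per_sec_per_gpu)
    return peak_qps, peak_out_tps, gpus

# Inputs taken from the assumptions in this prompt.
qps, out_tps, gpus = decode_gpus_needed(
    dau=10_000_000, convs_per_user=5, turns_per_conv=8,
    out_tokens_per_turn=300, peak_factor=3, tokens_per_sec_per_gpu=1500,
)
print(f"peak QPS ~{qps:,.0f}; output tokens/sec ~{out_tps:,.0f}; GPUs before headroom ~{gpus:,}")
```

Without intermediate rounding the result lands slightly under 2800 GPUs; the whiteboard version rounds QPS up to 14k first, which is the safer direction.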
Whiteboard¶
Client                       Region B (warm standby)
|                            ^
v                            |
[Edge: TLS, auth, abuse]     [Region failover, BGP/DNS]
|
v
[Load balancer: L7, header-aware] -- routes by user_id sticky for KV-cache reuse
|
v
[API gateway: rate-limit, billing meter]
|
v
[Conversation service]
| \
| \--> [KV-cache reuse store: per-user warm prefix on Redis-tier]
v
[LLM inference fleet: vLLM, continuous batching, paged attention]
| | |
GPU GPU GPU ... (sharded by model replica, ~440 replicas of TP=8 70B for 3500 GPUs)
|
v
[Streaming: SSE / WebSocket back to client]
[Async sink: logs, prompt/response store, eval sampling]
Failure modes¶
- GPU OOM on long prompt: mitigate with max-context enforcement at gateway; reject 422 before scheduling.
- Hot shard (one user spamming): rate-limit at gateway; backpressure; per-tenant queue isolation.
- Cold start: keep N+2 warm replicas per region; pre-warm KV cache for top 1% of users.
- Queue saturation: Little's law, L = lambda * W. If avg request takes 8s and we have 1000 in-flight slots, max sustained QPS is 125 per replica. Plan with peaks.
- Cascading retry storms: exponential backoff with jitter; circuit breaker at gateway.
- NCCL deadlock during tensor-parallel inference: watchdog + replica restart.
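The Little's law bullet above can be checked in both directions: slots to throughput, and target throughput to slots. A minimal sketch; the helper names are hypothetical, and the 8s latency and 1000 slots are the figures from the bullet.

```python
import math

def max_sustained_qps(in_flight_slots: int, avg_latency_s: float) -> float:
    """Little's law: L = lambda * W, so lambda = L / W."""
    if avg_latency_s <= 0:
        raise ValueError("avg_latency_s must be positive")
    return in_flight_slots / avg_latency_s

def slots_needed(target_qps: float, avg_latency_s: float) -> int:
    """Invert Little's law: in-flight slots required to sustain target_qps."""
    return math.ceil(target_qps * avg_latency_s)

print(max_sustained_qps(1000, 8.0))   # throughput ceiling for 1000 slots
print(slots_needed(14_000, 8.0))      # slots needed fleet-wide at the peak QPS above
```

If average latency grows under load, the same slot count sustains less QPS, which is why queue saturation tends to feed on itself.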
Cost discipline (CpQU)¶
- 3500 H100s × $2/hr × 24h × 30d ≈ $5M/month at on-demand pricing.
- Optimization levers: (1) quantize to int8 / fp8 → roughly 2x throughput, ~$2.5M/month, small quality hit; (2) GQA / MQA model → smaller KV cache → larger batches; (3) speculative decoding with a draft model → up to 2-3x faster decode, mainly at small batch sizes; (4) prompt caching (Anthropic-style) → recycle prefix compute for warm users.
- CpQU framing: measure quality (e.g. preference rate vs reference) per dollar; pick the design that maximizes quality/dollar at the SLA.
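A minimal sketch of the cost comparison follows. The CpQU definition used here (monthly dollars divided by a quality score) is an assumption for illustration, since Phase 38 may define it differently, and the two quality scores are made-up placeholders rather than measurements.

```python
HOURS_PER_MONTH = 24 * 30

def monthly_gpu_cost(gpus: int, usd_per_gpu_hour: float) -> float:
    """Fleet cost for one 30-day month at a flat hourly rate."""
    return gpus * usd_per_gpu_hour * HOURS_PER_MONTH

def cost_per_quality_unit(monthly_cost_usd: float, quality_score: float) -> float:
    """Hypothetical CpQU: dollars spent per unit of quality score (lower is better)."""
    if quality_score <= 0:
        raise ValueError("quality_score must be positive")
    return monthly_cost_usd / quality_score

# fp16 baseline vs fp8 at ~2x throughput (half the GPUs).
# Quality scores are placeholders, e.g. preference rate vs a reference model.
designs = {
    "fp16": (monthly_gpu_cost(3500, 2.0), 0.50),
    "fp8": (monthly_gpu_cost(1750, 2.0), 0.49),
}
for name, (cost, quality) in designs.items():
    print(f"{name}: ${cost:,.0f}/month, CpQU ${cost_per_quality_unit(cost, quality):,.0f}")
```

Under these placeholder scores the quantized design wins easily; the comparison only gets interesting when the quality drop is large enough to offset the halved cost.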
→ Phase 33, Phase 34, drill 12.
Prompt 2 — Design a coding-assistant agent with tool use¶
Clarifying questions¶
- Tools available? Assume: filesystem read/write, shell exec (sandboxed), web search, code interpreter.
- Concurrency per user? Assume 1 active agent per user (no parallel sub-agents at v1).
- Latency? P50 < 30s end-to-end (multi-step tasks); streaming visible.
- Safety? Sandbox must contain shell exec; no arbitrary network access.
Capacity math¶
- Each agent turn: 1-5 tool calls; each tool call adds ~500 input tokens (tool spec + result) and 200 output tokens.
- A complex task: 10 turns × 4 tool calls × 700 tokens = ~28k tokens of LLM work per task.
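- If a task takes 30s and ~28k tokens, that is ~930 tokens/sec per active agent. Most of it is prefill: ~20k input tokens are processed in parallel, while the ~8k output tokens (~270 tokens/sec) are decoded sequentially. That decode rate is well above the 30-50 tokens/sec single-stream figure for a 70B model in Prompt 1, so the 30s P50 is realistic only for simpler tasks, a smaller model, or speculative decoding.
- For 100k concurrent active agents: 100k streams. If a vLLM instance handles 32 concurrent streams: 100k / 32 ≈ 3100 vLLM instances needed at peak.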
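Splitting the per-task token count into prefill and decode makes the latency budget easier to reason about. A minimal sketch using the numbers from the bullets above; the function name is hypothetical.

```python
import math

def agent_task_tokens(turns, tool_calls_per_turn, in_tokens_per_call, out_tokens_per_call):
    """Return (input_tokens, output_tokens) of LLM work for one task."""
    calls = turns * tool_calls_per_turn
    return calls * in_tokens_per_call, calls * out_tokens_per_call

in_tok, out_tok = agent_task_tokens(
    turns=10, tool_calls_per_turn=4, in_tokens_per_call=500, out_tokens_per_call=200)
task_seconds = 30
decode_rate_needed = out_tok / task_seconds      # sequential output tokens/sec
instances = math.ceil(100_000 / 32)              # 100k streams, 32 streams per instance

print(f"prefill {in_tok:,} tokens, decode {out_tok:,} tokens")
print(f"decode rate needed ~{decode_rate_needed:.0f} tokens/sec, instances ~{instances:,}")
```

The instance count ignores time spent waiting on tools; an agent blocked on a shell command holds no decode slot, so real utilization depends on the tool-to-LLM time ratio.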
Whiteboard¶
Client (IDE)
|
v
[Agent orchestrator]
|
v
[Planner LLM] <------+
|                    |   (loop: think -> act -> observe)
v                    |
[Tool router]        |
/    |    \          |
fs  shell  web       |   [Trace store: every tool call, every model turn]
|    |     |         |
v    v     v         |
[Sandbox VM per session: ephemeral, network-isolated, FS-jailed]
|                    |
+--------------------+
Failure modes¶
- Infinite loop / runaway: hard cap on tool calls per task (e.g. 50); cap on total tokens; cap on wall time.
- Sandbox escape: seccomp + user namespace + no host volume mounts; gVisor or Firecracker.
- Tool flakes (HTTP 503): retry with backoff; agent should observe failure and adapt — not retry forever.
- Adversarial tool input (prompt injection from web search): never trust tool outputs as instructions; mark provenance; use structured outputs.
- Cost spike from a single user: per-user token budget; alert at 80%.
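The runaway caps in the first bullet can be enforced by one small object that the orchestrator charges on every step. A minimal sketch: `TaskBudget` and `BudgetExceeded` are hypothetical names, and the token and wall-time defaults are assumptions (only the 50-tool-call cap comes from the text).

```python
import time
from dataclasses import dataclass, field

class BudgetExceeded(RuntimeError):
    """Raised when an agent task crosses one of its hard caps."""

@dataclass
class TaskBudget:
    max_tool_calls: int = 50          # cap from the text
    max_tokens: int = 200_000         # assumed default
    max_wall_seconds: float = 300.0   # assumed default
    tool_calls: int = 0
    tokens: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def charge(self, tool_calls: int = 0, tokens: int = 0) -> None:
        """Record usage, then fail fast if any cap is exceeded."""
        self.tool_calls += tool_calls
        self.tokens += tokens
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded("tool-call cap exceeded")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded("token cap exceeded")
        if time.monotonic() - self.started_at > self.max_wall_seconds:
            raise BudgetExceeded("wall-time cap exceeded")

budget = TaskBudget()
budget.charge(tool_calls=1, tokens=700)   # one think -> act -> observe step
```

Calling `charge` before dispatching each tool call means the loop stops at the cap instead of one step past it.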
Cost discipline¶
- Most agent tasks are dominated by repeated LLM context (system prompt + tool spec + history grows each turn).
- Lever: prompt caching — cache the static prefix (system + tools); only re-prefill the dynamic suffix. Anthropic and OpenAI both expose this.
- Lever: smaller "router" model for tool selection; route to large model only for "think" steps.
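To see why caching the static prefix matters, count the prefill tokens over a whole task. A minimal sketch under simplifying assumptions: the cache never expires, only the static prefix is cached, and the history grows by a fixed amount each turn. The 3000-token prefix is an assumed figure.

```python
def prefill_tokens(turns: int, static_prefix: int, delta_per_turn: int,
                   cache_static_prefix: bool) -> int:
    """Total tokens prefilled across a task with a growing conversation history."""
    total = 0
    for turn in range(1, turns + 1):
        dynamic = turn * delta_per_turn            # history accumulated so far
        pays_static = (turn == 1) or not cache_static_prefix
        total += dynamic + (static_prefix if pays_static else 0)
    return total

uncached = prefill_tokens(10, static_prefix=3000, delta_per_turn=700, cache_static_prefix=False)
cached = prefill_tokens(10, static_prefix=3000, delta_per_turn=700, cache_static_prefix=True)
print(f"uncached {uncached:,} vs cached {cached:,} prefill tokens")
```

Under these assumptions the dynamic history soon dominates, so caching that also covers the already-seen history would save more than caching the static prefix alone.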
→ Phase 31, Phase 32, Phase 37.
Prompt 3 — Design a multi-tenant fine-tuning service¶
Clarifying questions¶
- Base models? Assume 3 sizes: 7B, 13B, 70B.
- Fine-tuning method? LoRA only at v1 (cheap, isolatable).
- Customer scale? 10k customers, average 100k training examples each.
- SLA on training start? < 5 min from job submission.
Capacity math¶
- LoRA training on 7B model with 100k examples (batch 8, seq 2048, 3 epochs): roughly 2 H100-hours.
- 70B LoRA on 100k examples: ~20 H100-hours.
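- If 100 jobs/hour are submitted on average and the mix is dominated by 7B jobs (about 2 H100-hours per job): ~200 H100-hours/hour → 200 H100s continuously busy. A larger share of 70B jobs raises this quickly.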
- Plus inference fleet for serving the fine-tunes: see prompt 1 with adapter swap.
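The fleet size depends heavily on the job mix, so it helps to make the mix an explicit input. A minimal sketch: the 13B cost (4 H100-hours) and the 80/15/5 split are assumptions for illustration; only the 7B and 70B figures come from the text.

```python
def training_gpus_needed(jobs_per_hour: float, mix: dict) -> float:
    """mix maps model size -> (percent_of_jobs, h100_hours_per_job).

    Returns H100-hours of work arriving per hour, which equals the number
    of H100s kept continuously busy in steady state.
    """
    if sum(pct for pct, _ in mix.values()) != 100:
        raise ValueError("percentages must sum to 100")
    weighted_hours = sum(pct * hours for pct, hours in mix.values())
    return jobs_per_hour * weighted_hours / 100

all_7b = {"7B": (100, 2)}
mixed = {"7B": (80, 2), "13B": (15, 4), "70B": (5, 20)}   # assumed split

print(training_gpus_needed(100, all_7b))
print(training_gpus_needed(100, mixed))
```

Even a 5% share of 70B jobs adds about half again to the fleet in this example, which is a useful number to have when tier pricing comes up.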
Whiteboard¶
Customer dashboard
|
v
[API: dataset upload, job submit]
|
v
[Object store: encrypted, per-tenant KMS key]
|
v
[Job queue: priority by tier; per-tenant fairness]
|
v
[Training scheduler]
|
v
[Trainer pods: vLLM-trainer / nanotron / Axolotl]
| (LoRA adapter saved per tenant)
v
[Adapter store: tenant_id -> base_model -> adapter blob]
|
v
[Inference fleet: vLLM + multi-LoRA serving (load N adapters on one base)]
|
v
Customer hits inference endpoint with model_id = (base, adapter_id)
Failure modes¶
- Data contamination across tenants: strict isolation; per-tenant KMS; audit logging.
- Bad data → divergent training: validate before scheduling; surface NaN-loss to the customer with a useful error.
- Adapter hot-loading at inference: vLLM multi-LoRA can serve hundreds of adapters on one base; cap and LRU-evict.
- Resource exhaustion (one tenant monopolizes the queue): per-tenant quota; weighted-fair-queue scheduler.
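The "cap and LRU-evict" policy for adapters can be sketched with an ordered dictionary. This is an illustration of the eviction logic only, not vLLM's own implementation; `AdapterCache` and `load_fn` are hypothetical names.

```python
from collections import OrderedDict
from typing import Any, Callable

class AdapterCache:
    """Keep at most `capacity` LoRA adapters resident; evict the least recently used."""

    def __init__(self, capacity: int, load_fn: Callable[[str], Any]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._load = load_fn                 # e.g. fetch the blob from the adapter store
        self._cache: OrderedDict = OrderedDict()

    def get(self, adapter_id: str) -> Any:
        if adapter_id in self._cache:
            self._cache.move_to_end(adapter_id)      # mark as most recently used
            return self._cache[adapter_id]
        adapter = self._load(adapter_id)             # cache miss: load from the store
        self._cache[adapter_id] = adapter
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)          # drop the least recently used
        return adapter

    def resident(self) -> list:
        return list(self._cache)

cache = AdapterCache(capacity=2, load_fn=lambda adapter_id: f"weights:{adapter_id}")
for adapter_id in ["a", "b", "a", "c"]:
    cache.get(adapter_id)
print(cache.resident())
```

A production version would also pin adapters that have in-flight requests so that eviction never pulls weights out from under a running batch.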
Cost discipline¶
- LoRA on 7B is ~$10 of compute; price at $50 for a 5x markup. 70B is ~$200; price at $500 (2.5x).
- Use spot instances for training jobs that can checkpoint cheaply (LoRA: yes, checkpoints are tiny).
→ Phase 28, Phase 38.
Prompt 4 — Design a low-latency inference service for code completion (Copilot-style)¶
Clarifying questions¶
- P99 latency? < 200ms to first character (roughly the bar users expect from Copilot-style tools).
- Model size? Assume 7B specialized code model.
- Avg suggestion length? 30 tokens (suggestions are short).
- Cancellation rate? ~70% — users keep typing and cancel.
Capacity math¶
- 30 output tokens × ~30ms/token (single-stream H100, 7B) = 900ms — too slow.
- Need: large batching to share kernel launch overhead, and speculative decoding for low single-stream latency.
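- With speculative decoding (draft model 1B, target 7B, acceptance rate 0.7), effective tokens/sec/stream improves by roughly 2-3x; at an optimistic ~120 tokens/sec, 30 tokens take 250ms. The 200ms target applies to the first character, so the rest of the suggestion can stream in after it. Continuous batching raises fleet throughput but does not shorten a single stream.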
- Cancellation kills 70% of decodes mid-flight — preempt aggressively; don't waste compute on cancelled streams.
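A rough way to sanity-check speculative-decoding estimates is the standard simplified model in which each drafted token is accepted independently with probability alpha. A minimal sketch under that assumption; the draft length of 4 and the relative draft cost of 0.15 are assumed values, not figures from the text.

```python
def expected_tokens_per_step(alpha: float, draft_len: int) -> float:
    """Expected tokens produced per target-model pass: (1 - alpha^(k+1)) / (1 - alpha)."""
    if not 0 <= alpha < 1:
        raise ValueError("alpha must be in [0, 1)")
    return (1 - alpha ** (draft_len + 1)) / (1 - alpha)

def expected_speedup(alpha: float, draft_len: int, draft_cost_ratio: float) -> float:
    """Speedup over plain decoding when one draft step costs draft_cost_ratio target steps."""
    step_cost = draft_len * draft_cost_ratio + 1
    return expected_tokens_per_step(alpha, draft_len) / step_cost

alpha, draft_len, draft_cost_ratio = 0.7, 4, 0.15   # last two are assumptions
tokens = expected_tokens_per_step(alpha, draft_len)
speedup = expected_speedup(alpha, draft_len, draft_cost_ratio)
print(f"tokens per target pass ~{tokens:.2f}, speedup ~{speedup:.2f}x")
```

Under this model an acceptance rate of 0.7 caps the gain at 1 / (1 - 0.7) ≈ 3.3x even with a free draft model, so estimates near 120 tokens/sec sit at the optimistic end unless acceptance is higher in practice.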
Whiteboard¶
IDE plugin -- debounces keystrokes (150ms)
|
v
[Edge gateway: cancellation-aware]
|
v
[Inference fleet: vLLM with speculative decoding]
|
v
[Result streamed token-by-token; client cancels on next keystroke]
|
+--> [Cancellation propagates to scheduler immediately; abort decode]
Failure modes¶
- Cancellation race: a new keystroke arrives while the old request is mid-flight; mitigate with a per-request cancellation token and an idempotent client that discards stale responses.
- Cache pollution: same prefix typed by many users — share KV cache across users iff prompt does not contain PII (security gate).
- Quality regression after model update: shadow traffic; A/B with offline eval on internal repos.
- Privacy: no logging of user code by default; opt-in for training data.
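Cancellation propagation can be illustrated with `asyncio`, where cancelling the task stops the decode loop at its next await point. A minimal sketch: `decode` is a stand-in that sleeps instead of running a model, and all names are hypothetical.

```python
import asyncio

async def decode(n_tokens: int, emitted: list, step_s: float = 0.01) -> None:
    """Stand-in for a streaming decode loop; each step simulates one token."""
    for token_index in range(n_tokens):
        await asyncio.sleep(step_s)      # cancellation is delivered at this await
        emitted.append(token_index)

async def complete_with_cancellation(n_tokens: int, cancel_after_s: float) -> list:
    """Start a completion, then cancel it as if the user typed another key."""
    emitted: list = []
    task = asyncio.create_task(decode(n_tokens, emitted))
    await asyncio.sleep(cancel_after_s)
    task.cancel()                        # request cancellation of the decode task
    try:
        await task
    except asyncio.CancelledError:
        pass                             # expected: the decode was aborted mid-flight
    return emitted

emitted_tokens = asyncio.run(complete_with_cancellation(n_tokens=30, cancel_after_s=0.05))
print(f"emitted {len(emitted_tokens)} of 30 tokens before cancellation")
```

In a real serving stack the same signal has to reach the scheduler so the sequence is dropped from the batch and its KV-cache blocks are freed.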
Cost discipline¶
- Aggressive batching can reduce $/token by up to ~10x at the cost of latency variance.
- Speculative decoding is a net win on $/token and latency for code (high acceptance rate in syntactic contexts).
- Model distillation: a 1B model + speculative may beat a 7B model on cost-per-acceptance.
→ Phase 22, Phase 27, Phase 33.
Prompt 5 — Design an evaluation harness for a frontier-model release¶
Clarifying questions¶
- What kind of release? Major version bump; full eval suite.
- How many evals? 100+ benchmarks plus internal red-team.
- Compute budget? 10k GPU-hours.
- Turnaround SLA? 24 hours from candidate model to go/no-go report.
Capacity math¶
- Eval set size: 100k prompts across benchmarks.
- Average output: 500 tokens (some long-form, some short).
- 100k prompts × 500 tokens = 50M output tokens at 1000 tokens/sec/GPU → 50k GPU-seconds → 14 GPU-hours per eval pass.
- Plus 100x for ablations and seeds → 1400 GPU-hours per eval suite. Fits the budget.
- LLM-as-judge for pairwise: 100k pairs × 2 calls (positional swap) × ~1k tokens = 200M judge tokens. Use Claude / GPT-4 API or internal judge model.
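The eval budget can be checked with a few lines of arithmetic. A minimal sketch using the figures from the bullets above; the function names are hypothetical.

```python
def eval_gpu_hours(prompts: int, out_tokens_per_prompt: int,
                   tokens_per_sec_per_gpu: float, passes: int = 1) -> float:
    """GPU-hours to generate all outputs, repeated `passes` times."""
    gpu_seconds = prompts * out_tokens_per_prompt / tokens_per_sec_per_gpu
    return gpu_seconds / 3600 * passes

def judge_tokens(pairs: int, tokens_per_call: int, calls_per_pair: int = 2) -> int:
    """Tokens sent to the judge; two calls per pair covers the positional swap."""
    return pairs * calls_per_pair * tokens_per_call

single_pass = eval_gpu_hours(100_000, 500, 1000)
full_suite = eval_gpu_hours(100_000, 500, 1000, passes=100)
budget_gpu_hours = 10_000

print(f"single pass ~{single_pass:.1f} GPU-hours, suite ~{full_suite:,.0f} GPU-hours")
print(f"fits budget: {full_suite <= budget_gpu_hours}, judge tokens {judge_tokens(100_000, 1000):,}")
```

Generation uses a small fraction of the 10k GPU-hour budget here, which leaves room for the judge model if it runs in-house rather than through an API.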
Whiteboard¶
[Model candidate registry]
|
v
[Eval orchestrator: forks one job per benchmark]
|
+--> Capabilities (MMLU, HumanEval, MATH, GPQA, ...)
+--> Safety (red-team prompts, refusal calibration)
+--> Alignment (constitutional adherence, sycophancy)
+--> Robustness (paraphrases, adversarial suffixes)
|
v
[Result aggregator]
|
v
[Report generator: html + JSON + dashboards]
|
v
[Human reviewer: go / no-go meeting]
Failure modes¶
- Eval set leakage: keep the private eval set air-gapped; rotate quarterly.
- LLM judge bias: position bias (swap), verbosity bias (length normalize), self-preference (use a different judge model).
- Flaky evals: use a sample size large enough that, when you claim a regression, the 95% CI of the new score excludes the previous release's score.
- Goodhart on benchmarks: never train on the eval set; spot-check overlap with training corpus; pre-register the "primary" metric before running.
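One way to make the regression claim concrete is a confidence interval on the accuracy difference between two releases. This sketch swaps in a two-sample normal approximation rather than comparing against a fixed previous score, and it assumes independent samples; paired scoring on the same prompts would give a tighter interval. The function names are hypothetical.

```python
import math
from statistics import NormalDist

def accuracy_diff_ci(correct_new: int, n_new: int, correct_old: int, n_old: int,
                     confidence: float = 0.95) -> tuple:
    """Normal-approximation CI for (new accuracy - old accuracy)."""
    p_new = correct_new / n_new
    p_old = correct_old / n_old
    std_err = math.sqrt(p_new * (1 - p_new) / n_new + p_old * (1 - p_old) / n_old)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p_new - p_old
    return diff - z * std_err, diff + z * std_err

def is_significant_regression(correct_new: int, n_new: int,
                              correct_old: int, n_old: int) -> bool:
    """True only if the whole interval lies below zero."""
    _, upper = accuracy_diff_ci(correct_new, n_new, correct_old, n_old)
    return upper < 0

# The same 2-point drop (82% -> 80%) at two sample sizes.
print(is_significant_regression(800, 1000, 820, 1000))
print(is_significant_regression(8000, 10_000, 8200, 10_000))
```

The same two-point drop is inconclusive at 1,000 prompts and clear at 10,000, which is the practical meaning of "sample size large enough".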
Cost discipline¶
- Cheap evals first (perplexity on held-out, small benchmarks): kill bad candidates early.
- Expensive evals (red-team, long-form judge) only on candidates that pass cheap gates.
- Reuse generations: judge multiple criteria from one generation.
→ Phase 20, Phase 37.
What an interviewer looks for¶
| Behavior | Signal |
|---|---|
| Asks clarifying questions before drawing | + senior |
| Capacity math on a napkin | + has seen production |
| Names a specific failure mode unprompted | + experienced |
| Talks about cost not just feasibility | + ready for a real role |
| Draws a 50-box diagram with no numbers | - shallow |
| Says "we'd use Kubernetes" without context | - cargo-cult |
| Never mentions failure modes | - never run a thing |
→ Next: 03-paper-read-drill.md