02 - Systems Design for LLMs: 5 Whiteboard Prompts

5 systems-design prompts for LLMs in production. Each comes with a whiteboard, capacity math (Little's law, GPU budget), failure modes, and cost discipline (CpQU from Phase 38).

How to attack a systems-design prompt

Clarify before drawing. "10M DAU" — what is the latency target? P50 or P99? Streaming or batch? What is the budget? "Don't know" is a valid answer from the interviewer; pin assumptions and proceed.
Capacity first. Tokens/sec needed → GPU count → memory needed → cost per request. Most candidates dive into K8s diagrams without this math; that is the failure signal.
Failure modes second. Enumerate at least 4: model OOM, queue saturation, hot-shard, network partition, quota exhaustion. Lab interviewers love this.
Cost discipline last. Pick the cheapest design that meets the SLA. "Cost per quality unit" (CpQU, Phase 38) is the framing.

Prompt 1 - Design a chatbot serving 10M DAU

Clarifying questions (ask these out loud)

DAU vs MAU? Assume 10M DAU.
Avg conversations per user per day? Assume 5.
Avg turns per conversation? Assume 8.
Avg tokens per turn? Input 200, output 300.
Latency SLA? P50 < 1s to first token, P99 < 3s; full response < 15s.
Model size? Assume 70B dense.

Capacity math

Total turns/day: 10M * 5 * 8 = 400M turns/day.
Peak QPS (assume 3x average for daily peak): 400M / 86400 * 3 ≈ 14k QPS.
Tokens per turn: 200 in + 300 out = 500 total. Output tokens dominate cost (decode is sequential, prefill is parallel).
Total output tokens/sec at peak: 14k * 300 = 4.2M out tokens/sec.
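A 70B model in fp16 needs ~140 GB for weights alone, so it does not fit on one 80 GB H100; it is served tensor-parallel (TP=8 in the whiteboard below). Single-stream decode runs at roughly 30-50 tokens/sec; with continuous batching (vLLM, batch ~32 effective) you get 1000-2000 tokens/sec/GPU, amortized across the replica.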
GPUs needed for decode: 4.2M / 1500 ≈ 2800 H100s. Round up for headroom → 3500 GPUs.
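
The chain above (turns/day → peak QPS → output tokens/sec → GPUs) can be sketched as a small calculator. This is a minimal sketch under the same assumptions as the text; the function and parameter names are illustrative, not from any library. Without the intermediate rounding to 14k QPS, the result lands slightly under 2800 GPUs.

```python
import math

SECONDS_PER_DAY = 86_400

def decode_fleet(dau, convs_per_user, turns_per_conv, out_tokens_per_turn,
                 peak_factor, tokens_per_sec_per_gpu):
    """Return (peak_qps, peak_output_tokens_per_sec, gpus_for_decode)."""
    turns_per_day = dau * convs_per_user * turns_per_conv
    peak_qps = turns_per_day / SECONDS_PER_DAY * peak_factor
    peak_out_tps = peak_qps * out_tokens_per_turn
    # Only decode is sized here; prefill and headroom are added separately.
    gpus = math.ceil(peak_out_tps / tokens_per_sec_per_gpu)
    return peak_qps, peak_out_tps, gpus

peak_qps, peak_out_tps, gpus = decode_fleet(
    dau=10_000_000, convs_per_user=5, turns_per_conv=8,
    out_tokens_per_turn=300, peak_factor=3, tokens_per_sec_per_gpu=1500)
print(f"peak QPS ~{peak_qps:,.0f}; output tokens/sec ~{peak_out_tps:,.0f}; GPUs ~{gpus:,}")
```

Keeping every assumption as a named argument makes it easy to re-run the math when the interviewer changes one of them.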

Whiteboard

        Client                                    Region B (warm standby)
           |                                              ^
           v                                              |
   [Edge: TLS, auth, abuse]                   [Region failover, BGP/DNS]
           |
           v
   [Load balancer: L7, header-aware] -- routes by user_id sticky for KV-cache reuse
           |
           v
   [API gateway: rate-limit, billing meter]
           |
           v
   [Conversation service]
        |        \
        |         \--> [KV-cache reuse store: per-user warm prefix on Redis-tier]
        v
   [LLM inference fleet: vLLM, continuous batching, paged attention]
        |     |     |
       GPU   GPU   GPU  ... (sharded by model replica, ~350-440 replicas of TP=8 70B for 2800-3500 GPUs)
        |
        v
   [Streaming: SSE / WebSocket back to client]
   [Async sink: logs, prompt/response store, eval sampling]

Failure modes

GPU OOM on long prompt: mitigate with max-context enforcement at gateway; reject 422 before scheduling.
Hot shard (one user spamming): rate-limit at gateway; backpressure; per-tenant queue isolation.
Cold start: keep N+2 warm replicas per region; pre-warm KV cache for top 1% of users.
Queue saturation: Little's law, L = lambda * W. If the average request takes 8s and a replica has 1000 in-flight slots, max sustained throughput is 1000 / 8 = 125 QPS per replica. Plan for peaks, not averages.
Cascading retry storms: exponential backoff with jitter; circuit breaker at gateway.
NCCL deadlock during tensor-parallel inference: watchdog + replica restart.
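
The Little's law bullet can be checked in both directions: slots and latency give a QPS ceiling, and a QPS target and latency give the number of requests in flight. A minimal sketch, using the 8s latency and 1000 slots from the example; the helper names are hypothetical.

```python
def max_sustained_qps(in_flight_slots, avg_latency_s):
    """Little's law rearranged: lambda = L / W."""
    return in_flight_slots / avg_latency_s

def slots_needed(target_qps, avg_latency_s):
    """Little's law as stated: L = lambda * W."""
    return target_qps * avg_latency_s

print(max_sustained_qps(1000, 8))   # QPS ceiling for 1000 slots at 8s each
print(slots_needed(14_000, 8))      # concurrent requests at the 14k QPS peak
```

The second call is the useful one on a whiteboard: at the peak from the capacity math, the whole fleet has to hold on the order of 100k requests in flight if each takes 8s.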

Cost discipline (CpQU)

3500 H100s × $2/hr × 24h × 30d = ~$5M/month at the assumed $2/GPU-hour rate.
Optimization levers: (1) quantize to int8 / fp8 → up to ~2x throughput, ~$2.5M/month, small quality hit; (2) GQA / MQA model → smaller KV cache → larger batches; (3) speculative decoding with a draft model → up to 2-3x faster decode, with the gain depending on batch size and acceptance rate; (4) prompt caching (Anthropic-style) → recycle prefix compute for warm users.
CpQU framing: measure quality (e.g. preference rate vs reference) per dollar; pick the design that maximizes quality/dollar at the SLA.
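
A short sketch of the cost line and the CpQU comparison. Evaluating the product directly (3500 GPUs × $2/hr × 24h × 30d) gives about $5.04M/month. The quality scores below are hypothetical placeholders for a measured preference rate, and the quantized design assumes the 2x throughput gain halves the GPU count.

```python
def monthly_cost_usd(gpus, usd_per_gpu_hour, hours_per_day=24, days=30):
    return gpus * usd_per_gpu_hour * hours_per_day * days

def cost_per_quality_unit(monthly_usd, quality):
    """CpQU: dollars spent per unit of measured quality (lower is better)."""
    return monthly_usd / quality

baseline = monthly_cost_usd(3500, 2.0)    # fp16 fleet
quantized = monthly_cost_usd(1750, 2.0)   # assumes 2x throughput -> half the GPUs

# (monthly cost, quality); quality values are hypothetical, not measured.
designs = {"fp16": (baseline, 0.50), "fp8": (quantized, 0.48)}
best = min(designs, key=lambda name: cost_per_quality_unit(*designs[name]))
print(f"fp16 ${baseline:,.0f}/mo, fp8 ${quantized:,.0f}/mo, best CpQU: {best}")
```

In practice the comparison only applies to designs that already meet the latency SLA; filter first, then minimize CpQU.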

→ Phase 33, Phase 34, drill 12.

Prompt 2 - Design a coding-assistant agent with tool use

Clarifying questions

Tools available? Assume: filesystem read/write, shell exec (sandboxed), web search, code interpreter.
Concurrency per user? Assume 1 active agent per user (no parallel sub-agents at v1).
Latency? P50 < 30s end-to-end (multi-step tasks); streaming visible.
Safety? Sandbox must contain shell exec; no arbitrary network access.

Capacity math

Each agent turn: 1-5 tool calls; each tool call adds ~500 input tokens (tool spec + result) and 200 output tokens.
A complex task: 10 turns × 4 tool calls × 700 tokens = ~28k tokens of LLM work per task.
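If a task takes 30s and ~28k tokens, that is ~930 tokens/sec per active agent. Most of it is prefill: ~20k input tokens (parallel) versus ~8k output tokens (sequential). The 8k output tokens alone need ~270 tokens/sec of decode, which is above the 30-50 tokens/sec single-stream rate assumed for a 70B model in Prompt 1, so a 30s P50 implies simpler typical tasks, a faster model, or speculative decoding.
For 100k concurrent active agents: 100k streams. If a vLLM instance handles 32 concurrent streams: 100k / 32 ≈ 3100 vLLM instances needed at peak.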
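
Splitting the per-task token count into prefill and decode makes the latency question concrete, since only the output tokens are generated sequentially. A minimal sketch using the numbers above; the function name is illustrative.

```python
import math

def task_tokens(turns, tool_calls_per_turn, in_tokens_per_call, out_tokens_per_call):
    """Return (input_tokens, output_tokens) of LLM work for one task."""
    calls = turns * tool_calls_per_turn
    return calls * in_tokens_per_call, calls * out_tokens_per_call

prefill_tokens, decode_tokens = task_tokens(10, 4, 500, 200)
task_seconds = 30
decode_rate = decode_tokens / task_seconds   # tokens/sec the stream must sustain
instances = math.ceil(100_000 / 32)          # 100k agents, 32 streams per instance

print(prefill_tokens, decode_tokens, round(decode_rate), instances)
```

The required decode rate is the number to compare against whatever single-stream speed the chosen model actually delivers.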

Whiteboard

   Client (IDE)
      |
      v
   [Agent orchestrator]
      |
      v
   [Planner LLM]  <----+
      |               |  (loop: think -> act -> observe)
      v               |
   [Tool router]      |
   / | \              |
  fs shell web        |  [Trace store: every tool call, every model turn]
  |   |   |           |
  v   v   v           |
 [Sandbox VM per session: ephemeral, network-isolated, FS-jailed]
      |               |
      +---------------+

Failure modes

Infinite loop / runaway: hard cap on tool calls per task (e.g. 50); cap on total tokens; cap on wall time.
Sandbox escape: seccomp + user namespace + no host volume mounts; gVisor or Firecracker.
Tool flakes (HTTP 503): retry with backoff; agent should observe failure and adapt — not retry forever.
Adversarial tool input (prompt injection from web search): never trust tool outputs as instructions; mark provenance; use structured outputs.
Cost spike from a single user: per-user token budget; alert at 80%.
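
The runaway-loop caps (tool calls, tokens, wall time) can be enforced by one small guard that the orchestrator charges before every step. This is a sketch, not a production design: `TaskBudget` and `BudgetExceeded` are hypothetical names, and only the 50-call cap comes from the text; the token and wall-time limits are placeholder values.

```python
import time
from dataclasses import dataclass, field

class BudgetExceeded(Exception):
    pass

@dataclass
class TaskBudget:
    max_tool_calls: int = 50        # from the text
    max_tokens: int = 200_000       # placeholder value
    max_wall_s: float = 300.0       # placeholder value
    tool_calls: int = 0
    tokens: int = 0
    started: float = field(default_factory=time.monotonic)

    def charge(self, tool_calls=0, tokens=0):
        """Reserve budget for the next step, or raise before it runs."""
        if self.tool_calls + tool_calls > self.max_tool_calls:
            raise BudgetExceeded("tool-call cap reached")
        if self.tokens + tokens > self.max_tokens:
            raise BudgetExceeded("token cap reached")
        if time.monotonic() - self.started > self.max_wall_s:
            raise BudgetExceeded("wall-time cap reached")
        self.tool_calls += tool_calls
        self.tokens += tokens

budget = TaskBudget()
try:
    while True:                     # stand-in for the think -> act -> observe loop
        budget.charge(tool_calls=1, tokens=700)
except BudgetExceeded as exc:
    print(f"stopped after {budget.tool_calls} tool calls: {exc}")
```

Checking before the step runs means a task never overshoots a cap, which keeps the per-user budget alert meaningful.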

Cost discipline

Most agent tasks are dominated by repeated LLM context (system prompt + tool spec + history grows each turn).
Lever: prompt caching — cache the static prefix (system + tools); only re-prefill the dynamic suffix. Anthropic and OpenAI both expose this.
Lever: smaller "router" model for tool selection; route to large model only for "think" steps.
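
A rough way to size the prompt-caching lever is to count prefilled tokens over a task with and without a cached prefix. The sketch below assumes a hypothetical 3000-token static prefix (system prompt plus tool specs) and 2800 tokens of new context per turn (4 tool calls × 700 tokens, from the capacity math). It treats cached tokens as free, which real APIs do not; they bill cached reads at a discount.

```python
def prefill_tokens(static_prefix, growth_per_turn, turns, cached="none"):
    """Tokens prefilled over a task. cached: 'none', 'static', or 'history'."""
    if cached not in ("none", "static", "history"):
        raise ValueError(f"unknown cache mode: {cached}")
    total = 0
    for t in range(turns):
        history = growth_per_turn * t
        if cached == "none" or t == 0:
            total += static_prefix + history   # everything is re-prefilled
        elif cached == "static":
            total += history                   # prefix reused, history re-prefilled
        else:
            total += growth_per_turn           # only the newest turn is prefilled
    return total

for mode in ("none", "static", "history"):
    print(mode, prefill_tokens(3000, 2800, 10, cached=mode))
```

Under these assumptions the static prefix is the smaller share; most of the repeated work is the growing history, so a cache that also covers prior turns matters more as tasks get longer.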

→ Phase 31, Phase 32, Phase 37.

Prompt 3 - Design a multi-tenant fine-tuning service

Clarifying questions

Base models? Assume 3 sizes: 7B, 13B, 70B.
Fine-tuning method? LoRA only at v1 (cheap, isolatable).
Customer scale? 10k customers, average 100k training examples each.
SLA on training start? < 5 min from job submission.

Capacity math

LoRA training on 7B model with 100k examples (batch 8, seq 2048, 3 epochs): roughly 2 H100-hours.
70B LoRA on 100k examples: ~20 H100-hours.
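If 100 jobs/hour are submitted on average and the mix is dominated by 7B jobs (~2 H100-hours each): ~200 H100-hours/hour → 200 H100s continuously busy. A 70B job costs as much as ten 7B jobs, so even a small 70B share raises this noticeably.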
Plus inference fleet for serving the fine-tunes: see prompt 1 with adapter swap.
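
The fleet size depends heavily on the job mix, so it helps to compute it from an explicit mix instead of a single average. In this sketch the 7B and 70B costs come from the text; the 13B cost and the 80/15/5 mix are assumptions for illustration only.

```python
def busy_gpus(jobs_per_hour, mix, gpu_hours_per_job):
    """Steady-state GPUs kept busy = arrival rate x mean GPU-hours per job."""
    mean_hours = sum(share * gpu_hours_per_job[size] for size, share in mix.items())
    return jobs_per_hour * mean_hours

GPU_HOURS = {"7B": 2, "13B": 4, "70B": 20}   # 13B value is an assumption

all_small = busy_gpus(100, {"7B": 1.0}, GPU_HOURS)
mixed = busy_gpus(100, {"7B": 0.80, "13B": 0.15, "70B": 0.05}, GPU_HOURS)
print(f"all 7B: {all_small:.0f} GPUs; 80/15/5 mix: {mixed:.0f} GPUs")
```

With only 5% of jobs on 70B, the steady-state fleet grows by more than half, which is why the mix is worth asking about as a clarifying question.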

Whiteboard

   Customer dashboard
        |
        v
   [API: dataset upload, job submit]
        |
        v
   [Object store: encrypted, per-tenant KMS key]
        |
        v
   [Job queue: priority by tier; per-tenant fairness]
        |
        v
   [Training scheduler]
        |
        v
   [Trainer pods: PEFT / nanotron / Axolotl]
        |          (LoRA adapter saved per tenant)
        v
   [Adapter store: tenant_id -> base_model -> adapter blob]
        |
        v
   [Inference fleet: vLLM + multi-LoRA serving (load N adapters on one base)]
        |
        v
   Customer hits inference endpoint with model_id = (base, adapter_id)

Failure modes

Data contamination across tenants: strict isolation; per-tenant KMS; audit logging.
Bad data → divergent training: validate before scheduling; surface NaN-loss to the customer with a useful error.
Adapter hot-loading at inference: vLLM multi-LoRA can serve hundreds of adapters on one base; cap and LRU-evict.
Resource exhaustion (one tenant monopolizes the queue): per-tenant quota; weighted-fair-queue scheduler.
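
The "cap and LRU-evict" policy for resident adapters can be sketched with an ordered dictionary. This is a generic illustration, not vLLM's internal mechanism; `AdapterCache` and `load_fn` are hypothetical names, and `load_fn` stands in for fetching an adapter blob from the adapter store.

```python
from collections import OrderedDict

class AdapterCache:
    """Keeps at most `capacity` adapters resident; evicts the least recently used."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self._load = load_fn
        self._resident = OrderedDict()

    def get(self, adapter_id):
        if adapter_id in self._resident:
            self._resident.move_to_end(adapter_id)   # mark as most recently used
            return self._resident[adapter_id]
        adapter = self._load(adapter_id)             # cold load from the store
        self._resident[adapter_id] = adapter
        if len(self._resident) > self.capacity:
            self._resident.popitem(last=False)       # drop least recently used
        return adapter

    def resident(self):
        return list(self._resident)

cache = AdapterCache(capacity=2, load_fn=lambda aid: f"weights-for-{aid}")
for aid in ["tenant-a", "tenant-b", "tenant-a", "tenant-c"]:
    cache.get(aid)
print(cache.resident())
```

A real server would also pin adapters that have requests in flight so they cannot be evicted mid-decode.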

Cost discipline

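LoRA on 7B is ~$10 of compute; price at $50 → 5x margin. 70B is ~$200 → price at $500 → 2.5x margin. These compute figures sit above raw GPU time at the $2/H100-hour used in Prompt 1 (~$4 and ~$40 for 2 and 20 H100-hours), so they presumably include overhead; state the rate you assume when quoting them.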
Use spot instances for training jobs that can checkpoint cheaply (LoRA: yes, checkpoints are tiny).

→ Phase 28, Phase 38.

Prompt 4 - Design a low-latency inference service for code completion (Copilot-style)

Clarifying questions

P99 latency? < 200ms to first character (roughly the bar users expect from Copilot-style completion).
Model size? Assume 7B specialized code model.
Avg suggestion length? 30 tokens (suggestions are short).
Cancellation rate? ~70% — users keep typing and cancel.

Capacity math

30 output tokens × ~30ms/token (single-stream H100, 7B) = 900ms — too slow.
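Need: batching to amortize weight reads across streams (throughput), and speculative decoding to cut single-stream latency.
With speculative decoding (draft model 1B, target 7B, acceptance rate 0.7), an optimistic effective rate is ~120 tokens/sec/stream → 30 tokens in 250ms. That is the time for the full suggestion; the 200ms target applies to the first character, which streaming delivers earlier. Continuous batching keeps queueing delay low under load.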
Cancellation kills 70% of decodes mid-flight — preempt aggressively; don't waste compute on cancelled streams.
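
The speculative-decoding gain can be estimated with the standard expected-acceptance formula, assuming each drafted token is accepted independently with probability alpha and the draft proposes gamma tokens per target pass. The draft cost ratio of 0.1 below is an assumption, not a measurement. Under this model an acceptance rate of 0.7 caps the gain at about 1 / (1 - 0.7) ≈ 3.3x before draft cost, so reaching ~120 tokens/sec from a ~33 tokens/sec baseline sits at the optimistic end.

```python
def expected_tokens_per_pass(alpha, gamma):
    """Expected tokens emitted per target-model pass with gamma drafted tokens."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def speedup(alpha, gamma, draft_cost_ratio):
    """Latency speedup over plain decoding.

    draft_cost_ratio = time of one draft step / time of one target step.
    """
    return expected_tokens_per_pass(alpha, gamma) / (gamma * draft_cost_ratio + 1)

baseline_tps = 1000 / 30                      # ~30 ms/token single-stream
for gamma in (2, 4, 8):
    s = speedup(alpha=0.7, gamma=gamma, draft_cost_ratio=0.1)
    print(f"gamma={gamma}: {s:.2f}x -> ~{baseline_tps * s:.0f} tokens/sec")
```

Sweeping gamma shows the trade-off: longer drafts raise the expected tokens per pass but pay more draft time, so the best gamma depends on both the acceptance rate and the draft model's speed.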

Whiteboard

   IDE plugin -- debounces keystrokes (150ms)
      |
      v
   [Edge gateway: cancellation-aware]
      |
      v
   [Inference fleet: vLLM with speculative decoding]
        |
        v
   [Result streamed token-by-token; client cancels on next keystroke]
        |
        +--> [Cancellation propagates to scheduler immediately; abort decode]

Failure modes

Cancellation race: new keystroke arrives while old request is mid-flight; cancellation token; idempotent client.
Cache pollution: same prefix typed by many users — share KV cache across users iff prompt does not contain PII (security gate).
Quality regression after model update: shadow traffic; A/B with offline eval on internal repos.
Privacy: no logging of user code by default; opt-in for training data.
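
The cancellation path (client cancels on the next keystroke, the server aborts the decode and frees its slot) can be illustrated with `asyncio` task cancellation. This is a toy model: the sleep stands in for one decode step, and the `state` dictionary is a hypothetical stand-in for scheduler bookkeeping.

```python
import asyncio

async def decode(n_tokens, state):
    """Stand-in for a decode loop; each sleep models one decode step."""
    try:
        for i in range(n_tokens):
            await asyncio.sleep(0.01)
            state["emitted"].append(i)
    except asyncio.CancelledError:
        state["slot_released"] = True   # e.g. free the KV-cache slot
        raise                           # let the cancellation propagate

async def type_then_cancel():
    state = {"emitted": [], "slot_released": False}
    task = asyncio.create_task(decode(30, state))
    await asyncio.sleep(0.035)          # next keystroke arrives mid-decode
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return state

state = asyncio.run(type_then_cancel())
print(len(state["emitted"]), state["slot_released"])
```

The cleanup lives in the decode loop itself, so the slot is released no matter which layer triggered the cancel.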

Cost discipline

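Aggressive batching can reduce $/token by up to ~10x at the cost of latency variance.
Speculative decoding is usually a net latency win for code (high acceptance rate in syntactic contexts); its effect on $/token depends on acceptance rate and batch size, since rejected draft tokens are wasted compute.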
Model distillation: a 1B model + speculative may beat a 7B model on cost-per-acceptance.

→ Phase 22, Phase 27, Phase 33.

Prompt 5 - Design an evaluation harness for a frontier-model release

Clarifying questions

What kind of release? Major version bump; full eval suite.
How many evals? 100+ benchmarks plus internal red-team.
Compute budget? 10k GPU-hours.
Turnaround SLA? 24 hours from candidate model to go/no-go report.

Capacity math

Eval set size: 100k prompts across benchmarks.
Average output: 500 tokens (some long-form, some short).
100k prompts × 500 tokens = 50M output tokens at 1000 tokens/sec/GPU → 50k GPU-seconds → 14 GPU-hours per eval pass.
Plus 100x for ablations and seeds → 1400 GPU-hours per eval suite. Fits the budget.
LLM-as-judge for pairwise: 100k pairs × 2 calls (positional swap) × ~1k tokens = 200M judge tokens. Use Claude / GPT-4 API or internal judge model.
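
The eval budget can be checked against both constraints from the clarifying questions: the 10k GPU-hour budget and the 24-hour turnaround. A minimal sketch with the numbers above; it ignores judge latency and scheduling overhead, so the GPU count is a lower bound.

```python
import math

prompts = 100_000
out_tokens_per_prompt = 500
tokens_per_sec_per_gpu = 1000
runs = 100                      # ablations and seeds
budget_gpu_hours = 10_000
sla_hours = 24

pass_gpu_hours = prompts * out_tokens_per_prompt / tokens_per_sec_per_gpu / 3600
suite_gpu_hours = pass_gpu_hours * runs
min_gpus_for_sla = math.ceil(suite_gpu_hours / sla_hours)
judge_tokens = prompts * 2 * 1000   # pairs x positional swap x ~1k tokens

print(f"{pass_gpu_hours:.1f} GPU-h/pass, {suite_gpu_hours:.0f} GPU-h/suite, "
      f">= {min_gpus_for_sla} GPUs for the SLA, {judge_tokens:,} judge tokens")
```

The budget check passes easily, but the turnaround adds a second requirement: the suite has to run on a few dozen GPUs in parallel to finish inside 24 hours.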

Whiteboard

   [Model candidate registry]
        |
        v
   [Eval orchestrator: forks one job per benchmark]
        |
        +--> Capabilities (MMLU, HumanEval, MATH, GPQA, ...)
        +--> Safety (red-team prompts, refusal calibration)
        +--> Alignment (constitutional adherence, sycophancy)
        +--> Robustness (paraphrases, adversarial suffixes)
        |
        v
   [Result aggregator]
        |
        v
   [Report generator: html + JSON + dashboards]
        |
        v
   [Human reviewer: go / no-go meeting]

Failure modes

Eval set leakage: keep the private eval set air-gapped; rotate quarterly.
LLM judge bias: position bias (swap), verbosity bias (length normalize), self-preference (use a different judge model).
Flaky evals: sample size large enough that the 95% CI excludes the previous release's score if you mean to claim a regression.
Goodhart on benchmarks: never train on the eval set; spot-check overlap with training corpus; pre-register the "primary" metric before running.
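
For the flaky-evals point, a normal-approximation confidence interval on a pass rate shows how many prompts a benchmark needs before a regression claim is defensible. This is a sketch that treats the previous release's score as a fixed reference; the 0.80 and 0.78 scores are hypothetical.

```python
import math

Z95 = 1.96   # two-sided 95% normal quantile

def ci95(p_hat, n):
    """Normal-approximation 95% CI for a pass rate measured on n prompts."""
    half = Z95 * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def min_samples(p, margin):
    """Prompts needed so the 95% CI half-width is at most `margin`."""
    return math.ceil(Z95 ** 2 * p * (1 - p) / margin ** 2)

n = min_samples(p=0.80, margin=0.01)
low, high = ci95(0.78, n)
print(f"n={n}; CI for 0.78: ({low:.3f}, {high:.3f})")
```

With a benchmark of a few hundred prompts, a two-point drop is inside the noise; the interval only excludes the old score once the sample is in the thousands.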

Cost discipline

Cheap evals first (perplexity on held-out, small benchmarks): kill bad candidates early.
Expensive evals (red-team, long-form judge) only on candidates that pass cheap gates.
Reuse generations: judge multiple criteria from one generation.

→ Phase 20, Phase 37.

What an interviewer looks for

Behavior	Signal
Asks clarifying questions before drawing	+ senior
Capacity math on a napkin	+ has seen production
Names a specific failure mode unprompted	+ experienced
Talks about cost not just feasibility	+ ready for a real role
Draws a 50-box diagram with no numbers	- shallow
Says "we'd use Kubernetes" without context	- cargo-cult
Never mentions failure modes	- never run a thing

→ Next: 03-paper-read-drill.md