
03 — Little's law, queue depth, and capacity sizing¶

Little's law is a three-letter equation that ties together throughput, latency, and queue depth. Without it, sizing a service is guesswork. With it, it is arithmetic.

Little's law¶

For any queueing system in steady state with arrival rate \(\lambda\) (requests per second) and average time-in-system \(W\) (seconds per request):

\[L = \lambda \cdot W\]

where \(L\) is the average number of requests in the system. The derivation is purely combinatorial: it makes no assumptions about the distribution of arrivals or service times. The law holds as long as the system is stable, meaning the long-run arrival rate stays below capacity so the queue does not grow without bound.
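The counting argument can be checked numerically. The sketch below simulates a single-server FIFO queue with uniformly distributed gaps and service times (both distributions and all constants are arbitrary choices for illustration, not part of the lab code), then measures \(L\) by integrating the number of requests in the system over time and compares it with \(\lambda \cdot W\).

```python
import random

def simulate_fifo(n, seed=0):
    """Single-server FIFO queue; returns arrival and departure times."""
    rng = random.Random(seed)
    t = 0.0
    prev_depart = 0.0
    arrivals, departures = [], []
    for _ in range(n):
        t += rng.uniform(0.0, 0.2)            # inter-arrival gap, mean 0.1 s
        start = max(t, prev_depart)           # wait if the server is busy
        prev_depart = start + rng.uniform(0.0, 0.16)  # service, mean 0.08 s
        arrivals.append(t)
        departures.append(prev_depart)
    return arrivals, departures

def time_average_in_system(arrivals, departures):
    """Integrate N(t), the number of requests in the system, over the run."""
    events = sorted([(a, +1) for a in arrivals] + [(d, -1) for d in departures])
    area, count, last_t = 0.0, 0, 0.0
    for t, delta in events:
        area += count * (t - last_t)          # N(t) is constant between events
        count += delta
        last_t = t
    return area / last_t                      # window is [0, last departure]

arrivals, departures = simulate_fifo(10_000)
horizon = max(departures)
lam = len(arrivals) / horizon                                         # lambda
w = sum(d - a for a, d in zip(arrivals, departures)) / len(arrivals)  # W
l_measured = time_average_in_system(arrivals, departures)             # L

print(f"L measured = {l_measured:.4f}, lambda * W = {lam * w:.4f}")
```

The two numbers agree up to floating-point error because, on a window where every request has both arrived and left, the area under \(N(t)\) is exactly the sum of the individual times-in-system. Swapping in a different distribution changes the values but not the agreement.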

Why it matters for inference serving¶

You have three knobs:

Arrival rate \(\lambda\): how many requests/s come in. Set by your traffic.
Time-in-system \(W\): end-to-end latency per request (queue wait + service). What users feel.
Concurrency \(L\): how many requests are "alive" inside the system.

Little's law says you only get to pick two. The third follows.

Example. The tutor takes \(W_\text{service} = 200\) ms per request on average. You want \(\lambda = 50\) req/s. By Little's law (with \(W = W_\text{service}\) in the best case, ignoring queue wait):

\[L = 50 \cdot 0.2 = 10\]

You need 10 requests in flight to sustain 50 req/s. If your server's MAX_INFLIGHT is 4, the ceiling is \(4 / 0.2 = 20\) req/s, so steady state can't reach 50 req/s: requests pile up in the queue, \(W\) grows, and \(L\) grows until you OOM or the load balancer times out.

The bound: \(L \le L_\text{max}\)¶

The system has a hard capacity \(L_\text{max}\) — for our scheduler, it's the MAX_INFLIGHT parameter. From Little's law:

\[\lambda_\text{max} = \frac{L_\text{max}}{W}\]

If \(W = 200\) ms and \(L_\text{max} = 8\), then \(\lambda_\text{max} = 40\) req/s. Anything above that and the queue grows without bound.

This is the capacity ceiling. Plot it: \(\lambda\) vs \(W\) for fixed \(L_\text{max}\) — you'll see a knee at \(\lambda_\text{max}\) where latency goes vertical.
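Both directions of the law fit in two one-line functions. This is a minimal sketch; the function names are hypothetical, and \(W\) is held at the 200 ms best case used in the examples above.

```python
def required_concurrency(arrival_rate_rps, time_in_system_s):
    """Little's law: L = lambda * W."""
    return arrival_rate_rps * time_in_system_s

def max_sustainable_rate(max_inflight, time_in_system_s):
    """Little's law rearranged: lambda_max = L_max / W."""
    return max_inflight / time_in_system_s

W = 0.2  # seconds per request, best case with no queue wait

print(f"L needed for 50 req/s: {required_concurrency(50, W):.1f}")
for max_inflight in (1, 2, 4, 8, 16, 32):
    ceiling = max_sustainable_rate(max_inflight, W)
    print(f"MAX_INFLIGHT={max_inflight:>2} -> lambda_max = {ceiling:.0f} req/s")
```

The ceiling scales linearly with MAX_INFLIGHT only while \(W\) stays fixed. On real hardware \(W\) tends to rise as more requests share the accelerator, which is why the ceiling still has to be measured.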

What goes wrong when you exceed capacity¶

If \(\lambda > \lambda_\text{max}\):

The ready queue grows linearly with time.
Each request's \(W\) = (its queue wait) + (its service time). Queue wait grows linearly with queue length.
Memory usage grows linearly with queue length (each request is a few KB + its KV cache).
Eventually: timeout cascade, OOM, or load balancer starts shedding.
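A fluid approximation makes the linear growth concrete. The sketch below assumes a constant arrival rate and a constant drain rate of \(\lambda_\text{max}\), which real traffic will not match exactly; the function names are hypothetical.

```python
def backlog_after(seconds, arrival_rate_rps, max_rate_rps):
    """Queued requests after `seconds` of sustained load (fluid model).

    The backlog grows at (lambda - lambda_max) req/s when overloaded
    and stays at zero otherwise.
    """
    return max(0.0, arrival_rate_rps - max_rate_rps) * seconds

def queue_wait_s(backlog, max_rate_rps):
    """Wait for a request that joins behind `backlog` queued requests."""
    return backlog / max_rate_rps

# lambda = 50 req/s against lambda_max = 40 req/s: 10 req/s of excess.
for t in (10, 60, 300):
    q = backlog_after(t, 50, 40)
    print(f"t={t:>3}s  backlog={q:.0f}  wait for newest request={queue_wait_s(q, 40):.1f}s")
```

With only 10 req/s of excess, one minute of overload leaves 600 requests queued and the newest one waiting about 15 seconds before service even starts.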

This is why production systems implement admission control — reject (with 503) or shed requests when the queue is too deep. Better to reject 10% of requests than to make 100% of users wait 30 seconds.

For lab 03: add a MAX_QUEUE_DEPTH parameter; if exceeded, return HTTP 503.
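One possible shape for that check, as a framework-free sketch: the class name, the value 64, and the use of 202 for an accepted request are illustrative assumptions, not the lab's actual interface.

```python
from collections import deque

MAX_QUEUE_DEPTH = 64  # hypothetical value; tune it from a load test

class AdmissionQueue:
    """Bounded ready queue that sheds load instead of growing without limit."""

    def __init__(self, max_depth=MAX_QUEUE_DEPTH):
        self.max_depth = max_depth
        self._pending = deque()

    def try_admit(self, request):
        """Return an HTTP-style status: 202 if queued, 503 if shed."""
        if len(self._pending) >= self.max_depth:
            return 503                 # queue too deep: reject immediately
        self._pending.append(request)
        return 202

    def next_request(self):
        """Hand the oldest queued request to the scheduler, or None if empty."""
        return self._pending.popleft() if self._pending else None

    def depth(self):
        return len(self._pending)

queue = AdmissionQueue(max_depth=2)
print([queue.try_admit(f"req-{i}") for i in range(3)])
```

Rejecting at admission time keeps memory bounded, because a shed request never allocates queue space or KV cache.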

A subtle distinction: average vs p95¶

Little's law uses averages. But users care about tails. If your mean \(W = 100\) ms but the p95 is 1500 ms, the user experience is bad even though throughput looks fine.

Continuous batching helps p95 specifically because it lets fast requests leave early — they don't get stuck behind slow ones. Static batching keeps the average roughly the same but ruins p95.

Operational rule of thumb: Size \(L_\text{max}\) for your p95 service time, not your mean.
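One way to read that rule is to plug the p95 service time into \(L = \lambda \cdot W\) instead of the mean. The sketch below does that with a nearest-rank percentile; the sample latencies are made up for illustration, and `size_l_max` is a hypothetical helper.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile; `pct` is an integer from 1 to 100."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100   # integer ceiling of pct*n/100
    return ordered[max(rank, 1) - 1]

def size_l_max(target_rps, service_times_s, pct=95):
    """Concurrency needed if every request took the pct-th percentile time."""
    return math.ceil(target_rps * percentile(service_times_s, pct))

# Made-up measurements: most requests are fast, a few are very slow.
samples = [0.1] * 18 + [1.5] * 2
mean = sum(samples) / len(samples)

print(f"mean = {mean:.2f} s, p95 = {percentile(samples, 95):.2f} s")
print(f"L_max sized on p95 for 50 req/s: {size_l_max(50, samples)}")
```

Sizing on p95 is deliberately pessimistic. Whether that much concurrency fits in memory is a separate check, and the final value should still come from a load test.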

Backpressure: telling the client to slow down¶

When the queue is near full, you have three options:

Block: Accept the request, make the client wait in the queue. Bad for memory.
Reject: Return HTTP 503 (Service Unavailable) with a Retry-After header. Pushes the problem upstream — to a load balancer or to the client's retry logic.
Degrade: Return a faster, lower-quality answer (e.g., skip the explanation, just return the corrected sentence). Domain-specific.

For phase 33: implement option 2 (Reject). It's the right default.
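A rejection is more useful when Retry-After reflects how long the backlog will take to drain. This is a sketch under the assumption that the server can estimate its current drain rate; `reject_response` and the response tuple shape are hypothetical and would need adapting to the web framework in use.

```python
import math

def reject_response(queue_depth, drain_rate_rps):
    """Build a 503 whose Retry-After estimates the time to drain the queue.

    Returns (status, headers, body). Retry-After is given in whole seconds,
    with a floor of 1 so clients never retry immediately.
    """
    retry_after_s = max(1, math.ceil(queue_depth / drain_rate_rps))
    headers = {"Retry-After": str(retry_after_s)}
    return 503, headers, "queue full, retry later"

status, headers, body = reject_response(queue_depth=80, drain_rate_rps=40)
print(status, headers, body)
```

Clients that honor the header spread their retries out instead of hammering a server that is already saturated.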

Health checks¶

Two endpoints, both required:

/healthz (liveness): "Is the process alive?" Returns 200 whenever the process can respond at all, and 503 only if it is fundamentally broken. The orchestrator uses this to decide whether to restart the process.
/readyz (readiness): "Are you ready to take traffic?" — returns 200 only if the model is loaded AND the queue depth is below a threshold. The load balancer uses this to decide whether to send traffic.

If /readyz returns 503, the load balancer routes traffic to other replicas. This is passive backpressure — better than explicit rejection because it's transparent to clients.
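The logic behind the two endpoints can be written as plain functions that return a status code. The parameter names and the way state is passed in are assumptions; in a real service these would be route handlers reading shared server state.

```python
def healthz(process_ok=True):
    """Liveness: 200 unless the process is fundamentally broken."""
    return 200 if process_ok else 503

def readyz(model_loaded, queue_depth, ready_threshold):
    """Readiness: 200 only if the model is loaded AND the queue is shallow."""
    if model_loaded and queue_depth < ready_threshold:
        return 200
    return 503

print(healthz())                                                    # live
print(readyz(model_loaded=False, queue_depth=0, ready_threshold=32))   # still loading
print(readyz(model_loaded=True, queue_depth=40, ready_threshold=32))   # overloaded
print(readyz(model_loaded=True, queue_depth=5, ready_threshold=32))    # ready
```

Keeping queue depth out of `healthz` matters: an overloaded but otherwise healthy replica should stop receiving traffic, not get restarted.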

A useful formula: target queue depth¶

You want to choose MAX_INFLIGHT such that:

Throughput target is met: MAX_INFLIGHT >= target_rps * mean_service_time
Memory budget respected: MAX_INFLIGHT * mem_per_request <= available_memory
Tail latency target met: choose MAX_INFLIGHT from a load test, not a formula

Lab 03 will sweep MAX_INFLIGHT ∈ {1, 2, 4, 8, 16, 32} and pick the elbow of the throughput-vs-latency curve. This is the engineering work that Little's law motivates but doesn't determine.
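The first two constraints bracket the search before any load test runs. A minimal sketch follows; the helper name and the memory figures (512 MiB per request, 8 GiB available) are made-up assumptions.

```python
import math

def max_inflight_bounds(target_rps, mean_service_time_s,
                        mem_per_request_bytes, available_memory_bytes):
    """Return (lower, upper) bounds on MAX_INFLIGHT.

    lower: throughput constraint, MAX_INFLIGHT >= target_rps * mean_service_time
    upper: memory constraint, MAX_INFLIGHT * mem_per_request <= available_memory
    """
    # round() guards against float noise pushing the ceiling up by one
    lower = math.ceil(round(target_rps * mean_service_time_s, 9))
    upper = available_memory_bytes // mem_per_request_bytes
    return lower, upper

MIB = 1024 ** 2
GIB = 1024 ** 3

lower, upper = max_inflight_bounds(50, 0.2, 512 * MIB, 8 * GIB)
candidates = [n for n in (1, 2, 4, 8, 16, 32) if lower <= n <= upper]
print(f"bounds: {lower}..{upper}, sweep candidates: {candidates}")
```

If `lower` exceeds `upper`, no single replica can meet the target and the answer is more replicas or a smaller memory footprint per request. The tail latency constraint still has to come from the sweep.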

A note on units¶

Be careful: \(W\) in Little's law is the end-to-end time-in-system (queue wait + service time), not just the model forward time. If you plug in the forward time alone, \(W\) comes out too small, you underestimate the concurrency you need, and you under-provision.

When the queue is empty: \(W \approx W_\text{service}\). When the queue has \(k\) requests ahead of you: \(W \approx W_\text{service} \cdot \lceil (k + 1) / B \rceil\), where \(B\) is the effective batch size and the ceiling counts whole batches, so \(k = 0\) recovers the empty-queue case. Continuous batching makes \(k\) shrink fast for short requests.
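The estimate can be turned into a small helper. This sketch counts whole batches with a ceiling so that an empty queue yields exactly \(W_\text{service}\), and it assumes every batch takes the same service time, which continuous batching does not guarantee; the function name is hypothetical.

```python
def estimated_time_in_system(service_time_s, requests_ahead, batch_size):
    """Rough W for a request with `requests_ahead` requests queued before it.

    The request finishes with the batch it lands in, so count how many
    batches of size `batch_size` are needed to cover it and everyone ahead.
    """
    batches = -(-(requests_ahead + 1) // batch_size)   # integer ceiling
    return service_time_s * batches

for k in (0, 3, 4, 20):
    w = estimated_time_in_system(0.2, requests_ahead=k, batch_size=4)
    print(f"k={k:>2} ahead, B=4 -> W is roughly {w:.1f} s")
```

This is the value of \(W\) to feed back into Little's law under load, rather than the bare forward-pass time.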

What this file does NOT cover¶

M/M/1 and M/M/c queue analysis with exponential distributions. The mean-only form of Little's law is enough.
Autoscaling — when do you add more replicas? Phase 34.
Burst handling — token bucket, leaky bucket. Phase 37.

Next: ../lab/00-minimal-fastapi.md