Break 00 — Serve without batching, watch throughput collapse

The fastest way to feel the value of batching is to remove it. This /break disables the BatchingScheduler and lets each request run its own matmul. At C=1 everything is fine; at C=8 the queue grows without bound and p95 spikes.

What you'll do

Disable the batching layer in the inference server so every request runs its own forward pass, end-to-end serialized on the NumPy BLAS thread. Drive the server with the lab 02 load generator and observe throughput collapse.

Step 1 — Locate the scheduler

src/miniserve/scheduler.py        # the BatchingScheduler class (Phase 33 lab 02)
src/miniserve/handlers.py         # the /correct handler that submits to it

Step 2 — Introduce the bug (≤ 5 lines)

In src/miniserve/handlers.py, change the submission path so the handler bypasses the scheduler and calls model.forward() directly:

# OLD
result = await scheduler.submit(request_id, prompt_ids)

# NEW (the broken version)
result = model.forward(prompt_ids)   # synchronous, no batch coalescing

This is the smallest possible diff. No imports change. The server still starts. The endpoint still answers.
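
To see why the one-line change serializes requests, the sketch below mimics it outside the server. It assumes a blocking `fake_forward` (a `time.sleep` stand-in for `model.forward()`, with an invented 50 ms cost) and a hypothetical `broken_handler`; neither is miniserve code. Because the handler never awaits anything, the event loop runs the calls one after another, so wall time should grow roughly in proportion to concurrency.

```python
import asyncio
import time

FORWARD_SECONDS = 0.05  # assumed cost of one forward pass (invented value)


def fake_forward(prompt_ids):
    """Stand-in for model.forward(): blocks the calling thread."""
    time.sleep(FORWARD_SECONDS)
    return [t + 1 for t in prompt_ids]


async def broken_handler(prompt_ids):
    # Mirrors the NEW line: a synchronous call inside an async handler.
    # Nothing is awaited, so the event loop cannot interleave requests.
    return fake_forward(prompt_ids)


async def run_concurrent(handler, concurrency):
    """Fire `concurrency` requests at once and time the whole batch."""
    start = time.perf_counter()
    results = await asyncio.gather(*(handler([i]) for i in range(concurrency)))
    return results, time.perf_counter() - start


def measure(concurrency):
    return asyncio.run(run_concurrent(broken_handler, concurrency))


if __name__ == "__main__":
    for c in (1, 2, 4, 8):
        _, elapsed = measure(c)
        print(f"C={c}: wall time {elapsed * 1000:.0f} ms")
```

The exact numbers depend on the machine, but each doubling of C should roughly double the total wall time, which is the same shape the load test exposes.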

Step 3 — Record the break

learners/borja/phase-33/notes/breaks.md:

- bug-id: 33-01
  concept: continuous batching
  symptom: p50 doubles at C=2, quadruples at C=4, server queue OOMs at C=16.
  hidden_cause: handlers.py bypasses scheduler.submit(); each request serializes
                on the NumPy BLAS thread, no batch coalescing happens.
  hint_1: "Plot p50 vs concurrency. What shape do you see? Linear? Sublinear?"
  hint_2: "Add a log line in BatchingScheduler.submit(). Does it print under load?"
  hint_3: "grep for 'model.forward' in src/miniserve/. Should that call appear there?"
  fix_diff: revert handlers.py — submit to scheduler instead of calling forward directly.
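
The idea behind hint_2 can be illustrated with a toy scheduler. `ToyScheduler`, `routed_handler`, and `bypassing_handler` below are hypothetical stand-ins, not the lab's BatchingScheduler or handlers; the sketch only shows that a log line (and a counter) inside `submit()` stays silent when the handler never calls it.

```python
import asyncio
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("toy.scheduler")


class ToyScheduler:
    """Hypothetical stand-in; not the lab's BatchingScheduler."""

    def __init__(self):
        self.submit_calls = 0

    async def submit(self, request_id, prompt_ids):
        self.submit_calls += 1
        # The log line hint_2 asks for: it fires once per routed request.
        log.info("submit request_id=%s n_tokens=%d", request_id, len(prompt_ids))
        await asyncio.sleep(0)  # yield to the event loop
        return list(prompt_ids)


async def routed_handler(scheduler, request_id, prompt_ids):
    # Healthy path: the handler goes through the scheduler.
    return await scheduler.submit(request_id, prompt_ids)


async def bypassing_handler(scheduler, request_id, prompt_ids):
    # Broken path: the scheduler exists but is never called.
    return list(prompt_ids)


async def drive(handler, n_requests):
    scheduler = ToyScheduler()
    await asyncio.gather(
        *(handler(scheduler, i, [i, i + 1]) for i in range(n_requests))
    )
    return scheduler.submit_calls


def count_submits(handler, n_requests=8):
    return asyncio.run(drive(handler, n_requests))
```

With the routed handler the counter matches the number of requests; with the bypassing handler it stays at zero and nothing is logged, which is the signal the hint is fishing for.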

Step 4 — Verify it's observable

Run lab 02's loadgen.py with --concurrency 8 --duration 60s. Expected output:

p50:  1180 ms   (target: ≤ 250 ms)
p95:  3400 ms   (target: ≤ 600 ms)
rps:   6.2      (target: ≥ 30 rps)
errors: 12% timeout

The lab 02 tests in tests/phase33/test_throughput.py will go red.
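
A minimal sketch of how the numbers above could be compared against the targets, assuming you have raw per-request latencies in milliseconds. The function names and the nearest-rank percentile method are assumptions for illustration; loadgen.py and test_throughput.py may compute these differently.

```python
import math

# Targets copied from the expected-output block above.
TARGETS = {"p50_ms": 250.0, "p95_ms": 600.0, "rps": 30.0}


def percentile(samples, pct):
    """Nearest-rank percentile; `samples` must be non-empty."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms, duration_s):
    """Reduce raw latencies to the three headline metrics."""
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "rps": len(latencies_ms) / duration_s,
    }


def failed_targets(summary, targets=TARGETS):
    """Return the names of the metrics that miss their target."""
    failures = []
    if summary["p50_ms"] > targets["p50_ms"]:
        failures.append("p50_ms")
    if summary["p95_ms"] > targets["p95_ms"]:
        failures.append("p95_ms")
    if summary["rps"] < targets["rps"]:
        failures.append("rps")
    return failures
```

Feeding in the broken run's figures (p50 1180 ms, p95 3400 ms, 6.2 rps) reports all three metrics as failing.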

Step 5 — The teaching moment

The metric to stare at is p50 vs concurrency. With batching, the curve is sublinear, roughly logarithmic: one BLAS matmul amortizes work across the requests in a batch. Without batching, it is linear: each request serializes on the CPU. Borja should be able to tell which one is in the code from the curve shape alone, before reading handlers.py.
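
One way to read the curve shape numerically is the slope of log(p50) against log(concurrency): a slope near 1 means linear growth, and a clearly smaller slope means sublinear growth. The sketch below assumes that approach; the 0.8 cutoff and the sample data are invented for illustration and are not measurements from the lab.

```python
import math


def loglog_slope(concurrency, p50_ms):
    """Least-squares slope of log(p50) against log(concurrency)."""
    xs = [math.log(c) for c in concurrency]
    ys = [math.log(p) for p in p50_ms]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def curve_shape(concurrency, p50_ms, threshold=0.8):
    """Label the curve; `threshold` is an arbitrary illustrative cutoff."""
    slope = loglog_slope(concurrency, p50_ms)
    return "linear" if slope >= threshold else "sublinear"


if __name__ == "__main__":
    levels = [1, 2, 4, 8]
    # Synthetic data: p50 doubles with each doubling of concurrency.
    unbatched = [150, 300, 600, 1200]
    # Synthetic data: p50 grows with log2(concurrency).
    batched = [150, 300, 450, 600]
    print("unbatched:", curve_shape(levels, unbatched))
    print("batched:  ", curve_shape(levels, batched))
```

Plotting the points is still the better first step; the slope is just a compact way to confirm what the chart already suggests.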

If Borja gets stuck, the hint cascade walks from "look at the data" → "look at the scheduler logs" → "look at the call site". The fix is a one-line revert; the lesson is that a "scheduler" box in your architecture diagram does nothing unless the handlers actually route through it.

Hard rules respected

Single bug only.
Reversible in 1 line.
Observable from the load test (failing assertion + visibly bad chart).
No security implications.
Tests are not modified.

Next: read ../theory/02-static-vs-continuous-batching.md once the test is green again.