Lab 03: Annotated Read of vLLM's `block_manager.py`

Goal: read the actual PagedAttention KV-cache allocator from upstream vLLM and produce an annotated commentary in your own words.

Estimated time: 4–6 hours (mostly reading + writing notes; no coding).

Prerequisite: you have read theory 03 and can describe PagedAttention without notes.

What you produce

docs/phase-27-modern-attention/notes/vllm-block-manager-annotated.md containing:

A paraphrase of vllm/core/block_manager.py, either in full or of the relevant subset. Do not paste the file verbatim (license; also, paraphrasing is the work).
Section-by-section commentary in your own words.
Answers to the five questions in Step 4.

experiments/27-paged-attn-reading/manifest.json recording: the vLLM version you read, the commit SHA, the date.

A short README.md in the experiments directory pointing to the notes.
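
As a sketch of what the manifest might hold, the snippet below builds and writes a minimal manifest.json. The lab only requires that the version, commit SHA, and date are recorded; the key names `vllm_version`, `commit_sha`, and `date_read`, and the helper names, are assumptions made for illustration.

```python
import json
from datetime import date
from pathlib import Path


def build_manifest(version: str, commit_sha: str, read_on: date) -> dict:
    """Collect the three facts the lab asks for (key names are illustrative)."""
    # Insist on the full SHA-1 so the pinned commit is unambiguous.
    if len(commit_sha) != 40 or any(c not in "0123456789abcdef" for c in commit_sha):
        raise ValueError("expected a full 40-character lowercase hex SHA")
    return {
        "vllm_version": version,
        "commit_sha": commit_sha,
        "date_read": read_on.isoformat(),
    }


def write_manifest(manifest: dict, directory: Path) -> Path:
    """Write manifest.json into the given directory and return its path."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / "manifest.json"
    path.write_text(json.dumps(manifest, indent=2) + "\n", encoding="utf-8")
    return path


if __name__ == "__main__":
    # Placeholder values: substitute the release and SHA you actually pinned.
    example = build_manifest("<release tag>", "0" * 40, date.today())
    print(json.dumps(example, indent=2))
```

Calling `write_manifest(example, Path("experiments/27-paged-attn-reading"))` would place the file where the lab expects it.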

Procedure

Step 1 — find the file

Clone vLLM at a specific commit (record the SHA). Don't read main; pin to a release.
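Locate vllm/core/block_manager.py and vllm/block.py. File locations have moved between vLLM releases, so if either path is missing at your pinned commit, search the tree for the block manager and note the path you actually read.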

Step 2 — read for structure (1 hour)

Identify the main classes. Typically: BlockTable, BlockAllocator, BlockManager. Exact names vary by release, so record the ones that appear at your pinned commit.
For each class: what state does it hold? What's the public API (methods called from outside)?
Draw a sequence diagram (text-based mermaid in your notes file) showing the lifecycle of a request from add_request to free_request.

Step 3 — read for semantics (2–3 hours)

For each public method, annotate:

What it does in one sentence.
What invariants it preserves (e.g., "the page table never points to a freed page").
What could go wrong and how the code handles it.
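
One way to make an invariant such as "the page table never points to a freed page" concrete is to write it as an executable check against a toy model, then ask whether the real code could ever violate it. The sketch below is not vLLM's code: `ToyState`, its fields, and `check_invariants` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyState:
    """Hypothetical allocator state; not vLLM's data model."""
    num_blocks: int
    free_blocks: set[int] = field(default_factory=set)
    ref_counts: dict[int, int] = field(default_factory=dict)
    block_tables: dict[str, list[int]] = field(default_factory=dict)


def check_invariants(state: ToyState) -> list[str]:
    """Return a description of every violated invariant (empty list means healthy)."""
    problems = []
    uses: dict[int, int] = {}
    # Invariant 1: no block table entry refers to a block on the free set.
    for seq_id, table in state.block_tables.items():
        for block in table:
            uses[block] = uses.get(block, 0) + 1
            if block in state.free_blocks:
                problems.append(f"{seq_id} points to freed block {block}")
    for block in range(state.num_blocks):
        expected = uses.get(block, 0)
        actual = state.ref_counts.get(block, 0)
        # Invariant 2: the ref count equals the number of table entries.
        if expected != actual:
            problems.append(
                f"block {block}: ref count {actual}, but {expected} table entries"
            )
        # Invariant 3: an unreferenced block must be available for reuse.
        if expected == 0 and block not in state.free_blocks:
            problems.append(f"block {block} is leaked: unused but not free")
    return problems
```

When annotating a method, it can help to state which of these checks the method could break partway through and which line restores it.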

Step 4 — answer the five questions

Answer each question in your notes file, citing specific functions and line numbers at your pinned commit:

Page allocation algorithm. Is it free-list-based, bitmap-based, or something else? Why is the chosen structure appropriate for the access pattern?
Eviction policy. When out of pages, which sequences get evicted? LRU? Priority-based? What happens to a sequence whose page is evicted mid-generation?
Copy-on-write trigger. When does it happen? What's the actual mechanism — does it copy the page eagerly or lazily?
Page table representation. Per-sequence list of physical block IDs? A more clever structure? How is the lookup latency kept low?
Interface to the attention kernel. How does the GPU kernel actually access K/V given the page table? (You don't need to dig into the CUDA kernel — just the Python-side handoff.)
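
To have something concrete to compare the real code against, here is a toy paged cache. Every choice in it is an assumption made for illustration: a Python list used as a stack for the free list, per-block reference counts for sharing, copy-on-write that fires when a token is appended to a shared, partially filled block, and a flat slot index computed as `block * block_size + offset`. The class and method names are hypothetical. Whether vLLM at your pinned commit makes the same choices is what the questions ask you to establish.

```python
from typing import Optional


class ToyPagedCache:
    """Toy model of a paged KV cache; names and policies are illustrative."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free = list(range(num_blocks))     # free list used as a stack
        self.refs = [0] * num_blocks            # ref count per physical block
        self.tables: dict[str, list[int]] = {}  # seq id -> physical block ids
        self.lengths: dict[str, int] = {}       # seq id -> tokens stored

    def _take(self) -> int:
        if not self.free:
            raise MemoryError("out of blocks (a real system would preempt or swap)")
        block = self.free.pop()
        self.refs[block] = 1
        return block

    def _release(self, block: int) -> None:
        self.refs[block] -= 1
        if self.refs[block] == 0:
            self.free.append(block)

    def add(self, seq: str) -> None:
        self.tables[seq], self.lengths[seq] = [], 0

    def fork(self, parent: str, child: str) -> None:
        """Share every block of the parent with the child; no data is copied."""
        self.tables[child] = list(self.tables[parent])
        self.lengths[child] = self.lengths[parent]
        for block in self.tables[child]:
            self.refs[block] += 1

    def append_token(self, seq: str) -> tuple[int, Optional[tuple[int, int]]]:
        """Reserve a slot for one token.

        Returns (slot, copy), where copy is (src_block, dst_block) if
        copy-on-write fired and None otherwise. The toy only reports the
        copy; moving the data is left to the caller.
        """
        table, pos = self.tables[seq], self.lengths[seq]
        copy = None
        if pos % self.block_size == 0:
            table.append(self._take())          # last block full: new block
        elif self.refs[table[-1]] > 1:
            src, dst = table[-1], self._take()  # shared partial block: COW
            self._release(src)
            table[-1], copy = dst, (src, dst)
        self.lengths[seq] = pos + 1
        return self.slot(seq, pos), copy

    def slot(self, seq: str, pos: int) -> int:
        """Map a logical token position to a flat physical slot index."""
        block = self.tables[seq][pos // self.block_size]
        return block * self.block_size + pos % self.block_size

    def free_seq(self, seq: str) -> None:
        for block in self.tables.pop(seq):
            self._release(block)
        del self.lengths[seq]
```

In this toy, forking a sequence whose last block is half full and then appending to the child takes the copy path, while appending to the parent afterwards does not, because the shared block's reference count has dropped back to one. It has no eviction policy at all; it simply raises when the free list is empty.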

Step 5 — connect back to theory 03

In a closing paragraph:

Confirm or correct each claim from theory 03 against the actual code.
Note one detail in the implementation that theory 03 didn't mention (there will be some).

Stop conditions

The annotated file contains at least 1,500 words of your own prose (pasted code does not count).
All five questions have answers with line-number citations.
The connection-back paragraph identifies at least one theory-03 simplification.

Constraints

Don't copy-paste vLLM code verbatim. Paraphrase: rewrite logic in your own words. Code snippets ≤ 5 lines are OK for illustration.
Cite the commit SHA you read. The vLLM API evolves; future readers need to know which version you were looking at.
No need to run vLLM. This is a reading + writing lab. (If you have time and a cloud GPU, running vLLM with a debugger to step through add_request is a great optional exercise.)
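
The word-count stop condition and the five-line snippet limit can be checked roughly by machine. The sketch below assumes the notes use fenced code blocks delimited by triple backticks or tildes; `audit_notes` is a hypothetical helper, and the count is approximate (heading markers and list bullets are counted as words).

```python
import re

# Matches an opening or closing code fence, allowing leading indentation.
FENCE = re.compile(r"^\s*(```|~~~)")


def audit_notes(markdown: str, max_snippet_lines: int = 5) -> tuple[int, list[int]]:
    """Return (prose word count, line counts of fenced blocks over the limit)."""
    prose_words = 0
    oversized = []
    in_fence = False
    fence_lines = 0
    for line in markdown.splitlines():
        if FENCE.match(line):
            # A fence line toggles the state; on closing, check the block length.
            if in_fence and fence_lines > max_snippet_lines:
                oversized.append(fence_lines)
            in_fence = not in_fence
            fence_lines = 0
        elif in_fence:
            fence_lines += 1
        else:
            prose_words += len(line.split())
    return prose_words, oversized
```

A mermaid diagram in a fenced block will be reported alongside code snippets, so treat the second return value as a prompt to look rather than a verdict.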

Pitfalls

Reading without taking notes. It will all blur together. Take notes as you go, not at the end.
Skipping the parts that are "just engineering". Those are often where the cleverness lives. (Page-table indirection, alignment, the swap-out logic for KV pages → CPU memory under pressure — all are non-trivial.)
Confusing PagedAttention (the kernel) with PagedAttention (the allocator). This lab is the latter. The kernel is in vllm/attention/ops/paged_attn.py; we mention it but don't read it line-by-line.

When to consult `solutions/`

The "solution" here is solutions/03-paged-attn-reading-ref.md (phase open): Claude's own annotated read of the same file at the same commit. Use it to compare structure and to check whether you missed anything important. Do not consult it before writing your own notes; the value of this lab is the act of reading.

Next lab: lab/04-mqa-gqa.md.
