Lab 03 — Annotated Read of vLLM's block_manager.py

Goal: read the actual PagedAttention KV-cache allocator from upstream vLLM and produce an annotated commentary in your own words.

Estimated time: 4–6 hours (mostly reading + writing notes; no coding).

Prerequisite: you have read theory 03 and can describe PagedAttention without notes.


What you produce

docs/phase-27-modern-attention/notes/vllm-block-manager-annotated.md containing:

  • A paraphrase of vllm/core/block_manager.py (the whole file or the relevant subset). Do not paste the file verbatim: the license matters, and paraphrasing is the work.
  • Section-by-section commentary in your own words.
  • Answers to the five questions below.

experiments/27-paged-attn-reading/manifest.json recording the vLLM version you read, the commit SHA, and the date you read it.
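One way to produce that manifest is sketched below. The lab does not prescribe a schema, so the key names (vllm_version, commit_sha, date_read) and both helper functions are assumptions made for illustration.

```python
import json
from datetime import date
from pathlib import Path
from typing import Optional


def build_manifest(vllm_version: str, commit_sha: str,
                   date_read: Optional[str] = None) -> dict:
    """Collect the three facts the lab asks for (key names are assumptions)."""
    sha = commit_sha.strip().lower()
    # Insist on the full 40-hex-digit SHA so the pin is unambiguous.
    if len(sha) != 40 or any(c not in "0123456789abcdef" for c in sha):
        raise ValueError("expected a full 40-character hex commit SHA")
    return {
        "vllm_version": vllm_version,
        "commit_sha": sha,
        "date_read": date_read or date.today().isoformat(),
    }


def write_manifest(path: Path, manifest: dict) -> None:
    """Write the manifest as pretty-printed JSON, creating parent directories."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(manifest, indent=2) + "\n", encoding="utf-8")
```

Call build_manifest with the release tag and the SHA you actually checked out (for example, the output of git rev-parse HEAD inside your clone), then pass the result to write_manifest together with Path("experiments/27-paged-attn-reading/manifest.json").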

A short README.md in the experiments directory pointing to the notes.

Procedure

Step 1 — find the file

  • Clone vLLM at a specific commit (record the SHA). Don't read main; pin to a release.
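  • Locate vllm/core/block_manager.py and vllm/block.py. Paths have moved between releases; if either file is missing at your commit, find its closest equivalent and record the path you actually read.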

Step 2 — read for structure (1 hour)

  • Identify the main classes. Typically: BlockTable, BlockAllocator, BlockManager (exact names vary by version, so record the ones present at your commit).
  • For each class: what state does it hold? What's the public API (methods called from outside)?
  • Draw a sequence diagram (text-based Mermaid in your notes file) showing the lifecycle of a request from add_request to free_request, or whatever the equivalent entry and exit points are called at your commit.

Step 3 — read for semantics (2–3 hours)

For each public method, annotate:

  • What it does in one sentence.
  • What invariants it preserves (e.g., "the page table never points to a freed page").
  • What could go wrong and how the code handles it.
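To make "invariant" concrete, here is a minimal sketch of a checker over a toy model of allocator state. It is not vLLM code: the function name, the dict-of-lists page tables, and the ref_counts mapping are all assumptions chosen to illustrate the kind of property worth writing down for each method.

```python
from collections import Counter


def check_invariants(page_tables, free_blocks, ref_counts):
    """Return a list of violations in a toy allocator state (empty means OK).

    page_tables: dict mapping sequence id -> list of physical block ids
    free_blocks: iterable of block ids currently on the free list
    ref_counts:  dict mapping block id -> number of tables expected to use it
    """
    violations = []
    free = set(free_blocks)

    # Invariant 1: the page table never points to a freed page.
    for seq_id, table in page_tables.items():
        for block in table:
            if block in free:
                violations.append(f"seq {seq_id} points to freed block {block}")

    # Invariant 2: each block's reference count equals its number of users.
    users = Counter(b for table in page_tables.values() for b in table)
    for block, n in sorted(users.items()):
        if ref_counts.get(block, 0) != n:
            violations.append(
                f"block {block}: ref_count {ref_counts.get(block, 0)} != {n} users"
            )
    return violations
```

While annotating a method, ask which of these checks would fail if the method were interrupted halfway through; that usually reveals the order in which the real code has to update its state.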

Step 4 — answer the five questions

Answer these in your notes file, citing specific line numbers or functions:

  1. Page allocation algorithm. Is it free-list-based, bitmap-based, or something else? Why is the chosen structure appropriate for the access pattern?
  2. Eviction policy. When out of pages, which sequences get evicted? LRU? Priority-based? What happens to a sequence whose page is evicted mid-generation?
  3. Copy-on-write trigger. When does it happen? What's the actual mechanism — does it copy the page eagerly or lazily?
  4. Page table representation. Per-sequence list of physical block IDs? A more clever structure? How is the lookup latency kept low?
  5. Interface to the attention kernel. How does the GPU kernel actually access K/V given the page table? (You don't need to dig into the CUDA kernel — just the Python-side handoff.)
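Before reading the real code, it can help to hold a toy model in mind for questions 1, 3, and 4. The sketch below is deliberately simplified and is not how vLLM is written: every name in it (ToyBlockAllocator, fork, write_last_block, locate) is hypothetical, and it assumes a free list, per-block reference counts, and a per-sequence list of block IDs. Part of your job is to find out where the real implementation differs.

```python
class ToyBlockAllocator:
    """Free-list allocator with reference counts (toy model, not vLLM code)."""

    def __init__(self, num_blocks: int):
        # Reversed so that pop() hands out block 0 first.
        self.free_list = list(range(num_blocks - 1, -1, -1))
        self.ref_count = [0] * num_blocks

    def allocate(self) -> int:
        if not self.free_list:
            raise MemoryError("no free blocks; a real system would preempt or swap")
        block = self.free_list.pop()  # O(1) allocation
        self.ref_count[block] = 1
        return block

    def share(self, block: int) -> int:
        self.ref_count[block] += 1
        return block

    def free(self, block: int) -> None:
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:
            self.free_list.append(block)  # O(1) release


def fork(alloc, table):
    """Child shares every block with the parent; nothing is copied yet."""
    return [alloc.share(b) for b in table]


def write_last_block(alloc, table):
    """Lazy copy-on-write: duplicate the last block only if it is shared.

    Returns (src, dst) when a copy is needed, else None.
    """
    src = table[-1]
    if alloc.ref_count[src] == 1:
        return None  # sole owner, safe to write in place
    dst = alloc.allocate()
    alloc.free(src)  # drop our reference; other owners keep the block alive
    table[-1] = dst
    return (src, dst)


def locate(table, token_pos, block_size):
    """Map a token position to (physical block id, offset within the block)."""
    return table[token_pos // block_size], token_pos % block_size
```

In this model a lookup is one integer division plus one list index, and a fork costs no memory until one of the sequences writes to a shared block. Check each of those properties against the code at your commit rather than assuming they hold.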

Step 5 — connect back to theory 03

In a closing paragraph:

  • Confirm or correct each claim from theory 03 against the actual code.
  • Note one detail in the implementation that theory 03 didn't mention (there will be some).

Stop conditions

  • The annotated file is ≥ 1500 words of your prose (not pasted code).
  • All five questions have answers with line-number citations.
  • The connection-back paragraph identifies at least one theory-03 simplification.
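The first stop condition, together with the five-line snippet limit under Constraints, can be checked mechanically. The following is a rough sketch, assuming your notes use triple-backtick fences; the function names are hypothetical, and the word count is approximate because headings and list markers are counted as prose.

```python
FENCE = "`" * 3  # a Markdown code fence marker


def split_prose_and_code(markdown: str):
    """Separate prose lines from fenced blocks, keeping each block's language tag."""
    prose, blocks = [], []
    lang, body = None, None
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCE):
            if body is None:  # opening fence
                lang, body = stripped[3:].strip().lower(), []
            else:  # closing fence
                blocks.append((lang, body))
                lang, body = None, None
        elif body is not None:
            body.append(line)
        else:
            prose.append(line)
    if body is not None:  # unterminated fence: still treat as code
        blocks.append((lang, body))
    return prose, blocks


def check_notes(markdown: str, min_words: int = 1500, max_snippet_lines: int = 5):
    """Report the prose word count and any code snippets over the line limit."""
    prose, blocks = split_prose_and_code(markdown)
    words = sum(len(line.split()) for line in prose)
    # Mermaid diagrams are requested by the lab, so they are not counted as snippets.
    long_snippets = [
        (lang, len(body))
        for lang, body in blocks
        if lang != "mermaid" and len(body) > max_snippet_lines
    ]
    return {
        "prose_words": words,
        "enough_prose": words >= min_words,
        "long_snippets": long_snippets,
    }
```

Run check_notes on the text of vllm-block-manager-annotated.md. An empty long_snippets list and enough_prose set to True cover the mechanical part of the stop conditions; the citations and the connection-back paragraph still need a human read.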

Constraints

  • Don't copy-paste vLLM code verbatim. Paraphrase: rewrite logic in your own words. Code snippets ≤ 5 lines are OK for illustration.
  • Cite the commit SHA you read. The vLLM API evolves; future readers need to know which version you were looking at.
  • No need to run vLLM. This is a reading + writing lab. (If you have time and a cloud GPU, running vLLM with a debugger to step through add_request is a great optional exercise.)

Pitfalls

  • Reading without taking notes. It will all blur together. Take notes as you go, not at the end.
  • Skipping the parts that are "just engineering". Those are often where the cleverness lives. (Page-table indirection, alignment, and the swap-out logic that moves KV pages to CPU memory under pressure are all non-trivial.)
  • Confusing PagedAttention (the kernel) with PagedAttention (the allocator). This lab covers the allocator. The kernel is in vllm/attention/ops/paged_attn.py; we mention it but don't read it line-by-line.

When to consult solutions/

The "solution" here is solutions/03-paged-attn-reading-ref.md (phase open): Claude's own annotated read of the same file at the same commit. Compare structure and check whether you missed anything important. Do not consult it before writing your own notes; the value of this lab is the act of reading.



Next lab: lab/04-mqa-gqa.md.