Lab 00 — Derive Cache Size on Paper
Goal: confirm the formula bytes = 2 · L · H · d_h · S · B · s has landed before touching code.
Estimated time: 60–90 minutes.
Prereq: theory/00..04.md read. Spec config for Phase-17 MiniGPT confirmed (re-check src/minimodel/README.md).
What you produce
A markdown file experiments/22-cache-sizing/derivations.md containing:
- Six worked derivations (one per row of the scaling table in theory/02-memory-cost.md).
- Two "reverse" derivations (given a memory budget, find \(S_\text{max}\)).
- One numbered "things I got wrong on the first try" section. (If empty, you didn't do enough; redo with attention.)
No code in this lab. Use a calculator if you need it. Cite each number from the spec config — Phase-17 README.md for MiniGPT, public configs for GPT-2 / Llama-2 / GPT-3.
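If your calculator of choice is a Python prompt, the formula from the goal statement can be written down as a small checker. This is a minimal sketch for cross-checking hand arithmetic, not part of the deliverable; the function name `kv_cache_bytes` and the toy config are hypothetical and do not correspond to any row of the lab.

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, batch, bytes_per_elem):
    """KV-cache size in bytes: 2 * L * H * d_h * S * B * s."""
    # The leading 2 counts one K tensor and one V tensor per layer.
    return 2 * n_layers * n_heads * head_dim * seq_len * batch * bytes_per_elem


# Hypothetical toy config (not a lab row): L=2, H=4, d_h=8, fp16 (s=2).
print(kv_cache_bytes(2, 4, 8, seq_len=1, batch=1, bytes_per_elem=2))     # 256
print(kv_cache_bytes(2, 4, 8, seq_len=1024, batch=1, bytes_per_elem=2))  # 262144
```

Setting `seq_len=1` and `batch=1` gives the per-token figure; the derivations themselves still belong on the page.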
TODOs
Block A — derive forward
For each model in the table below, compute:
| Model | \(L\) | \(H\) | \(d_h\) | dtype | Per-token | 4k ctx | 32k ctx |
|---|---|---|---|---|---|---|---|
| MiniGPT | (from Phase-17 config) | (from Phase-17 config) | (from Phase-17 config) | fp32 | __ | __ | __ |
| GPT-2 small | 12 | 12 | 64 | fp16 | __ | __ | __ |
| Llama-2-7B | 32 | 32 | 128 | fp16 | __ | __ | __ |
| Llama-2-13B | 40 | 40 | 128 | fp16 | __ | __ | __ |
| Llama-2-70B (MHA) | 80 | 64 | 128 | fp16 | __ | __ | __ |
| GPT-3 175B | 96 | 96 | 128 | fp16 | __ | __ | __ |
Fill in the per-token, 4k, and 32k columns. Show your arithmetic on the page, e.g.:
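One possible layout for the arithmetic, worked for the GPT-2 small row. It assumes \(B = 1\), \(s = 2\) bytes for fp16, and that "4k" and "32k" mean 4096 and 32768 tokens; adjust if your reading of theory/02-memory-cost.md differs.

\[
\begin{aligned}
\text{per-token} &= 2 \cdot L \cdot H \cdot d_h \cdot s = 2 \cdot 12 \cdot 12 \cdot 64 \cdot 2 = 36{,}864 \text{ bytes} = 36 \text{ KiB} \\
\text{4k ctx} &= 36 \text{ KiB} \cdot 4096 = 147{,}456 \text{ KiB} = 144 \text{ MiB} \\
\text{32k ctx} &= 36 \text{ KiB} \cdot 32768 = 1{,}179{,}648 \text{ KiB} = 1152 \text{ MiB} = 1.125 \text{ GiB}
\end{aligned}
\]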
Block B — reverse derive
- A100 40 GB, Llama-2-7B fp16, model weights 14 GiB, batch 1. What's \(S_\text{max}\)? What's the answer if batch = 8?
- H100 80 GB, GPT-3-175B fp16, model weights 350 GiB. Trick: weights don't fit on one GPU. How many H100s do you need just to hold weights at fp16? At int8?
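The reverse direction is the forward formula solved for \(S\): subtract the weights from the budget, then divide by the per-token cost times the batch. A minimal sketch for checking your hand results follows; `max_seq_len` and `gpus_for_weights` are hypothetical helper names, the numbers are a toy config rather than the lab's, and activations and framework overhead are ignored (an assumption the lab statement also appears to make).

```python
GIB = 2**30


def max_seq_len(budget_bytes, weight_bytes, n_layers, n_kv_heads, head_dim,
                batch, bytes_per_elem):
    """Largest S such that weights + KV cache fit in budget_bytes."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    free = budget_bytes - weight_bytes
    if free <= 0:
        return 0  # weights alone do not fit
    return free // (per_token * batch)  # floor: a partial token is no use


def gpus_for_weights(weight_bytes, per_gpu_bytes):
    """Ceiling division: devices needed just to hold the weights."""
    return -(-weight_bytes // per_gpu_bytes)


# Hypothetical toy setup: 24 GiB card, 8 GiB weights, L=16, H=16, d_h=64, fp16.
print(max_seq_len(24 * GIB, 8 * GIB, 16, 16, 64, batch=1, bytes_per_elem=2))  # 262144
print(max_seq_len(24 * GIB, 8 * GIB, 16, 16, 64, batch=4, bytes_per_elem=2))  # 65536
print(gpus_for_weights(100 * GIB, 24 * GIB))                                  # 5
```

Whether you treat a vendor's "40 GB" as \(40 \cdot 10^9\) or \(40 \cdot 2^{30}\) bytes changes the answer, so state the choice in your derivation.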
Block C — verify a GQA savings claim
A papers-with-code blog post claims "Mistral-7B saves 4× cache memory vs Llama-2-7B by using GQA with \(H_{KV} = 8\) instead of 32 heads for K, V". Verify:
- Same per-token formula but \(H \to H_{KV}\) for the cache: \(\Delta = 2 \cdot L \cdot H_{KV} \cdot d_h \cdot B \cdot s\).
- Ratio of Mistral-7B cache to Llama-2-7B cache, same context, same batch?
- Is the "4×" claim exactly right, an approximation, or wrong? Show the math.
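A ratio check of this kind can be sketched in a few lines. The two configs below are hypothetical toy models, not Mistral-7B or Llama-2-7B; substitute the values you cite from the public configs. Note what has to match between the two models for the ratio to collapse to a ratio of KV-head counts.

```python
def per_token_cache_bytes(n_layers, n_kv_heads, head_dim, batch, bytes_per_elem):
    """Per-token cache cost: 2 * L * H_KV * d_h * B * s."""
    return 2 * n_layers * n_kv_heads * head_dim * batch * bytes_per_elem


# Hypothetical toy pair: identical except for the number of KV heads.
mha = per_token_cache_bytes(n_layers=24, n_kv_heads=16, head_dim=64,
                            batch=1, bytes_per_elem=2)
gqa = per_token_cache_bytes(n_layers=24, n_kv_heads=2, head_dim=64,
                            batch=1, bytes_per_elem=2)

print(gqa / mha)  # 0.125, i.e. H_KV / H = 2 / 16
```

If \(L\), \(d_h\), or the cache dtype differ between the two models, those factors no longer cancel and the ratio is not simply \(H_{KV} / H\).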
Block D — pitfalls (one paragraph each)
Write a short paragraph on each pitfall, in your own words, with a counter-example or a number:
- The factor of 2. Why is it there? Where does it come from in theory/02?
- \(d\) vs \(H \cdot d_h\). When does the difference matter (hint: GQA)?
- fp32 vs fp16. Is the cache always the same dtype as the model? Why does HuggingFace let you mix?
- Pre-allocation waste. For Llama-2-7B with \(S_\text{max} = 32768\) but average actual length \(1024\), what fraction of cache memory is wasted? At batch 32, what's the wasted GiB?
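For the pre-allocation pitfall, the wasted fraction depends only on the two lengths, while the wasted GiB also scales with the per-token cost and the batch. A minimal sketch under that reading, with a hypothetical helper `prealloc_waste` and toy numbers rather than the Llama-2-7B ones:

```python
def prealloc_waste(per_token_bytes, s_max, s_actual, batch):
    """Return (wasted fraction, wasted GiB) for a cache pre-allocated to s_max."""
    allocated = per_token_bytes * s_max * batch
    used = per_token_bytes * s_actual * batch
    wasted = allocated - used
    return wasted / allocated, wasted / 2**30


# Hypothetical toy numbers: 64 KiB per token, S_max=8192, average length 2048, B=4.
print(prealloc_waste(64 * 1024, s_max=8192, s_actual=2048, batch=4))  # (0.75, 1.5)
```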
Stop conditions
Done when:
- All six rows of Block A filled with shown arithmetic.
- Both Block B reverse problems solved with arithmetic.
- Block C verified with a yes/no/why.
- Block D's four paragraphs written.
- The "things I got wrong on the first try" section has at least one entry (otherwise: redo more carefully).
- File committed at experiments/22-cache-sizing/derivations.md.
Pitfalls (read before debugging)
- Bytes vs bits. \(s = 2\) for fp16, not 16. The 16 is the bit count.
- GiB vs GB. "2 GiB" = \(2 \cdot 2^{30}\) bytes; "2 GB" = \(2 \cdot 10^9\). The difference (7%) matters when you compare to vendor specs (which use GB).
- Batch confusion. A "batch of 32 sequences" means \(B = 32\). Each has its own cache. Total cache \(\times 32\).
- GQA on the wrong axis. GQA shares K, V across query heads. The cache uses \(H_{KV}\), the attention compute uses \(H\). People confuse the two constantly.
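The first two pitfalls above are pure unit conversions, so they are easy to sanity-check. A small sketch, with hypothetical helper names:

```python
GB = 10**9    # decimal gigabyte
GIB = 2**30   # binary gibibyte


def bits_to_bytes(bits):
    """Element size s in bytes from a dtype's bit width (fp16 -> 2)."""
    return bits // 8


def gb_to_gib(gb):
    """Convert a decimal-GB figure to GiB."""
    return gb * GB / GIB


print(bits_to_bytes(16))               # 2
print(round(gb_to_gib(80), 2))         # 74.51
print(round((GIB / GB - 1) * 100, 1))  # 7.4 (percent gap between GiB and GB)
```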
When to consult solutions/
After all six stop conditions are met. The solution at solutions/00-derive-cache-size-ref.md (written at phase open) compares your numbers and flags any arithmetic error.
Next lab: lab/01-implement-cache.md.