Lab 00 — Derive Cache Size on Paper

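Goal: confirm the formula bytes = 2 · L · H · d_h · S · B · s has landed before touching code. Here L is the layer count, H the number of heads, d_h the per-head dimension, S the sequence length, B the batch size, and s the bytes per element.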

Estimated time: 60–90 minutes.

Prereq: theory/00..04.md read. Spec config for Phase-17 MiniGPT confirmed (re-check src/minimodel/README.md).


What you produce

A markdown file experiments/22-cache-sizing/derivations.md containing:

  • Six worked derivations (one per row of the scaling table in theory/02-memory-cost.md).
  • Two "reverse" derivations (given a memory budget, find \(S_\text{max}\)).
  • One numbered "things I got wrong on the first try" section. (If empty, you didn't do enough; redo with attention.)

No code in this lab. Use a calculator if you need it. Cite each number from the spec config — Phase-17 README.md for MiniGPT, public configs for GPT-2 / Llama-2 / GPT-3.

TODOs

Block A — derive forward

For each model in the table below, compute:

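| Model | \(L\) | \(H\) | \(d_h\) | dtype | Per-token | 4k ctx | 32k ctx |
|---|---|---|---|---|---|---|---|
| MiniGPT | (from Phase-17 config) | (from Phase-17 config) | (from Phase-17 config) | fp32 | __ | __ | __ |
| GPT-2 small | 12 | 12 | 64 | fp16 | __ | __ | __ |
| Llama-2-7B | 32 | 32 | 128 | fp16 | __ | __ | __ |
| Llama-2-13B | 40 | 40 | 128 | fp16 | __ | __ | __ |
| Llama-2-70B (MHA) | 80 | 64 | 128 | fp16 | __ | __ | __ |
| GPT-3 175B | 96 | 96 | 128 | fp16 | __ | __ | __ |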

Fill in the per-token, 4k (\(S = 4096\)), and 32k (\(S = 32768\)) columns. Show your arithmetic on the page, e.g.:

Llama-2-7B per-token (B = 1): 2 * 32 * 32 * 128 * 1 * 2 = 524288 bytes = 512 KiB. ✓
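Once the hand arithmetic is on the page, a few lines of Python can serve as an optional cross-check; it does not replace the written derivation. The sketch below simply transcribes the formula from the Goal line and reproduces the Llama-2-7B per-token example. The function name `kv_cache_bytes` and its parameter names are illustrative choices, not part of the course code.

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, batch, bytes_per_elem):
    """Transcription of bytes = 2 * L * H * d_h * S * B * s."""
    return 2 * n_layers * n_heads * head_dim * seq_len * batch * bytes_per_elem


# Llama-2-7B row: L = 32, H = 32, d_h = 128, fp16 (s = 2), one token, batch 1.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1, batch=1, bytes_per_elem=2)
print(per_token)         # 524288 bytes
print(per_token / 1024)  # 512.0 KiB
```

Changing `seq_len` gives the 4k and 32k columns; compare the output against your page only after you have filled the row by hand.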

Block B — reverse derive

  1. A100 40 GB, Llama-2-7B fp16, model weights 14 GiB, batch 1. What's \(S_\text{max}\)? What's the answer if batch = 8?
  2. H100 80 GB, GPT-3-175B fp16, model weights ≈ 350 GB (175B parameters × 2 bytes). Trick: weights don't fit on one GPU. How many H100s do you need just to hold weights at fp16? At int8?
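For checking Block B after solving it on paper, the reverse derivation can be sketched as two small helpers. The numbers below are deliberately not the lab's: they assume a made-up 24 GiB device with 14 GiB of weights, and they ignore activations, framework overhead, and fragmentation, so treat the result as an upper bound. The helper names are hypothetical.

```python
import math

GIB = 2**30


def max_context(free_bytes, per_token_bytes, batch):
    """Largest S such that per_token_bytes * S * batch fits in free_bytes."""
    return free_bytes // (per_token_bytes * batch)


def gpus_for_weights(weight_bytes, gpu_bytes):
    """Minimum device count needed just to hold the weights."""
    return math.ceil(weight_bytes / gpu_bytes)


per_token = 524288                 # Llama-2-7B fp16 per-token figure from Block A
free = 24 * GIB - 14 * GIB         # illustrative budget: 10 GiB left for the cache

print(max_context(free, per_token, batch=1))   # 20480
print(max_context(free, per_token, batch=8))   # 2560
print(gpus_for_weights(100 * GIB, 24 * GIB))   # 5
```

Note that both arguments to each helper must be in the same unit system; converting a vendor "GB" figure to bytes with \(10^9\) rather than \(2^{30}\) changes the answer.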

Block C — verify a GQA savings claim

A papers-with-code blog post claims "Mistral-7B saves 4× cache memory vs Llama-2-7B by using GQA with \(H_{KV} = 8\) instead of 32 heads for K, V". Verify:

  1. Same per-token formula but \(H \to H_{KV}\) for the cache: \(\Delta = 2 \cdot L \cdot H_{KV} \cdot d_h \cdot B \cdot s\).
  2. Ratio of Mistral-7B cache to Llama-2-7B cache, same context, same batch?
  3. Is the "4×" claim exactly right, an approximation, or wrong? Show the math.
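A quick way to sanity-check a GQA savings claim after doing the math by hand is to evaluate the per-token formula twice, once with \(H\) and once with \(H_{KV}\), and take the ratio. The sketch below uses an invented configuration (24 layers, 16 query heads, 2 KV heads, \(d_h = 64\)) rather than the Mistral or Llama numbers, so it does not answer the exercise; the function name is an assumption.

```python
def per_token_cache_bytes(n_layers, n_kv_heads, head_dim, bytes_per_elem, batch=1):
    """Per-token cache size: 2 * L * H_KV * d_h * B * s."""
    return 2 * n_layers * n_kv_heads * head_dim * batch * bytes_per_elem


# Invented model: same L, d_h, and dtype; only the number of K/V heads differs.
mha = per_token_cache_bytes(24, n_kv_heads=16, head_dim=64, bytes_per_elem=2)
gqa = per_token_cache_bytes(24, n_kv_heads=2, head_dim=64, bytes_per_elem=2)

print(mha)        # 98304
print(gqa)        # 12288
print(gqa / mha)  # 0.125
```

The ratio depends only on the factors that differ between the two models, which is worth checking explicitly before accepting a headline number.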

Block D — pitfalls (one paragraph each)

Write a short paragraph on each pitfall, in your own words, with a counter-example or a number:

  1. The factor of 2. Why is it there? Where does it come from in theory/02?
  2. \(d\) vs \(H \cdot d_h\). When does the difference matter (hint: GQA)?
  3. fp32 vs fp16. Is the cache always the same dtype as the model? Why does HuggingFace let you mix?
  4. Pre-allocation waste. For Llama-2-7B with \(S_\text{max} = 32768\) but average actual length \(1024\), what fraction of cache memory is wasted? At batch 32, what's the wasted GiB?
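For the pre-allocation pitfall, the waste calculation can be sketched as below, as a cross-check once your paragraph has its own number. The example uses smaller illustrative values (\(S_\text{max} = 4096\), average length 1024, batch 4) than the ones in the question, and it assumes every sequence in the batch reserves a full \(S_\text{max}\) slot up front. The function name is hypothetical.

```python
GIB = 2**30


def preallocation_waste(per_token_bytes, s_max, avg_len, batch):
    """Return (wasted fraction, wasted GiB) when each sequence reserves s_max tokens."""
    allocated = per_token_bytes * s_max * batch
    used = per_token_bytes * avg_len * batch
    wasted = allocated - used
    return wasted / allocated, wasted / GIB


fraction, wasted_gib = preallocation_waste(524288, s_max=4096, avg_len=1024, batch=4)
print(fraction)    # 0.75
print(wasted_gib)  # 6.0
```

The wasted fraction does not depend on the batch size or on the per-token size; only the absolute GiB figure does.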

Stop conditions

Done when:

  1. All six rows of Block A filled with shown arithmetic.
  2. Both Block B reverse problems solved with arithmetic.
  3. Block C verified with a yes/no/why.
  4. Block D's four paragraphs written.
  5. The "things I got wrong on the first try" section has at least one entry (otherwise: redo more carefully).
  6. File committed at experiments/22-cache-sizing/derivations.md.

Pitfalls (read before debugging)

  • Bytes vs bits. \(s = 2\) for fp16, not 16. The 16 is the bit count.
  • GiB vs GB. "2 GiB" = \(2 \cdot 2^{30}\) bytes; "2 GB" = \(2 \cdot 10^9\). The difference (7%) matters when you compare to vendor specs (which use GB).
  • Batch confusion. A "batch of 32 sequences" means \(B = 32\). Each sequence has its own cache, so the total cache is \(32\times\) the single-sequence figure.
  • GQA on the wrong axis. GQA shares K, V across query heads. The cache uses \(H_{KV}\), the attention compute uses \(H\). People confuse the two constantly.
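The unit pitfalls above are easy to check numerically. The sketch below shows the bits-to-bytes step and the GiB/GB gap; the 24 GB device is an arbitrary example, and the constant and function names are illustrative.

```python
GB = 10**9    # decimal gigabyte, as used on vendor spec sheets
GIB = 2**30   # binary gibibyte


def bytes_per_element(bits):
    """Convert a dtype's bit width to the s used in the formula."""
    return bits // 8


def gb_to_gib(gb):
    """Convert a decimal-GB figure to GiB."""
    return gb * GB / GIB


print(bytes_per_element(16))     # 2  (fp16)
print(bytes_per_element(32))     # 4  (fp32)
print(GIB / GB)                  # 1.073741824, i.e. roughly a 7% gap
print(round(gb_to_gib(24), 2))   # 22.35
```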

When to consult solutions/

After all six stop conditions are met. The reference solution at solutions/00-derive-cache-size-ref.md (written at phase open) lets you compare your numbers and catch any arithmetic errors.


Next lab: lab/01-implement-cache.md.