
Lab 00 — Derive Cache Size on Paper

Goal: confirm the formula bytes = 2 · L · H · d_h · S · B · s has landed before touching code.
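As a reading aid, the formula can be written in display form with a legend. The symbol meanings below are the conventional ones for attention-cache sizing and are a reader's summary, not a quotation; the authoritative definitions are in theory/02-memory-cost.md.

\[
\text{bytes} = 2 \cdot L \cdot H \cdot d_h \cdot S \cdot B \cdot s
\]

Here \(L\) is the number of layers, \(H\) the number of attention heads whose keys and values are cached, \(d_h\) the per-head dimension, \(S\) the sequence length in tokens, \(B\) the batch size, and \(s\) the number of bytes per stored element (2 for fp16, 4 for fp32). The leading factor of 2 is left unexplained here on purpose, since Block D asks you to justify it.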

Estimated time: 60–90 minutes.

Prereq: theory/00..04.md read. Spec config for Phase-17 MiniGPT confirmed (re-check src/minimodel/README.md).

What you produce

A markdown file experiments/22-cache-sizing/derivations.md containing:

Six worked derivations (one per row of the scaling table in theory/02-memory-cost.md).
Two "reverse" derivations (given a memory budget, find \(S_\text{max}\)).
One numbered "things I got wrong on the first try" section. (If empty, you didn't do enough; redo with attention.)

No code in this lab. Use a calculator if you need it. Cite each number from the spec config — Phase-17 README.md for MiniGPT, public configs for GPT-2 / Llama-2 / GPT-3.

TODOs

Block A — derive forward

For each model in the table below, compute the cache size per token, at a 4k context, and at a 32k context:

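| Model | \(L\) | \(H\) | \(d_h\) | dtype | Per-token | 4k ctx | 32k ctx |
|---|---|---|---|---|---|---|---|
| MiniGPT | (from Phase-17 config) | | | fp32 | __ | __ | __ |
| GPT-2 small | 12 | 12 | 64 | fp16 | __ | __ | __ |
| Llama-2-7B | 32 | 32 | 128 | fp16 | __ | __ | __ |
| Llama-2-13B | 40 | 40 | 128 | fp16 | __ | __ | __ |
| Llama-2-70B (MHA) | 80 | 64 | 128 | fp16 | __ | __ | __ |
| GPT-3 175B | 96 | 96 | 128 | fp16 | __ | __ | __ |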

Fill in the per-token, 4k, and 32k columns. Show your arithmetic on the page, e.g.:

Llama-2-7B per-token (S = 1, B = 1): 2 * 32 * 32 * 128 * 1 * 1 * 2 = 524288 bytes = 512 KiB. ✓
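The lab itself is pen-and-paper, but once a row is derived by hand, a few lines of Python can serve as an independent check. The sketch below is only an illustration: the function name `kv_cache_bytes` and its parameter names are hypothetical, and it simply evaluates the formula from the goal statement, reproducing the Llama-2-7B per-token example.

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len=1, batch=1, bytes_per_elem=2):
    """Evaluate bytes = 2 * L * H * d_h * S * B * s.

    Hypothetical helper for checking hand arithmetic; the defaults
    (one token, batch 1, fp16) give the per-token figure.
    """
    return 2 * n_layers * n_heads * head_dim * seq_len * batch * bytes_per_elem


# Llama-2-7B, fp16, one token, batch 1 (the worked example in Block A)
per_token = kv_cache_bytes(n_layers=32, n_heads=32, head_dim=128)
print(per_token, per_token / 2**10)  # 524288 bytes, 512.0 KiB
```

Run it only after writing the arithmetic on the page; a mismatch is a candidate entry for the "things I got wrong on the first try" section.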

Block B — reverse derive

A100 40 GB, Llama-2-7B fp16, model weights 14 GiB, batch 1. What's \(S_\text{max}\)? What's the answer if batch = 8?
H100 80 GB, GPT-3-175B fp16, model weights 350 GB (175 × 10^9 parameters × 2 bytes). Trick: the weights don't fit on one GPU. How many H100s do you need just to hold the weights at fp16? At int8?
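Both reverse problems come down to inverting the forward formula: subtract the weights from the memory budget, then divide what remains by the per-token cache cost. The sketch below shows that shape on deliberately toy numbers, so it does not give away the lab answers. The function names are hypothetical, and it assumes that weights and cache are the only memory consumers (no activations, no framework overhead), which a real deployment would not satisfy.

```python
import math


def max_context_tokens(budget_bytes, weight_bytes, n_layers, n_heads, head_dim,
                       batch=1, bytes_per_elem=2):
    """Largest S such that weights + cache fit in the budget (toy model of memory)."""
    per_token = 2 * n_layers * n_heads * head_dim * batch * bytes_per_elem
    free = budget_bytes - weight_bytes
    if free <= 0:
        return 0  # the weights alone already exceed the budget
    return free // per_token  # floor: a partial token cannot be cached


def gpus_for_weights(weight_bytes, gpu_bytes):
    """Minimum number of devices needed just to hold the weights."""
    return math.ceil(weight_bytes / gpu_bytes)


# Toy numbers, not a real model: 2 MiB budget, 1 MiB of weights, fp16
print(max_context_tokens(2 * 2**20, 1 * 2**20, n_layers=2, n_heads=4, head_dim=8))  # 4096
print(gpus_for_weights(10 * 10**9, 4 * 10**9))  # 3
```

Note the unit handling: the toy budget is in binary units (MiB) and the toy GPU size in decimal units (GB). Convert everything to plain bytes before dividing.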

Block C — verify a GQA savings claim

A papers-with-code blog post claims "Mistral-7B saves 4× cache memory vs Llama-2-7B by using GQA with \(H_{KV} = 8\) instead of 32 heads for K, V". Verify:

Use the same per-token formula, but replace \(H\) with \(H_{KV}\) for the cache: \(\Delta = 2 \cdot L \cdot H_{KV} \cdot d_h \cdot B \cdot s\).
What is the ratio of the Mistral-7B cache to the Llama-2-7B cache at the same context length and the same batch size?
Is the "4×" claim exactly right, an approximation, or wrong? Show the math.
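One way to set up the verification is to write the ratio in full before cancelling anything. The superscripts \((M)\) for Mistral-7B and \((L)\) for Llama-2-7B are notation introduced here for illustration, and the setup assumes both caches use the same dtype:

\[
\frac{\text{cache}^{(M)}}{\text{cache}^{(L)}}
= \frac{2 \cdot L^{(M)} \cdot H_{KV}^{(M)} \cdot d_h^{(M)} \cdot S \cdot B \cdot s}
       {2 \cdot L^{(L)} \cdot H^{(L)} \cdot d_h^{(L)} \cdot S \cdot B \cdot s}
\]

The factors \(2\), \(S\), \(B\), and \(s\) cancel by assumption. Whether the layer counts and head dimensions also cancel depends on the two public configs, so look those up and cite them before simplifying further.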

Block D — pitfalls (one paragraph each)

Write a short paragraph on each pitfall, in your own words, with a counter-example or a number:

The factor of 2. Why is it there? Where does it come from in theory/02?
\(d\) vs \(H \cdot d_h\). When does the difference matter (hint: GQA)?
fp32 vs fp16. Is the cache always the same dtype as the model? Why does Hugging Face let you mix?
Pre-allocation waste. For Llama-2-7B with \(S_\text{max} = 32768\) but average actual length \(1024\), what fraction of cache memory is wasted? At batch 32, what's the wasted GiB?
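The pre-allocation question has a simple shape: the allocator reserves cache for \(S_\text{max}\) tokens per sequence, while only the average actual length is used. A minimal sketch with toy numbers (not the Llama-2-7B values the lab asks for) could look like this; the function name and parameters are hypothetical, and it assumes every sequence in the batch has the same average length.

```python
def preallocation_waste(per_token_bytes, s_max, s_avg, batch=1):
    """Return (wasted fraction, wasted bytes) for a cache pre-allocated to s_max.

    Assumes each of the `batch` sequences actually uses s_avg tokens.
    """
    allocated = per_token_bytes * s_max * batch
    used = per_token_bytes * s_avg * batch
    wasted = allocated - used
    return wasted / allocated, wasted


# Toy numbers: 256 bytes per token, 4096 slots reserved, 1024 used, batch 2
fraction, wasted_bytes = preallocation_waste(256, s_max=4096, s_avg=1024, batch=2)
print(fraction, wasted_bytes / 2**20)  # 0.75, 1.5 MiB
```

The wasted fraction does not depend on the per-token size or the batch size, while the wasted byte count scales with both. That is why the lab question asks for the two quantities separately.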

Stop conditions

Done when:

All six rows of Block A filled with shown arithmetic.
Both Block B reverse problems solved with arithmetic.
Block C verified with a yes/no/why.
Block D's four paragraphs written.
The "things I got wrong on the first try" section has at least one entry (otherwise: redo more carefully).
File committed at experiments/22-cache-sizing/derivations.md.

Pitfalls (read before debugging)

Bytes vs bits. \(s = 2\) for fp16, not 16. The 16 is the bit count.
GiB vs GB. "2 GiB" = \(2 \cdot 2^{30}\) bytes; "2 GB" = \(2 \cdot 10^9\). The difference (7%) matters when you compare to vendor specs (which use GB).
Batch confusion. A "batch of 32 sequences" means \(B = 32\). Each has its own cache. Total cache \(\times 32\).
GQA on the wrong axis. GQA shares K, V across query heads. The cache uses \(H_{KV}\), the attention compute uses \(H\). People confuse the two constantly.
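The two unit pitfalls are easy to pin down with a few constants. The snippet below is a small illustrative sketch for checking finished hand calculations; the names `GB`, `GIB`, `BYTES_PER_ELEM`, `to_gb`, and `to_gib` are hypothetical helpers, not part of any course code.

```python
GB = 10**9    # decimal gigabyte, the unit vendor specs use
GIB = 2**30   # binary gibibyte

# Bytes per element is the bit width divided by 8: s = 2 for fp16, not 16.
BYTES_PER_ELEM = {"fp32": 4, "fp16": 2, "int8": 1}


def to_gib(n_bytes):
    return n_bytes / GIB


def to_gb(n_bytes):
    return n_bytes / GB


n_bytes = 2 * GIB
print(to_gib(n_bytes), to_gb(n_bytes))  # 2.0 2.147483648
```

The same byte count reads as 2 GiB or roughly 2.15 GB, so state the unit next to every number in derivations.md.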

When to consult `solutions/`

After all six stop conditions are met. The solution at solutions/00-derive-cache-size-ref.md (written at phase open) lets you compare your numbers and spot any arithmetic error.

Next lab: lab/01-implement-cache.md.