
Lab 01: MLA math exercise (pencil and paper, no model code)

Goal: derive the KV-cache reduction of Multi-head Latent Attention (MLA) from scratch. Plot the cache-vs-context curve at the grammar-tutor scale and at DeepSeek-V2's scale. Confirm that MLA is the right tool at DeepSeek-V2's scale and the wrong tool at ours.

Estimated time: 2 hours.

Prereq: theory/02-mla.md read; Phase 15 (attention) and Phase 22 (KV cache) understood. Pencil + paper recommended for the derivation; matplotlib for the plots.

What you produce

A directory experiments/36-mla-math-exercise/ containing:

derivation.md: your hand-derivation of MLA's KV-cache size, attention reorganization, and parameter overhead.
kv-cache-curves.py: script that plots KV cache size vs context length for: (a) standard MHA at grammar-tutor scale, (b) standard MHA at DeepSeek-V2 scale, (c) MLA at grammar-tutor scale, (d) MLA at DeepSeek-V2 scale.
kv-cache-curves.png: the plot.
findings.md: short report answering the question: at what scale does MLA start to earn its keep?

TODOs

Block A: derive MLA's KV-cache size

Starting from the standard MHA KV-cache formula:

\[M_{\text{MHA,token}} = 2 \cdot L \cdot H \cdot d_{\text{head}} \cdot \text{bytes\_per\_value}\]

Show in derivation.md:

Step 1. Standard MHA stores both \(K\) and \(V\) per layer, per head. Per-token cost = \(2 L H d_{\text{head}}\) values.
Step 2. MLA introduces \(W_{DKV} \in \mathbb{R}^{d_c \times d_{\text{model}}}\) down-projection. Per token, only \(c_t \in \mathbb{R}^{d_c}\) is cached. Per-token cost = \(L \cdot d_c\) values.
Step 3. Account for the decoupled-rotary K component \(k_t^{\text{rope}} \in \mathbb{R}^{d_h^R}\) that is still cached separately. Per-token cost becomes \(L \cdot (d_c + d_h^R)\).
Step 4. Compute the ratio \(M_{\text{MHA}} / M_{\text{MLA}}\) as a function of \(H, d_{\text{head}}, d_c, d_h^R\). Show the algebraic simplification.
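
As a check on Step 4, the simplification should look roughly like this once both sides are counted in values per token. The layer count \(L\) cancels, and so does the bytes-per-value factor if both caches use the same precision (an assumption):

\[\frac{M_{\text{MHA}}}{M_{\text{MLA}}} = \frac{2 \cdot L \cdot H \cdot d_{\text{head}}}{L \cdot (d_c + d_h^R)} = \frac{2 H d_{\text{head}}}{d_c + d_h^R}\]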

Worked example to fill in:

| Scale | \(L\) | \(H\) | \(d_{\text{head}}\) | \(d_c\) | \(d_h^R\) | \(M_{\text{MHA}}/\text{tok}\) | \(M_{\text{MLA}}/\text{tok}\) | Ratio |
|---|---|---|---|---|---|---|---|---|
| Grammar tutor | 4 | 4 | 16 | (try 16, 32, 64) | (try 8, 16) | ... | ... | ... |
| DeepSeek-V2 (paper) | 60 | 128 | 128 | 512 | 64 | ... | ... | ... |

Fill in the numbers. Comment on what you see.
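
Once the hand derivation is done, a few lines of Python can check your table entries. This is a minimal sketch that counts cached values per token (not bytes), following Steps 1 to 3. The function names and the `d_rope` parameter (standing in for \(d_h^R\)) are hypothetical, and the grammar-tutor row uses just one of the suggested \(d_c\), \(d_h^R\) settings.

```python
def mha_values_per_token(L, H, d_head):
    """Step 1: K and V are cached for every head in every layer."""
    return 2 * L * H * d_head


def mla_values_per_token(L, d_c, d_rope):
    """Step 3: per layer, cache the latent c_t plus the decoupled rotary key."""
    return L * (d_c + d_rope)


def reduction_ratio(L, H, d_head, d_c, d_rope):
    """Step 4: MHA cache size divided by MLA cache size (same precision assumed)."""
    return mha_values_per_token(L, H, d_head) / mla_values_per_token(L, d_c, d_rope)


# Rows of the table above; d_c=16, d_rope=8 is one of the settings to try.
configs = {
    "Grammar tutor": dict(L=4, H=4, d_head=16, d_c=16, d_rope=8),
    "DeepSeek-V2 (paper)": dict(L=60, H=128, d_head=128, d_c=512, d_rope=64),
}

for name, cfg in configs.items():
    mha = mha_values_per_token(cfg["L"], cfg["H"], cfg["d_head"])
    mla = mla_values_per_token(cfg["L"], cfg["d_c"], cfg["d_rope"])
    print(f"{name}: MHA={mha} MLA={mla} ratio={reduction_ratio(**cfg):.2f}")
```

Multiply either count by your chosen bytes per value to get bytes per token. Use the script to check your pencil-and-paper numbers, not to replace them.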

Block B: derive the attention reorganization

The naive MLA forward (described in theory/02-mla.md) reconstructs \(k_\tau = W_{UK} c_\tau\) at attention time, which would cost \(d_c \cdot d_{\text{model}}\) FLOPs per attended-to token. That's expensive.

The matmul trick: defining \(\hat{q}_t = W_{UK}^\top q_t\) at the current token only, you can rewrite the attention dot product as:

\[q_t^\top k_\tau = q_t^\top W_{UK} c_\tau = (W_{UK}^\top q_t)^\top c_\tau = \hat{q}_t^\top c_\tau\]

So you precompute \(\hat{q}_t\) once per token (cost \(O(d_c \cdot d_{\text{model}})\) at the current token only) and then the attention is over \(c_\tau\) directly (cost \(O(d_c \cdot T)\) instead of \(O(d_{\text{model}} \cdot T)\)). Net: fewer FLOPs per token at decode, not more.
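
The identity above is easy to sanity-check numerically. The sketch below uses a single head with made-up sizes and random matrices (all names and dimensions are illustrative assumptions, not values from the theory notes). It confirms that attending over rebuilt keys and attending over latents with a folded query give the same scores.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, d_c, T = 64, 16, 32  # hypothetical: key dim, latent dim, cached tokens

W_UK = rng.standard_normal((d_k, d_c))  # up-projection: latent -> key
q_t = rng.standard_normal(d_k)          # query at the current token
C = rng.standard_normal((T, d_c))       # cached latents c_tau, one row per token

# Naive: rebuild every key k_tau = W_UK c_tau, then dot each with the query.
K = C @ W_UK.T            # shape (T, d_k)
scores_naive = K @ q_t    # shape (T,)

# Reorganized: fold W_UK into the query once, then attend over the latents.
q_hat = W_UK.T @ q_t          # shape (d_c,)
scores_absorbed = C @ q_hat   # shape (T,)

assert np.allclose(scores_naive, scores_absorbed)
```

This only checks the pre-softmax scores for the non-rotary part of the key. The decoupled rotary component is handled separately and is not modeled here.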

In derivation.md:

Write out the matmul identity above.
Compute attention FLOPs per token, naive (decompressed) MLA vs the reorganized form.
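Compare the reorganized form with standard MHA on FLOPs per token at decode, assuming \(d_c < d_{\text{model}}\): show that it wins once the one-time cost of \(\hat{q}_t\) is amortized over enough cached tokens \(T\). Note that with \(H\) heads each head needs its own \(\hat{q}_t\), so the per-head comparison is \(d_c\) against \(d_{\text{head}}\).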
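
To cross-check the FLOP comparison, here is a rough multiply-accumulate (MAC) counter for the score computation of a single head at one decode step. It is a sketch under simplifying assumptions: softmax, the value path, and the rotary component are ignored, and the function names and the `d_k` parameter (the key dimension being reconstructed) are hypothetical.

```python
def macs_naive(T, d_k, d_c):
    """Rebuild T keys from latents (T * d_k * d_c), then T dot products of length d_k."""
    return T * d_k * d_c + T * d_k


def macs_absorbed(T, d_k, d_c):
    """Fold W_UK into the query once (d_k * d_c), then T dot products of length d_c."""
    return d_k * d_c + T * d_c


def macs_plain(T, d_k):
    """Standard attention with fully cached keys: T dot products of length d_k."""
    return T * d_k


# Example at a made-up size: 32 cached tokens, key dim 64, latent dim 16.
T, d_k, d_c = 32, 64, 16
print(macs_naive(T, d_k, d_c), macs_absorbed(T, d_k, d_c), macs_plain(T, d_k))
```

With these counts and \(d_c < d_k\), the reorganized form beats the naive one at every \(T \geq 1\). It beats plain cached-key attention only once \(T\) is large enough to amortize the one-time folding cost. Work out that threshold in derivation.md.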

Block C: compute parameter overhead

Compare parameter counts:

Standard MHA: \(W_Q, W_K, W_V, W_O\) per layer. Total: \(4 L d_{\text{model}}^2\).
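MLA: \(W_Q\) (or its low-rank variant), \(W_{DKV}, W_{UK}, W_{UV}, W_O\), plus the rotary split. Total, counting a full-rank \(W_Q\): \(L \cdot (2 d_{\text{model}}^2 + d_{\text{model}} \cdot (3 d_c + d_h^R))\) approximately. That is \(2 d_{\text{model}}^2\) for \(W_Q\) and \(W_O\), \(3 d_c d_{\text{model}}\) for \(W_{DKV}, W_{UK}, W_{UV}\), and \(d_h^R d_{\text{model}}\) for the rotary key projection.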

Show:

Parameter overhead ratio MLA/MHA as a function of \(d_c/d_{\text{model}}\).
At DeepSeek scale (\(d_c/d_{\text{model}} = 512/5120 = 0.1\)): parameter overhead = X%.
At grammar-tutor scale (\(d_c/d_{\text{model}} = 16/64 = 0.25\)): parameter overhead = X%.

The grammar-tutor scale has higher relative parameter overhead because \(d_c\) can't shrink below "useful" — there's a floor.
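
A per-matrix tally makes the comparison easy to check. The sketch below counts each matrix named in the list above exactly once, under two labeled assumptions: \(W_Q\) is kept full-rank (the low-rank query variant would lower the MLA count), and every projection is sized by \(d_{\text{model}}\), as in the \(4 L d_{\text{model}}^2\) formula. Function and parameter names are hypothetical.

```python
def mha_params_per_layer(d_model):
    """W_Q, W_K, W_V, W_O, each d_model x d_model."""
    return 4 * d_model**2


def mla_params_per_layer(d_model, d_c, d_rope):
    """Per-matrix tally for MLA; assumes a full-rank W_Q."""
    w_q = d_model**2          # assumption: full-rank query projection
    w_o = d_model**2          # output projection
    w_dkv = d_c * d_model     # shared down-projection to the latent
    w_uk = d_model * d_c      # key up-projection
    w_uv = d_model * d_c      # value up-projection
    w_kr = d_rope * d_model   # decoupled rotary key projection
    return w_q + w_o + w_dkv + w_uk + w_uv + w_kr


def param_ratio(d_model, d_c, d_rope):
    """MLA / MHA attention parameter count (the layer count cancels)."""
    return mla_params_per_layer(d_model, d_c, d_rope) / mha_params_per_layer(d_model)


print("DeepSeek-V2 scale:", param_ratio(d_model=5120, d_c=512, d_rope=64))
print("Grammar-tutor scale:", param_ratio(d_model=64, d_c=16, d_rope=8))
```

Under these assumptions the ratio is higher at the grammar-tutor scale than at DeepSeek scale, matching the observation above. Rerun it with the other suggested \(d_c\) and \(d_h^R\) values to see how quickly it moves.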

Block D: plot the curves

kv-cache-curves.py produces a single 2-panel figure:

Panel 1: KV cache size (MB) vs context length (tokens), at the grammar-tutor scale. Two lines: MHA, MLA. Annotate the maximum-context point (32 tokens for the tutor).

Panel 2: same plot at DeepSeek-V2 scale. Two lines: MHA, MLA. Annotate at 4k, 32k, 128k tokens.

Commit kv-cache-curves.png.

What to look for:

Panel 1: the two lines are both tiny (kilobytes). MLA's reduction is real but irrelevant.
Panel 2: the gap is dramatic. At 2 bytes per value, MHA reaches roughly 500 GB at 128k tokens; MLA stays under 10 GB.
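
One possible skeleton for kv-cache-curves.py is sketched below. It is not a reference solution. It assumes 2 bytes per cached value, treats "128k" as 131,072 tokens, reports MB as \(10^6\) bytes, and fixes the grammar-tutor latent sizes at \(d_c = 16\), \(d_h^R = 8\). All names (`SCALES`, `cache_mb`, `curves`, `plot`) are hypothetical.

```python
import numpy as np

BYTES_PER_VALUE = 2  # assumption: fp16/bf16 cache entries

SCALES = {
    "Grammar tutor": dict(L=4, H=4, d_head=16, d_c=16, d_rope=8,
                          max_ctx=32, marks=[32]),
    "DeepSeek-V2": dict(L=60, H=128, d_head=128, d_c=512, d_rope=64,
                        max_ctx=131072, marks=[4096, 32768, 131072]),
}


def cache_mb(tokens, values_per_token):
    """KV cache size in MB (10^6 bytes) for a given context length."""
    return tokens * values_per_token * BYTES_PER_VALUE / 1e6


def curves(cfg, n_points=200):
    """Return (tokens, MHA cache in MB, MLA cache in MB) for one scale."""
    tokens = np.linspace(1, cfg["max_ctx"], n_points)
    mha = cache_mb(tokens, 2 * cfg["L"] * cfg["H"] * cfg["d_head"])
    mla = cache_mb(tokens, cfg["L"] * (cfg["d_c"] + cfg["d_rope"]))
    return tokens, mha, mla


def plot(path="kv-cache-curves.png"):
    """Draw the 2-panel figure and save it to `path`."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(1, 2, figsize=(11, 4))
    for ax, (name, cfg) in zip(axes, SCALES.items()):
        tokens, mha, mla = curves(cfg)
        ax.plot(tokens, mha, label="MHA")
        ax.plot(tokens, mla, label="MLA")
        for t in cfg["marks"]:
            y = cache_mb(t, 2 * cfg["L"] * cfg["H"] * cfg["d_head"])
            ax.axvline(t, linestyle=":", linewidth=0.8, color="gray")
            ax.annotate(f"{t:,} tok", xy=(t, y), xytext=(-45, 5),
                        textcoords="offset points", fontsize=8)
        ax.set_title(name)
        ax.set_xlabel("context length (tokens)")
        ax.set_ylabel("KV cache size (MB)")
        ax.legend()
    fig.tight_layout()
    fig.savefig(path, dpi=150)
```

Calling `plot()` writes the figure. The two panels deliberately do not share a y-axis, since the grammar-tutor curves would be invisible on the DeepSeek-V2 scale. A log-scaled y-axis is a reasonable alternative if you want all four curves in one panel.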

Block E: the findings report

findings.md (~300 words):

At what context length, and at what corresponding KV cache size in MB, does MLA start to matter?
At the grammar tutor's max context (32 tokens), what's the absolute KV cache size with and without MLA?
What parameter overhead does MLA add at the grammar-tutor scale? Is it worth it?
Verdict: should we add MLA to the grammar tutor? Why or why not?
When would MLA help, hypothetically? (The 5-language-grammar-tutor counterfactual from theory/02.)

Constraints

No PyTorch code. This is a math + plotting exercise. Pure NumPy + matplotlib.
No model training. Don't try to "verify" MLA by training one — that's not the lab's goal.
Show your algebra. If derivation.md skips steps, you're not learning. Write each substitution explicitly.

Stop conditions

You're done when:

experiments/36-mla-math-exercise/{derivation.md, kv-cache-curves.py, kv-cache-curves.png, findings.md} all exist.
derivation.md covers all four derivations (KV cache size, cache ratio, attention reorganization, parameter overhead).
The plot has both panels and both context-length scales.
findings.md answers all five questions.
You can recite, from memory, "MLA reduces KV cache by [ratio]× at [parameters] / [latent dim] / [layers]." That's the magic number that comes out of the algebra.

Hint of last resort

If you get stuck on the matmul reorganization (Block B): spend 30 minutes wrestling with it first, then allow yourself to read the exposition in Section 2.1 of the DeepSeek-V2 paper. The reorganization is the key insight; the rest of the paper is implementation.

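If your KV-cache ratio at DeepSeek scale doesn't match \(2 H d_{\text{head}} / (d_c + d_h^R) = 32768 / 576 \approx 57\times\): double-check whether you accounted for \(d_h^R\) (the rotary K component cached separately). Forgetting that term gives an over-optimistic \(64\times\). Smaller reduction figures quoted elsewhere for DeepSeek-V2 are typically measured against a GQA baseline, not the full-MHA formula used in this lab.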

When to consult `solutions/`

After findings.md is committed. Solution lives in solutions/01-mla-math-ref.md — written at phase open. The reference includes the worked numbers at both scales, exactly the table from Block A filled in.

Next lab: lab/02-mamba-walkthrough.md.