Lab 01: MLA math exercise (pencil and paper, no model code)
Goal: derive the KV-cache reduction of Multi-head Latent Attention (MLA) from scratch. Plot the cache-vs-context curve at the grammar-tutor scale vs DeepSeek-V2's scale. Confirm: MLA is the right tool at the right scale, wrong tool at ours.
Estimated time: 2 hours.
Prereq: theory/02-mla.md read; Phase 15 (attention) and Phase 22 (KV cache) understood. Pencil + paper recommended for the derivation; matplotlib for the plots.
What you produce
A directory experiments/36-mla-math-exercise/ containing:
- derivation.md: your hand-derivation of MLA's KV-cache size, attention reorganization, and parameter overhead.
- kv-cache-curves.py: script that plots KV cache size vs context length for: (a) standard MHA at grammar-tutor scale, (b) standard MHA at DeepSeek-V2 scale, (c) MLA at grammar-tutor scale, (d) MLA at DeepSeek-V2 scale.
- kv-cache-curves.png: the plot.
- findings.md: short report: at what scale does MLA start to earn its keep?
TODOs
Block A: derive MLA's KV-cache size
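Starting from the standard MHA KV-cache formula (cache size in stored values for a context of \(T\) tokens; multiply by the bytes per value to get memory):
\[ M_{\text{MHA}} = 2 \cdot L \cdot H \cdot d_{\text{head}} \cdot T \]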
Show in derivation.md:
- Step 1. Standard MHA stores both \(K\) and \(V\) per layer, per head. Per-token cost = \(2 L H d_{\text{head}}\) values.
- Step 2. MLA introduces a down-projection \(W_{DKV} \in \mathbb{R}^{d_c \times d_{\text{model}}}\). Per token, only the latent \(c_t \in \mathbb{R}^{d_c}\) is cached. Per-token cost = \(L \cdot d_c\) values.
- Step 3. Account for the decoupled-rotary K component \(k_t^{\text{rope}} \in \mathbb{R}^{d_h^R}\) that is still cached separately. Per-token cost becomes \(L \cdot (d_c + d_h^R)\).
- Step 4. Compute the ratio \(M_{\text{MHA}} / M_{\text{MLA}}\) as a function of \(H, d_{\text{head}}, d_c, d_h^R\). Show the algebraic simplification.
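For reference once you have attempted Step 4 yourself, the simplification can be sketched as follows. It uses only the per-token costs from Steps 1 and 3; the layer count \(L\) cancels, which is why the ratio depends only on the four quantities named in Step 4.

\[
\frac{M_{\text{MHA}}}{M_{\text{MLA}}}
  = \frac{2 \, L \, H \, d_{\text{head}}}{L \, (d_c + d_h^R)}
  = \frac{2 \, H \, d_{\text{head}}}{d_c + d_h^R}
\]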
Worked example to fill in:
| Scale | \(L\) | \(H\) | \(d_{\text{head}}\) | \(d_c\) | \(d_h^R\) | \(M_{\text{MHA}}/\text{tok}\) | \(M_{\text{MLA}}/\text{tok}\) | Ratio |
|---|---|---|---|---|---|---|---|---|
| Grammar tutor | 4 | 4 | 16 | (try 16, 32, 64) | (try 8, 16) | ... | ... | ... |
| DeepSeek-V2 (paper) | 60 | 128 | 128 | 512 | 64 | ... | ... | ... |
Fill in the numbers. Comment on what you see.
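To check your hand arithmetic after filling in the table, a minimal sketch like the one below may help. It only encodes the per-token formulas from Steps 1 and 3; the function names are hypothetical and not part of the lab's required files.

```python
def mha_values_per_token(n_layers, n_heads, d_head):
    """Cached values per token for standard MHA: K and V, per layer, per head."""
    return 2 * n_layers * n_heads * d_head


def mla_values_per_token(n_layers, d_c, d_rope):
    """Cached values per token for MLA: latent c_t plus the decoupled rotary key."""
    return n_layers * (d_c + d_rope)


def cache_ratio(n_layers, n_heads, d_head, d_c, d_rope):
    """M_MHA / M_MLA per token; the layer count cancels in the ratio."""
    mha = mha_values_per_token(n_layers, n_heads, d_head)
    mla = mla_values_per_token(n_layers, d_c, d_rope)
    return mha / mla


if __name__ == "__main__":
    # Grammar-tutor row: sweep the candidate d_c and d_h^R values from the table.
    for d_c in (16, 32, 64):
        for d_rope in (8, 16):
            mha = mha_values_per_token(4, 4, 16)
            mla = mla_values_per_token(4, d_c, d_rope)
            print(f"tutor d_c={d_c} d_rope={d_rope}: "
                  f"MHA={mha} MLA={mla} ratio={mha / mla:.2f}")
    # DeepSeek-V2 row, using the values listed in the table.
    mha = mha_values_per_token(60, 128, 128)
    mla = mla_values_per_token(60, 512, 64)
    print(f"DeepSeek-V2: MHA={mha} MLA={mla} ratio={mha / mla:.2f}")
```

Run it only after you have written your own numbers down; the point of the block is to catch slips, not to replace the derivation.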
Block B: derive the attention reorganization
The naive MLA forward (described in theory/02-mla.md) reconstructs \(k_\tau = W_{UK} c_\tau\) at attention time, which would cost \(d_c \cdot d_{\text{model}}\) FLOPs per attended-to token. That's expensive.
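The matmul trick: define \(\hat{q}_t = W_{UK}^\top q_t\), computed at the current token only. The attention dot product can then be rewritten as:
\[ q_t^\top k_\tau = q_t^\top W_{UK} c_\tau = (W_{UK}^\top q_t)^\top c_\tau = \hat{q}_t^\top c_\tau \]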
So you precompute \(\hat{q}_t\) once per token (cost \(O(d_c \cdot d_{\text{model}})\) at the current token only) and then the attention is over \(c_\tau\) directly (cost \(O(d_c \cdot T)\) instead of \(O(d_{\text{model}} \cdot T)\)). Net: fewer FLOPs per token at decode, not more.
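The reorganization described above can be checked numerically before you write it out by hand. The sketch below is a single-head toy with random matrices and no rotary component; the sizes are hypothetical, chosen to resemble the grammar-tutor scale.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_c, T = 64, 16, 32  # hypothetical sizes for illustration

W_UK = rng.standard_normal((d_model, d_c))  # up-projection: k = W_UK @ c
q_t = rng.standard_normal(d_model)          # query at the current token
C = rng.standard_normal((T, d_c))           # cached latents c_tau, one row per past token

# Naive: reconstruct every key, then take dot products with q_t.
K = C @ W_UK.T          # shape (T, d_model)
scores_naive = K @ q_t  # shape (T,)

# Reorganized: absorb W_UK into the query once, then score against the latents.
q_hat = W_UK.T @ q_t      # shape (d_c,)
scores_reorg = C @ q_hat  # shape (T,)

identity_holds = np.allclose(scores_naive, scores_reorg)
print("scores match:", identity_holds)
```

The two score vectors agree up to floating-point error because matrix multiplication is associative; the derivation in derivation.md should make that step explicit.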
In derivation.md:
- Write out the matmul identity above.
- Compute attention FLOPs per token, naive (decompressed) MLA vs the reorganized form.
- Confirm the reorganized form is strictly better than standard MHA on FLOPs per token at decode, assuming \(d_c < d_{\text{model}}\).
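The FLOP comparison in the list above can be tabulated with a small cost model. The sketch below counts multiply-accumulates for the score computation only, using the simplified single-head costs quoted in this block; the function names and the choice to count the one-time query absorption are assumptions, not the lab's reference accounting.

```python
def naive_mla_score_cost(T, d_c, d_model):
    """Reconstruct k_tau = W_UK c_tau for each of T cached tokens, then dot with q_t."""
    reconstruct = d_c * d_model  # one matrix-vector product per cached token
    dot = d_model                # one length-d_model dot product per cached token
    return T * (reconstruct + dot)


def reorganized_mla_score_cost(T, d_c, d_model):
    """Absorb W_UK into the query once, then score against the cached latents."""
    absorb = d_c * d_model       # q_hat = W_UK^T q_t, paid once per decode step
    return absorb + T * d_c


def mha_score_cost(T, d_model):
    """Standard attention scores: one length-d_model dot product per cached key."""
    return T * d_model


if __name__ == "__main__":
    T, d_c, d_model = 32, 16, 64  # grammar-tutor-sized example values
    print("naive MLA:      ", naive_mla_score_cost(T, d_c, d_model))
    print("reorganized MLA:", reorganized_mla_score_cost(T, d_c, d_model))
    print("standard MHA:   ", mha_score_cost(T, d_model))
```

In this simplified model the one-time \(d_c \cdot d_{\text{model}}\) term has to be amortized over the context, so try a few values of \(T\) and state in derivation.md which terms you count when you make the comparison.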
Block C: compute parameter overhead
Compare parameter counts:
- Standard MHA: \(W_Q, W_K, W_V, W_O\) per layer. Total: \(4 L d_{\text{model}}^2\).
- MLA: \(W_Q\) (or its low-rank variant), \(W_{DKV}, W_{UK}, W_{UV}, W_O\), plus the rotary split. Total: \(L \cdot (d_{\text{model}} \cdot (d_c + 2 d_c + d_h^R) + d_{\text{model}}^2)\) approximately.
Show:
- Parameter overhead ratio MLA/MHA as a function of \(d_c/d_{\text{model}}\).
- At DeepSeek scale (\(d_c/d_{\text{model}} = 512/5120 = 0.1\)): parameter overhead = X%.
- At grammar-tutor scale (\(d_c/d_{\text{model}} = 16/64 = 0.25\)): parameter overhead = X%.
The grammar-tutor scale has higher relative parameter overhead because \(d_c\) can't shrink below "useful" — there's a floor.
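The parameter comparison can be evaluated with a few lines of Python. The sketch below transcribes the approximate totals given in this block; the `n_square` argument is a hypothetical knob (default 1, matching the single \(d_{\text{model}}^2\) term in the approximate MLA total) so you can see how the ratio moves if you count both a full-rank \(W_Q\) and \(W_O\).

```python
def mha_params(n_layers, d_model):
    """W_Q, W_K, W_V, W_O per layer: 4 * L * d_model^2."""
    return 4 * n_layers * d_model ** 2


def mla_params(n_layers, d_model, d_c, d_rope, n_square=1):
    """Approximate MLA total from the text.

    d_c      -> W_DKV
    2 * d_c  -> W_UK and W_UV
    d_rope   -> rotary split
    n_square -> number of full d_model x d_model matrices counted (assumption)
    """
    low_rank = d_model * (d_c + 2 * d_c + d_rope)
    return n_layers * (low_rank + n_square * d_model ** 2)


def param_ratio(n_layers, d_model, d_c, d_rope, n_square=1):
    """MLA / MHA parameter ratio; the layer count cancels."""
    return mla_params(n_layers, d_model, d_c, d_rope, n_square) / mha_params(
        n_layers, d_model
    )


if __name__ == "__main__":
    print("DeepSeek-V2:  ", param_ratio(60, 5120, 512, 64))
    print("Grammar tutor:", param_ratio(4, 64, 16, 8))
```

The grammar-tutor call uses \(d_c = 16\) and \(d_h^R = 8\), one of the candidate settings from the Block A table; swap in the others to see how sensitive the ratio is.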
Block D: plot the curves
kv-cache-curves.py produces a single 2-panel figure:
Panel 1: KV cache size (MB) vs context length (tokens), at the grammar-tutor scale. Two lines: MHA, MLA. Annotate the maximum-context point (32 tokens for the tutor).
Panel 2: same plot at DeepSeek-V2 scale. Two lines: MHA, MLA. Annotate at 4k, 32k, 128k tokens.
Commit kv-cache-curves.png.
What to look for:
- Panel 1: the two lines are both tiny (kilobytes). MLA's reduction is real but irrelevant.
- Panel 2: the gap is dramatic. With the Block A values at 16-bit precision, MHA goes far past 50 GB at 128k (roughly 500 GB), while MLA stays under 10 GB.
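One possible skeleton for kv-cache-curves.py is sketched below. Several choices are assumptions rather than lab requirements: 2 bytes per cached value, "4k / 32k / 128k" read as 4096 / 32768 / 131072 tokens, and \(d_c = 16\), \(d_h^R = 8\) for the tutor. The names `SCALES`, `curves`, and `plot_panels` are hypothetical.

```python
import numpy as np

BYTES_PER_VALUE = 2  # assumption: 16-bit cache entries; the lab does not fix a dtype

SCALES = {
    "Grammar tutor": dict(L=4, H=4, d_head=16, d_c=16, d_rope=8,
                          max_ctx=32, marks=(32,)),
    "DeepSeek-V2": dict(L=60, H=128, d_head=128, d_c=512, d_rope=64,
                        max_ctx=131072, marks=(4096, 32768, 131072)),
}


def cache_mb(values_per_token, context):
    """KV cache size in MB (10^6 bytes) for a given context length."""
    return values_per_token * context * BYTES_PER_VALUE / 1e6


def curves(cfg, n_points=200):
    """Return (context lengths, MHA cache in MB, MLA cache in MB) for one scale."""
    ctx = np.linspace(1, cfg["max_ctx"], n_points)
    mha = cache_mb(2 * cfg["L"] * cfg["H"] * cfg["d_head"], ctx)
    mla = cache_mb(cfg["L"] * (cfg["d_c"] + cfg["d_rope"]), ctx)
    return ctx, mha, mla


def plot_panels(path="kv-cache-curves.png"):
    """Draw the two-panel figure and save it to `path`."""
    import matplotlib
    matplotlib.use("Agg")  # render to file without a display
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(1, 2, figsize=(11, 4))
    for ax, (name, cfg) in zip(axes, SCALES.items()):
        ctx, mha, mla = curves(cfg)
        ax.plot(ctx, mha, label="MHA")
        ax.plot(ctx, mla, label="MLA")
        mha_per_tok = 2 * cfg["L"] * cfg["H"] * cfg["d_head"]
        for t in cfg["marks"]:
            y = cache_mb(mha_per_tok, t)
            ax.axvline(t, linestyle=":", linewidth=0.8, color="gray")
            ax.annotate(f"{t} tok: {y:.3g} MB", (t, y),
                        textcoords="offset points", xytext=(-6, -12), ha="right")
        ax.set_title(name)
        ax.set_xlabel("context length (tokens)")
        ax.set_ylabel("KV cache size (MB)")
        ax.legend()
    fig.tight_layout()
    fig.savefig(path, dpi=150)
```

Calling `plot_panels()` writes the PNG next to the script. A log-scaled y-axis on the second panel may make the MLA line easier to read, but that is a presentation choice for you to justify in findings.md.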
Block E: the findings report
findings.md (~300 words):
- What's the crossover point (the context length, and the corresponding KV cache size in MB) at which MLA starts to matter?
- At the grammar tutor's max context (32 tokens), what's the absolute KV cache size with and without MLA?
- What parameter overhead does MLA add at the grammar-tutor scale? Is it worth it?
- Verdict: should we add MLA to the grammar tutor? Why or why not?
- When would MLA help, hypothetically? (The 5-language-grammar-tutor counterfactual from theory/02.)
Constraints
- No PyTorch code. This is a math + plotting exercise. Pure NumPy + matplotlib.
- No model training. Don't try to "verify" MLA by training one — that's not the lab's goal.
- Show your algebra. If derivation.md skips steps, you're not learning. Write each substitution explicitly.
Stop conditions
You're done when:
- experiments/36-mla-math-exercise/{derivation.md, kv-cache-curves.py, kv-cache-curves.png, findings.md} all exist.
- derivation.md walks through all four blocks (KV cache size, attention reorganization, parameter overhead, ratios).
- The plot has both panels and both context-length scales.
- findings.md answers all five questions.
- You can recite, from memory, "MLA reduces KV cache by [ratio]× at [parameters] / [latent dim] / [layers]." That's the magic number that comes out of the algebra.
Hint of last resort
If you get stuck on the matmul reorganization (Block B): spend 30 minutes wrestling with it first, then allow yourself to re-read Section 2.1 of the DeepSeek-V2 paper. The reorganization is the key insight; the rest of the paper is implementation.
If your KV-cache ratio at DeepSeek scale doesn't come out near 57× (\(2 \cdot 128 \cdot 128 / (512 + 64)\) with the Block A table values): double-check whether you accounted for \(d_h^R\) (the rotary K component cached separately). Forgetting that term gives an over-optimistic ratio of 64×.
When to consult solutions/
After findings.md is committed. Solution lives in solutions/01-mla-math-ref.md — written at phase open. The reference includes the worked numbers at both scales, exactly the table from Block A filled in.
Next lab: lab/02-mamba-walkthrough.md.