Lab 01 — MLA math exercise (pencil and paper, no model code)

Goal: derive the KV-cache reduction of Multi-head Latent Attention (MLA) from scratch. Plot the cache-vs-context curve at the grammar-tutor scale and at DeepSeek-V2's scale. Confirm: MLA is the right tool at the right scale, and the wrong tool at ours.

Estimated time: 2 hours.

Prereq: theory/02-mla.md read; Phase 15 (attention) and Phase 22 (KV cache) understood. Pencil + paper recommended for the derivation; matplotlib for the plots.


What you produce

A directory experiments/36-mla-math-exercise/ containing:

  • derivation.md — your hand-derivation of MLA's KV-cache size, attention reorganization, and parameter overhead.
  • kv-cache-curves.py — script that plots KV cache size vs context length for: (a) standard MHA at grammar-tutor scale, (b) standard MHA at DeepSeek-V2 scale, (c) MLA at grammar-tutor scale, (d) MLA at DeepSeek-V2 scale.
  • kv-cache-curves.png — the plot.
  • findings.md — short report: at what scale does MLA start to earn its keep?

TODOs

Block A — derive MLA's KV-cache size

Starting from the standard MHA KV-cache formula:

\[M_{\text{MHA,token}} = 2 \cdot L \cdot H \cdot d_{\text{head}} \cdot \text{bytes\_per\_value}\]

Show in derivation.md:

  • Step 1. Standard MHA stores both \(K\) and \(V\) per layer, per head. Per-token cost = \(2 L H d_{\text{head}}\) values.
  • Step 2. MLA introduces \(W_{DKV} \in \mathbb{R}^{d_c \times d_{\text{model}}}\) down-projection. Per token, only \(c_t \in \mathbb{R}^{d_c}\) is cached. Per-token cost = \(L \cdot d_c\) values.
  • Step 3. Account for the decoupled-rotary K component \(k_t^{\text{rope}} \in \mathbb{R}^{d_h^R}\) that is still cached separately. Per-token cost becomes \(L \cdot (d_c + d_h^R)\).
  • Step 4. Compute the ratio \(M_{\text{MHA}} / M_{\text{MLA}}\) as a function of \(H, d_{\text{head}}, d_c, d_h^R\). Show the algebraic simplification.

Worked example to fill in:

| Scale | \(L\) | \(H\) | \(d_{\text{head}}\) | \(d_c\) | \(d_h^R\) | \(M_{\text{MHA}}/\text{tok}\) | \(M_{\text{MLA}}/\text{tok}\) | Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Grammar tutor | 4 | 4 | 16 | (try 16, 32, 64) | (try 8, 16) | ... | ... | ... |
| DeepSeek-V2 (paper) | 60 | 128 | 128 | 512 | 64 | ... | ... | ... |

Fill in the numbers. Comment on what you see.
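
Once the table is filled in by hand, a few lines of Python can check the arithmetic. This is a minimal sketch of the Step 1 to Step 4 formulas; the function names are hypothetical, and the grammar-tutor call uses one candidate pair (\(d_c = 16\), \(d_h^R = 8\)) picked from the "try" ranges as an assumption.

```python
def mha_values_per_token(L, H, d_head):
    # Step 1: K and V are stored for every layer and every head.
    return 2 * L * H * d_head


def mla_values_per_token(L, d_c, d_h_rope):
    # Steps 2-3: one latent c_t plus the decoupled rotary key, per layer.
    return L * (d_c + d_h_rope)


def cache_ratio(L, H, d_head, d_c, d_h_rope):
    # Step 4: M_MHA / M_MLA. bytes_per_value cancels, so it is left out.
    return mha_values_per_token(L, H, d_head) / mla_values_per_token(L, d_c, d_h_rope)


# DeepSeek-V2 row of the table.
print(cache_ratio(L=60, H=128, d_head=128, d_c=512, d_h_rope=64))

# One grammar-tutor candidate (assumed values from the "try" ranges).
print(cache_ratio(L=4, H=4, d_head=16, d_c=16, d_h_rope=8))
```

Compare the printed ratios with your hand-derived ones; a mismatch usually points to a dropped factor of 2 or a missing \(d_h^R\) term.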

Block B — derive the attention reorganization

The naive MLA forward (described in theory/02-mla.md) reconstructs \(k_\tau = W_{UK} c_\tau\) at attention time, which would cost \(d_c \cdot d_{\text{model}}\) FLOPs per attended-to token. That's expensive.

The matmul trick: defining \(\hat{q}_t = W_{UK}^\top q_t\) at the current token only, you can rewrite the attention dot product as:

\[q_t^\top k_\tau = q_t^\top W_{UK} c_\tau = (W_{UK}^\top q_t)^\top c_\tau = \hat{q}_t^\top c_\tau\]

So you precompute \(\hat{q}_t\) once per token (cost \(O(d_c \cdot d_{\text{model}})\) at the current token only) and then the attention is over \(c_\tau\) directly (cost \(O(d_c \cdot T)\) instead of \(O(d_{\text{model}} \cdot T)\)). Net: fewer FLOPs per token at decode, not more.
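
The identity is easy to sanity-check numerically before deriving it on paper. The sketch below uses random matrices at grammar-tutor-sized toy shapes (\(d_{\text{model}} = 64\), \(d_c = 16\), \(T = 32\), all chosen here for illustration) and a single-head view with no rotary component; it only confirms that both orderings give the same scores.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_c, T = 64, 16, 32  # toy shapes, assumed for illustration

W_UK = rng.standard_normal((d_model, d_c))  # up-projection: k = W_UK @ c
q_t = rng.standard_normal(d_model)          # query at the current token
C = rng.standard_normal((T, d_c))           # cached latents, one row per c_tau

# Naive form: rebuild every key from its latent, then take dot products.
K = C @ W_UK.T            # shape (T, d_model)
scores_naive = K @ q_t    # shape (T,)

# Reorganized form: fold W_UK into the query once, then score the latents.
q_hat = W_UK.T @ q_t      # shape (d_c,)
scores_reorg = C @ q_hat  # shape (T,)

print(np.allclose(scores_naive, scores_reorg))
```

The only difference between the two forms is where the parentheses sit in \(q_t^\top W_{UK} c_\tau\), which is exactly what the hand derivation should make explicit.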

In derivation.md:

  • Write out the matmul identity above.
  • Compute attention FLOPs per token, naive (decompressed) MLA vs the reorganized form.
  • Confirm the reorganized form beats standard MHA on score FLOPs per token at decode once the one-off cost of \(\hat{q}_t\) is amortized over the context, assuming \(d_c < d_{\text{model}}\) in the single-head view above.
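
To check the FLOP counts you derive, the costs quoted above can be tallied directly. This is a rough sketch under simplifying assumptions: it counts multiply-accumulates for the score computation only, uses the single-head view of the identity, and ignores the value side, the softmax, and the rotary component. The function name is hypothetical.

```python
def decode_score_macs(T, d_model, d_c):
    """Multiply-accumulates to score one query against T cached tokens."""
    return {
        # Rebuild each k_tau (d_c * d_model), then dot it with q_t (d_model).
        "naive_mla": T * (d_c * d_model + d_model),
        # Build q_hat once (d_c * d_model), then dot it with each c_tau (d_c).
        "reorganized_mla": d_c * d_model + T * d_c,
        # Keys are cached at full width, so each score costs d_model.
        "mha": T * d_model,
    }


# DeepSeek-V2 widths at a few context lengths.
for T in (1, 32, 4096):
    print(T, decode_score_macs(T, d_model=5120, d_c=512))
```

Printing a few context lengths shows how the one-off \(\hat{q}_t\) cost is spread across \(T\), which is worth stating explicitly in derivation.md.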

Block C — compute parameter overhead

Compare parameter counts:

  • Standard MHA: \(W_Q, W_K, W_V, W_O\) per layer. Total: \(4 L d_{\text{model}}^2\).
  • MLA: \(W_Q\) (or its low-rank variant), \(W_{DKV}, W_{UK}, W_{UV}, W_O\), plus the rotary key projection. With a full-rank \(W_Q\), total: \(L \cdot (2 d_{\text{model}}^2 + d_{\text{model}} \cdot (3 d_c + d_h^R))\) approximately, where \(3 d_c\) covers \(W_{DKV}\), \(W_{UK}\), and \(W_{UV}\).

Show:

  • Parameter overhead ratio MLA/MHA as a function of \(d_c/d_{\text{model}}\).
  • At DeepSeek scale (\(d_c/d_{\text{model}} = 512/5120 = 0.1\)): parameter overhead = X%.
  • At grammar-tutor scale (\(d_c/d_{\text{model}} = 16/64 = 0.25\)): parameter overhead = X%.

The grammar-tutor scale has higher relative parameter overhead because \(d_c\) can't shrink below "useful" — there's a floor.
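
A per-matrix tally is a useful cross-check on the closed-form count. The sketch below counts each matrix named above one by one. It assumes a full-rank \(W_Q\), treats the rotary split as a single \(d_h^R \times d_{\text{model}}\) key projection, and ignores any rotary query projection; the function names and the tutor's \(d_h^R = 8\) are assumptions made here.

```python
def mha_params_per_layer(d_model):
    # W_Q, W_K, W_V, W_O, each d_model x d_model.
    return 4 * d_model**2


def mla_params_per_layer(d_model, d_c, d_h_rope):
    w_q = d_model**2           # assumption: full-rank W_Q
    w_dkv = d_c * d_model      # down-projection to the latent
    w_uk = d_model * d_c       # up-projection for keys
    w_uv = d_model * d_c       # up-projection for values
    w_kr = d_h_rope * d_model  # assumption: one rotary key projection
    w_o = d_model**2           # output projection
    return w_q + w_dkv + w_uk + w_uv + w_kr + w_o


def param_ratio(d_model, d_c, d_h_rope):
    # MLA / MHA per layer; the layer count L cancels.
    return mla_params_per_layer(d_model, d_c, d_h_rope) / mha_params_per_layer(d_model)


print(param_ratio(d_model=5120, d_c=512, d_h_rope=64))  # DeepSeek-V2 scale
print(param_ratio(d_model=64, d_c=16, d_h_rope=8))      # grammar-tutor scale
```

If your hand-derived ratio disagrees, check first which matrices each side of the comparison includes.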

Block D — plot the curves

kv-cache-curves.py produces a single 2-panel figure:

Panel 1: KV cache size (MB) vs context length (tokens), at the grammar-tutor scale. Two lines: MHA, MLA. Annotate the maximum-context point (32 tokens for the tutor).

Panel 2: same plot at DeepSeek-V2 scale. Two lines: MHA, MLA. Annotate at 4k, 32k, 128k tokens.

Commit kv-cache-curves.png.

What to look for:

  • Panel 1: the two lines are both tiny (kilobytes). MLA's reduction is real but irrelevant.
  • Panel 2: the gap is dramatic. With the table's DeepSeek-V2 values at 1 or 2 bytes per value, MHA runs to hundreds of GB at 128k; MLA stays under 10 GB.
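
One possible skeleton for kv-cache-curves.py is sketched below; treat it as a starting point, not the reference solution. The 2-byte cache entries, the tutor's \(d_c = 16\) and \(d_h^R = 8\), the reading of "128k" as 131072 tokens, and all function and constant names are assumptions made here.

```python
import numpy as np

BYTES_PER_VALUE = 2  # assumption: fp16/bf16 cache entries
DIM_KEYS = ("L", "H", "d_head", "d_c", "d_h_rope")

SCALES = {
    "Grammar tutor": dict(L=4, H=4, d_head=16, d_c=16, d_h_rope=8,
                          max_ctx=32, marks=[32]),
    "DeepSeek-V2": dict(L=60, H=128, d_head=128, d_c=512, d_h_rope=64,
                        max_ctx=131072, marks=[4096, 32768, 131072]),
}


def cache_mb(tokens, L, H, d_head, d_c, d_h_rope):
    """Return (MHA, MLA) cache sizes in MB (10^6 bytes) for a context length."""
    tokens = np.asarray(tokens, dtype=float)
    mha = tokens * 2 * L * H * d_head * BYTES_PER_VALUE / 1e6
    mla = tokens * L * (d_c + d_h_rope) * BYTES_PER_VALUE / 1e6
    return mha, mla


def make_figure(path="kv-cache-curves.png"):
    # Imported here so cache_mb stays usable without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(1, 2, figsize=(12, 4.5))
    for ax, (name, s) in zip(axes, SCALES.items()):
        dims = {k: s[k] for k in DIM_KEYS}
        ctx = np.linspace(1, s["max_ctx"], 200)
        mha, mla = cache_mb(ctx, **dims)
        ax.plot(ctx, mha, label="MHA")
        ax.plot(ctx, mla, label="MLA")
        for t in s["marks"]:
            m_mha, m_mla = (float(v) for v in cache_mb(t, **dims))
            ax.axvline(t, color="gray", linestyle=":", linewidth=0.8)
            ax.annotate(f"{t} tok\nMHA {m_mha:.3g} MB\nMLA {m_mla:.3g} MB",
                        xy=(t, m_mha), ha="right", va="top", fontsize=8)
        ax.set_title(name)
        ax.set_xlabel("context length (tokens)")
        ax.set_ylabel("KV cache size (MB)")
        ax.legend(loc="upper left")
    fig.tight_layout()
    fig.savefig(path, dpi=150)
    plt.close(fig)
    return path
```

Calling `make_figure()` writes the PNG to the current directory. A linear y-axis keeps the MHA/MLA gap visually honest; a log axis is worth trying if the MLA line hugs zero in Panel 2.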

Block E — the findings report

findings.md (~300 words):

  • What's the crossover point at which MLA starts to matter? Give it as a context length and as the corresponding KV cache size in MB.
  • At the grammar tutor's max context (32 tokens), what's the absolute KV cache size with and without MLA?
  • What parameter overhead does MLA add at the grammar-tutor scale? Is it worth it?
  • Verdict: should we add MLA to the grammar tutor? Why or why not?
  • When would MLA help, hypothetically? (The 5-language-grammar-tutor counterfactual from theory/02.)

Constraints

  • No PyTorch code. This is a math + plotting exercise. Pure NumPy + matplotlib.
  • No model training. Don't try to "verify" MLA by training one — that's not the lab's goal.
  • Show your algebra. If derivation.md skips steps, you're not learning. Write each substitution explicitly.

Stop conditions

You're done when:

  1. experiments/36-mla-math-exercise/{derivation.md, kv-cache-curves.py, kv-cache-curves.png, findings.md} all exist.
  2. derivation.md walks through all four parts (KV cache size, attention reorganization, parameter overhead, ratios).
  3. The plot has both panels and both context-length scales.
  4. findings.md answers all five questions.
  5. You can recite, from memory, "MLA reduces KV cache by [ratio]× at [parameters] / [latent dim] / [layers]." That's the magic number that comes out of the algebra.

Hint of last resort

If you get stuck on the matmul reorganization (Block B): spend 30 minutes wrestling with it first, then allow yourself to read the exposition in Section 2.1 of the DeepSeek-V2 paper. The reorganization is the key insight; the rest of the paper is implementation.

If your KV-cache ratio at DeepSeek scale doesn't come out near 57× for the Block A table values (\(2 \cdot 128 \cdot 128 / (512 + 64) \approx 56.9\)): double-check whether you accounted for \(d_h^R\) (the rotary K component cached separately). Forgetting that term gives an over-optimistic 64×.

When to consult solutions/

After findings.md is committed. Solution lives in solutions/01-mla-math-ref.md — written at phase open. The reference includes the worked numbers at both scales, exactly the table from Block A filled in.


Next lab: lab/02-mamba-walkthrough.md.