02 - Multi-head Latent Attention (MLA)¶

The KV cache grows linearly with context and, in large models, fills the HBM. MLA (DeepSeek-V2) compresses it: instead of caching the full K and V, it caches a low-dimensional latent and reconstructs K and V on the fly. Typical reduction: 4–8×. Cost: extra compute per token (the reconstruction).

At long context, the KV cache is often the largest memory consumer at inference time. For a 32-layer, 32-head, 128-head-dim model at fp16:

\[\text{KV bytes/token} = 2 \cdot L \cdot H \cdot d_{\text{head}} \cdot 2 = 2 \cdot 32 \cdot 32 \cdot 128 \cdot 2 = 512\,\text{KB}/\text{token}\]

A 100k-token context: about 52 GB for a single sequence. The KV cache, not the model weights, is what makes long-context inference expensive.
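The arithmetic above is easy to reproduce. A minimal sketch in plain Python; the function name `kv_bytes_per_token` is ours for illustration, not from any library:

```python
def kv_bytes_per_token(layers: int, heads: int, head_dim: int, bytes_per_value: int) -> int:
    """Bytes of KV cache added per token: K and V, for every head in every layer."""
    return 2 * layers * heads * head_dim * bytes_per_value


per_token = kv_bytes_per_token(layers=32, heads=32, head_dim=128, bytes_per_value=2)
context_tokens = 100_000
total_bytes = per_token * context_tokens

print(per_token // 1024, "KiB per token")        # 512 KiB per token
print(round(total_bytes / 1e9, 1), "GB total")   # 52.4 GB total
```

The total is for one sequence; serving a batch multiplies it by the number of concurrent sequences.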

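DeepSeek-V2's Multi-head Latent Attention (MLA) compresses the K/V representation into a low-dimensional latent, caches only the latent, and reconstructs K and V on the fly when needed. Cache reduction: 4–8× is typical against a baseline that already shares K/V across heads; against full multi-head attention, the worked example in the memory-cost table below comes to roughly 18×. Compute cost: extra projections during attention, which the implementation largely folds into existing matmuls.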

The mechanism¶

Standard multi-head attention (refresher)¶

Given input \(h_t \in \mathbb{R}^{d_{\text{model}}}\) at position \(t\):

\[q_t = W_Q h_t, \quad k_t = W_K h_t, \quad v_t = W_V h_t\]

Each is \(\mathbb{R}^{d_{\text{model}}}\), then reshaped to \(H\) heads of \(d_{\text{head}}\) each (with \(H \cdot d_{\text{head}} = d_{\text{model}}\)).

The KV cache stores \(k_t\) and \(v_t\) for all \(t\) in the sequence. Memory cost per token: \(2 \cdot d_{\text{model}}\) values per layer.

MLA's compression¶

MLA introduces a "compressed latent" \(c_t \in \mathbb{R}^{d_c}\) with \(d_c \ll d_{\text{model}}\). The latent is a learned projection of the input:

\[c_t = W_{DKV} h_t \in \mathbb{R}^{d_c}\]

(where \(W_{DKV}\) is "Down-projection for KV", in DeepSeek's naming).

At attention time, \(k_t\) and \(v_t\) are reconstructed from \(c_t\):

\[k_t = W_{UK} c_t, \quad v_t = W_{UV} c_t\]

(Up-projections.) Now: the cache stores only \(c_t\), not \(k_t, v_t\).

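Memory cost per token: \(d_c\) values per layer, against \(2 \cdot d_{\text{model}}\) for standard MHA. With \(d_c = 512\) (DeepSeek-V2's choice for \(d_{\text{model}} = 5120\)), that is \(2 \cdot 5120 / 512 = 20\times\) fewer cached values, before the small rotary addition described in the next sections.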
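The compress-then-reconstruct flow can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not DeepSeek's implementation: the sizes are toy values, the weights are random stand-ins for learned matrices, and the helper names (`matvec`, `step`, `reconstruct`) are ours.

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Random stand-in for a learned weight matrix."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, d_c = 16, 4                     # toy sizes; the text uses 5120 and 512

W_DKV = rand_matrix(d_c, d_model, rng)   # down-projection, d_c x d_model
W_UK = rand_matrix(d_model, d_c, rng)    # up-projection for K, d_model x d_c
W_UV = rand_matrix(d_model, d_c, rng)    # up-projection for V, d_model x d_c

latent_cache = []                        # the only per-token state that is kept


def step(h_t):
    """Process one token: cache its latent c_t and nothing else."""
    latent_cache.append(matvec(W_DKV, h_t))


def reconstruct():
    """Rebuild K and V for every cached position from the latents."""
    keys = [matvec(W_UK, c) for c in latent_cache]
    values = [matvec(W_UV, c) for c in latent_cache]
    return keys, values


for _ in range(5):
    step([rng.gauss(0.0, 1.0) for _ in range(d_model)])

keys, values = reconstruct()
cached_per_token = len(latent_cache[0])
standard_per_token = len(keys[0]) + len(values[0])
print(cached_per_token, standard_per_token)   # 4 32
```

The cache holds `d_c` values per token, while the reconstructed K and V together hold `2 * d_model`, which is the ratio the paragraph above computes.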

What about Q?¶

Q gets a similar low-rank treatment in DeepSeek's MLA, but Q is not cached: it is needed only for the token currently being computed, so it is never reused across token positions. Compressing Q therefore saves nothing in the cache; the DeepSeek-V2 paper motivates it as a way to reduce activation memory during training. The math is in DeepSeek-V2's paper; for Phase 36 we focus on the K/V compression, which is the cache story.

Positional encoding hack¶

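A subtlety: the latent scheme does not combine cleanly with RoPE (rotary positional embeddings, Phase 16). RoPE rotates K and Q by a position-dependent matrix, and that rotation would sit between the query and \(W_{UK}\). The up-projection could then no longer be folded into the query side (the reorganization described under Compute cost), and keys would have to be rebuilt for every past token at each step. DeepSeek's workaround: split K into a "decoupled rotary part" \(k_t^{\text{rope}} \in \mathbb{R}^{d_h^R}\) and a "no-rotary part" \(k_t^{\text{c}} \in \mathbb{R}^{d_h^C}\). The rotary part is computed directly from \(h_t\), shared across heads, and cached as-is (small, \(d_h^R = 64\) in DeepSeek-V2); the no-rotary part is recovered from the latent.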

For Phase 36's purposes: knowing the trick exists is enough. The math of the rotary split is in the paper and the reading lab.
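For readers who want something concrete anyway, here is a rough sketch of what ends up in the cache under the split. It assumes one common RoPE layout (rotating consecutive pairs), toy dimensions, and made-up vectors; `rope` and `dot` are illustrative helpers, not functions from DeepSeek's code.

```python
import math


def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs of x by position-dependent angles."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c])
    return out


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


q_rope_part = [0.3, -1.2, 0.7, 0.1]   # made-up rotary slice of a query
k_rope_part = [1.0, 0.4, -0.5, 0.9]   # made-up rotary slice of a key

# Same relative offset (2), different absolute positions: the scores agree.
s1 = dot(rope(q_rope_part, 5), rope(k_rope_part, 3))
s2 = dot(rope(q_rope_part, 12), rope(k_rope_part, 10))
print(abs(s1 - s2) < 1e-9)   # True

# Cache entry for the token at position 3: latent plus already-rotated key part.
c_t = [0.2, -0.1, 0.5]                       # stand-in latent, d_c = 3
cache_entry = c_t + rope(k_rope_part, 3)     # d_c + d_h^R values
print(len(cache_entry))                       # 7
```

The rotary slice is stored with its rotation already applied, which is why it cannot live inside the position-free latent.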

The math, end-to-end¶

Memory cost¶

Setup	KV cache per token, per layer	KV cache for 100k tokens, 32 layers
Standard MHA, \(d_{\text{model}}=5120\)	\(2 \cdot 5120 = 10240\) values	65.5 GB at fp16
MLA, \(d_c = 512\), \(d_h^R = 64\)	\(512 + 64 = 576\) values	3.7 GB at fp16
Ratio	17.8×	17.8× reduction

That is a large part of why DeepSeek can serve its 236B-parameter model at long context on hardware where a standard MHA cache of the same length would be impractical.
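The table's rows can be recomputed directly. A small sketch, using decimal gigabytes (1 GB = 1e9 bytes); `cache_gb` is an illustrative helper name:

```python
def cache_gb(values_per_token_per_layer: int, layers: int, tokens: int, bytes_per_value: int) -> float:
    """Total cache size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return values_per_token_per_layer * layers * tokens * bytes_per_value / 1e9


d_model, d_c, d_rope = 5120, 512, 64
mha_values = 2 * d_model          # K and V
mla_values = d_c + d_rope         # latent plus decoupled rotary key

mha_gb = cache_gb(mha_values, layers=32, tokens=100_000, bytes_per_value=2)
mla_gb = cache_gb(mla_values, layers=32, tokens=100_000, bytes_per_value=2)

print(mha_values, mla_values)                 # 10240 576
print(round(mha_gb, 1), round(mla_gb, 1))     # 65.5 3.7
print(round(mha_values / mla_values, 1))      # 17.8
```

The ratio depends only on the per-token value counts, so it is the same at any context length, layer count, or precision.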

Compute cost¶

Standard attention compute per token (decode): \(O(d_{\text{model}}^2 + d_{\text{model}} \cdot T)\) where \(T\) is past length.

MLA adds:

- Down-projection: \(d_{\text{model}} \cdot d_c\) per token (small, since \(d_c \ll d_{\text{model}}\)).
- Up-projection of K and V at attention time: \(2 \cdot d_c \cdot d_{\text{model}}\) per token attended-to. Since attention attends to all past tokens, that's \(2 \cdot d_c \cdot d_{\text{model}} \cdot T\).

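Compared to standard attention's \(d_{\text{model}} \cdot T\) (the QK dot-product cost), naive reconstruction is about \(2 d_c\) times more compute per attended token. With \(d_c = 512\), that's ~1024×, which looks disastrous.

DeepSeek's MLA avoids this by reorganizing the matmuls so that K and V are never materialized per past token. Two identities do the work, with \(a_{t\tau}\) the softmax attention weights:

\[q_t^\top k_\tau = q_t^\top (W_{UK} c_\tau) = (W_{UK}^\top q_t)^\top c_\tau, \qquad \sum_\tau a_{t\tau} v_\tau = W_{UV} \sum_\tau a_{t\tau} c_\tau\]

\(W_{UK}\) is folded into the query once per step, and \(W_{UV}\) is applied once after the weighted sum (or precomputed into the output projection), so attention runs directly on the cached \(c_\tau\). The reconstruction term disappears, and the per-head cost over past tokens becomes \(O(d_c \cdot T)\), against \(O(d_{\text{head}} \cdot T)\) per head for standard attention. That is somewhat more arithmetic than standard attention, not less, but each past token now requires reading \(d_c\) cached values instead of \(2 \cdot d_{\text{model}}\), and decode is usually limited by memory bandwidth rather than FLOPs. The trick is in the implementation.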
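A quick numerical check makes the reorganization less mysterious. The sketch below is single-head, uses toy sizes and random stand-in weights, and omits the usual \(1/\sqrt{d}\) score scaling and the rotary part; it compares "rebuild K and V, then attend" against "fold \(W_{UK}\) into the query, attend over latents, apply \(W_{UV}\) once". All helper names are ours.

```python
import math
import random


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def transpose(W):
    return [list(col) for col in zip(*W)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
d_head, d_c, T = 8, 4, 6     # toy sizes, one head

W_UK = [[rng.uniform(-1, 1) for _ in range(d_c)] for _ in range(d_head)]
W_UV = [[rng.uniform(-1, 1) for _ in range(d_c)] for _ in range(d_head)]
q = [rng.gauss(0, 1) for _ in range(d_head)]
latents = [[rng.gauss(0, 1) for _ in range(d_c)] for _ in range(T)]

# Naive path: rebuild every key and value, then attend.
keys = [matvec(W_UK, c) for c in latents]
values = [matvec(W_UV, c) for c in latents]
weights = softmax([dot(q, k) for k in keys])
out_naive = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d_head)]

# Absorbed path: fold W_UK into the query, attend over latents, apply W_UV once.
q_absorbed = matvec(transpose(W_UK), q)                      # d_c values
weights_abs = softmax([dot(q_absorbed, c) for c in latents])
mixed = [sum(w * c[i] for w, c in zip(weights_abs, latents)) for i in range(d_c)]
out_absorbed = matvec(W_UV, mixed)

max_diff = max(abs(a - b) for a, b in zip(out_naive, out_absorbed))
print(max_diff < 1e-9)   # True
```

Both paths give the same output up to floating-point rounding; only the absorbed path avoids touching \(W_{UK}\) and \(W_{UV}\) once per past token.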

The reading lab (lab/01-mla-math-exercise.md) walks through the matmul reorganization.
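To put rough numbers on the options before opening the lab, the sketch below counts multiply-accumulates per layer for the terms that grow with the past length \(T\). It counts per head, so the absorbed score cost is `heads * d_c * T`. The value `heads=40` is an illustrative choice that satisfies this page's simplification `heads * d_head = d_model`; it is not DeepSeek-V2's actual head configuration, and `score_macs` is a hypothetical helper.

```python
def score_macs(d_model: int, heads: int, d_c: int, past_tokens: int) -> dict:
    """Rough multiply-accumulate counts per layer for one decode step.

    Only terms that scale with past_tokens are counted.
    """
    d_head = d_model // heads
    return {
        # q.k scores in standard attention: heads * d_head = d_model per past token
        "qk_standard": heads * d_head * past_tokens,
        # naive MLA: rebuild k and v from the latent for every past token
        "reconstruct": 2 * d_c * d_model * past_tokens,
        # absorbed MLA: each head scores a d_c-dimensional query against the latent
        "qk_absorbed": heads * d_c * past_tokens,
    }


counts = score_macs(d_model=5120, heads=40, d_c=512, past_tokens=1000)
print(counts["reconstruct"] // counts["qk_standard"])   # 1024
print(counts["qk_absorbed"] / counts["qk_standard"])    # 4.0
```

Under these assumptions the naive reconstruction term is about 1024 times the standard score cost, while the absorbed form is about `d_c / d_head = 4` times it, with no term proportional to `d_c * d_model * T`.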

What MLA is not¶

Not the same as Multi-Query Attention (MQA) or Grouped-Query Attention (GQA). Those reduce cache by sharing K and V across heads, whereas MLA keeps per-head K and V and compresses what is stored. The ideas are not mutually exclusive, but DeepSeek-V2 uses MLA in place of GQA; only its decoupled rotary key is shared across heads.
Not a post-hoc approximation. The reconstructed K and V are exactly \(W_{UK} c_t\) and \(W_{UV} c_t\). K and V are constrained to rank \(d_c\), but the model is trained with that constraint from the start, so the compression is not lossy in any output-quality sense (DeepSeek reports parity with MHA or better).
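Not free. \(W_K\) and \(W_V\) are replaced by three matrices per layer (\(W_{DKV}, W_{UK}, W_{UV}\)), plus the small rotary-key projection, and the attention code has to be reorganized to avoid paying for reconstruction. The parameter cost is modest: \(3 \cdot d_c \cdot d_{\text{model}}\) weights for the K/V path, against \(2 \cdot d_{\text{model}}^2\) for \(W_K\) and \(W_V\). Worth it for the roughly 18× cache reduction.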

Would MLA help the grammar tutor?¶

Apply the test:

MLA's bottleneck: "KV cache memory at long context, in a large model."
Does the grammar tutor have it? Memory math: \(L=4\) layers, \(H=4\) heads, \(d_{\text{head}}=16\), max context 32 tokens at fp32: KV cache = \(2 \cdot 4 \cdot 4 \cdot 16 \cdot 32 \cdot 4 = 65{,}536\) bytes = 64 KB total. There is no KV cache problem to solve. A 17× reduction of 64 KB leaves under 4 KB. So what.
What would MLA cost? Two new matrices per layer (~8 KB more parameters). New attention math. A new code path to maintain.
Verdict: Never.
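The scale gap is easiest to see side by side. A short sketch that plugs the tutor's dimensions and the 32-layer example from the top of this page into the same formula; `kv_cache_bytes` is an illustrative helper:

```python
def kv_cache_bytes(layers, heads, head_dim, tokens, bytes_per_value):
    """Total KV cache size: K and V for every head, layer, and token."""
    return 2 * layers * heads * head_dim * tokens * bytes_per_value


tutor = kv_cache_bytes(layers=4, heads=4, head_dim=16, tokens=32, bytes_per_value=4)
large = kv_cache_bytes(layers=32, heads=32, head_dim=128, tokens=100_000, bytes_per_value=2)

print(tutor // 1024, "KiB")     # 64 KiB
print(large // tutor)           # 800000
```

The tutor's entire cache is 64 KiB, roughly 800,000 times smaller than the long-context example that motivates MLA.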

MLA is the textbook example of "right tool, wrong scale." DeepSeek-V2 has the bottleneck (236B params, 128k context, multi-user serving on GPU clusters). Lynx-Cortex's grammar tutor has none of these. Recognizing this in 30 seconds, instead of building MLA "to be modern", is the skill the phase rewards.

When MLA would help (the counterfactual)¶

If the grammar tutor's vocabulary grew to 600k forms (5 languages × paraphrases and so on), the model grew to match, and you wanted to serve millions of concurrent users with 1k-token contexts, then you would have a KV cache problem, and MLA would earn its keep. Until then, it's a beautifully designed solution to a problem you don't have.

What this phase does NOT cover¶

MLA's decoupled-rotary split in full mathematical depth. The mechanism is explained; the full derivation is in the lab's reading.
Comparing MLA to other K/V-compression techniques (e.g., quantized KV cache, see Phase 26 / Q4_K_M). Phase 36 covers these conceptually only.
Implementing MLA in PyTorch. This phase is read-only. The reorganized attention matmul is non-trivial; the official DeepSeek implementation is ~200 lines and easier to read than rewrite.
MLA + speculative decoding interaction. Phase 36's theory/04-speculative-and-reasoning.md mentions it; it is not derived.

Next: theory/03-state-space-models.md.