English · Español
03 — State-space models: RWKV, S4, Mamba, and the selective scan¶
🇪🇸 La atención escala \(O(T^2)\) con la longitud de la secuencia. Los modelos de espacio de estados (SSM) la reemplazan por una recurrencia: cada token actualiza un estado oculto, sin atención al pasado. Tiempo lineal, memoria constante por capa. El "secret sauce" de Mamba sobre los S4 anteriores: los pesos de la recurrencia dependen del input (selectividad).
Attention's cost is the elephant in every long-context room. Compute is \(O(T^2 \cdot d_{\text{model}})\) for context length \(T\). KV cache grows linearly in \(T\). By \(T = 100k\), you're in serious memory and time territory; by \(T = 1M\), you're impossible without specialized hardware.
The state-space-model (SSM) family — including RWKV, S4, S5, and Mamba — sidesteps attention entirely. Each token's representation is computed as a recurrent state update, not as a soft lookup over the past. Linear time. Constant memory per layer. No KV cache.
For Phase 36 we focus on Mamba — the variant that's competitive with transformers — and the trick that makes it work: selective scan.
The starting point: linear recurrence¶
A simple linear recurrence over an input sequence \(x_1, \ldots, x_T\):
with \(A \in \mathbb{R}^{N \times N}, B \in \mathbb{R}^{N \times d_{\text{in}}}, C \in \mathbb{R}^{d_{\text{out}} \times N}\), \(h_t \in \mathbb{R}^N\) the hidden state.
Memory: \(h_t\) is the only state needed for \(h_{t+1}\) — constant \(O(N)\). Time: \(O(T)\) for the full sequence.
Compare: a standard LSTM is a nonlinear recurrence; an attention layer has no recurrence (every token-pair interaction is direct).
So why aren't SSMs the obvious choice? Two reasons:
- Without nonlinearity in the state update, capacity is limited. Vanilla linear SSMs underperform LSTMs and attention.
- Parameter sharing across timesteps means the model is input-independent in the recurrence dynamics. An LSTM's gates depend on the input; here \(A, B, C\) are fixed.
S4 (Gu et al., 2022) addressed (1) with structured \(A\) matrices (HiPPO) that have nice frequency-domain properties. S4 was competitive on long-range benchmarks but still not on language modeling.
Mamba (Gu and Dao, 2023) addressed (2) with selectivity.
Mamba's selective mechanism¶
Mamba parameterizes \(B, C\), and a discretization step size \(\Delta\) as input-dependent:
Now the recurrence's dynamics change at every token, conditioned on the input. The model can:
- "Open the gate" for a relevant token (large \(\Delta\), large \(B\)): the hidden state absorbs the new input.
- "Close the gate" for an irrelevant token (small \(\Delta\), small \(B\)): the hidden state stays roughly unchanged.
- Adjust the output projection \(C\) per token.
This is Mamba. The "selective" in its name is exactly this input-dependent parameterization.
Discretization¶
Mamba defines a continuous-time linear system \(\dot{h}(t) = A h(t) + B x(t)\) and discretizes via zero-order hold:
\(\Delta\) controls the timescale: larger \(\Delta\) means each timestep advances the system further in continuous time, giving more weight to the new input.
In practice \(A\) is parameterized as a diagonal matrix so \(\exp(\Delta A)\) is element-wise — a tractable matrix exponential.
Parallel scan¶
The recurrence \(h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t\) looks sequential, but it's an associative operation: there's a parallel scan algorithm that computes all \(h_t\) in \(O(\log T)\) depth on GPU. The official Mamba implementation has a custom CUDA kernel for this; mamba-minimal uses a slower Python/JAX version.
This is what makes Mamba fast at training despite the recurrence: parallel scan exploits GPU parallelism in a way RNNs cannot.
Inference cost¶
At decode time, Mamba is genuinely constant per token:
- Update \(h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t\): \(O(N \cdot d_{\text{in}})\) — independent of past length.
- Output \(y_t = C_t h_t\): \(O(N \cdot d_{\text{out}})\).
Total: \(O(N \cdot (d_{\text{in}} + d_{\text{out}}))\), no dependence on past sequence length.
Memory per layer at inference: \(h_t \in \mathbb{R}^N\). No growing KV cache.
For Mamba's released models (~3B params, \(N = 64\)): per-token inference is \(\sim 200\)k FLOPs — comparable to a small transformer at modest context. The win shows up at long context. A 100k-token Mamba decode is the same speed as a 100-token Mamba decode. A 100k-token transformer decode is 1000× slower than a 100-token decode.
Hybrid models (Jamba)¶
Pure Mamba has its own limits: empirically, it does worse than transformers on tasks requiring exact recall (in-context lookup of a specific token). The hypothesis: the recurrence compresses information into the state; exact retrieval of a specific past token is harder than attention's direct lookup.
Jamba (AI21, 2024) interleaves Mamba blocks with attention blocks. Most layers are Mamba (cheap, fast); a few are attention (capability). Best of both: linear-ish total cost, exact recall when needed.
Mention; do not derive. The hybridization principle is the takeaway.
RWKV — a cousin¶
RWKV ("Receptance Weighted Key Value") is a different SSM family — same goal (linear-time recurrence with attention-like capability), different mechanics. Uses a "time-mix" + "channel-mix" pattern with no explicit attention. Open-source community-driven; competitive with mid-size transformers; ongoing iteration.
For Phase 36: mention RWKV exists, point to the website. Don't derive.
Would Mamba help the grammar tutor?¶
The test:
- Mamba's bottleneck: "attention's quadratic time and growing KV cache at very long context."
- Does the grammar tutor have it? Max context: 32 tokens. Attention cost: \(32^2 \cdot d_{\text{model}}\). With \(d_{\text{model}} = 64\), that's 131k FLOPs — negligible. KV cache for 32 tokens at this scale: \(\sim 16\) KB total. No bottleneck.
- What would Mamba cost? Replacing attention with selective scan in
src/minimodel/transformer.py: substantial rewrite. New tuning surface (\(N\), \(\Delta\) scaling, selectivity parameterization). New failure modes (state collapse, instability). - Verdict: Never. The grammar tutor's tasks involve short-range syntactic agreement (subject-verb-tense), exactly where attention's direct-lookup property is more useful than recurrence-and-recall.
Subtle point on Mamba and grammar¶
Grammar tasks specifically reward attention. A verb-conjugation correction needs to attend to the subject's person and number — a precise, direct lookup of a specific past token. Mamba can learn this through its state but does it less crisply than attention's pinpoint mechanism. For the grammar tutor, attention is positively the right choice — not just "fine to keep", actually better than the alternatives. That's a nicer outcome than "doesn't help"; it's "the original design was correct, and we can now defend it."
When Mamba would help (the counterfactual)¶
A 1M-token context grammar-correction agent (book-length proofreading, parsing entire novels for tense consistency): the attention KV cache becomes prohibitive. Mamba's constant-memory recurrence is then attractive. But you'd also accept some quality drop on exact-recall queries — a tradeoff requiring a hybrid (Jamba-like) design.
Out of scope, but Borja should be able to sketch this conversation by phase end.
What this phase does NOT cover¶
- Implementing Mamba. Read-only via
mamba-minimal. The full Mamba codebase has CUDA kernels; not in our wheelhouse. - S4 and HiPPO derivation. Mentioned; derivation is in the original paper. Phase 36 stops at "S4 was the predecessor; Mamba added selectivity."
- RWKV in math depth. Mention only.
- Training a Mamba on the grammar corpus. Would take weeks of CPU time and produce a clearly worse result. The math in §"Would Mamba help" is the answer; no need to spend the compute.
- Hybrid attention+Mamba architectures. Vocabulary (Jamba) only.
Next: theory/04-speculative-and-reasoning.md.