English · Español

03 — State-space models: RWKV, S4, Mamba, and the selective scan¶

🇪🇸 La atención escala \(O(T^2)\) con la longitud de la secuencia. Los modelos de espacio de estados (SSM) la reemplazan por una recurrencia: cada token actualiza un estado oculto, sin atención al pasado. Tiempo lineal, memoria constante por capa. El "secret sauce" de Mamba sobre los S4 anteriores: los pesos de la recurrencia dependen del input (selectividad).

Attention's cost is the elephant in every long-context room. Compute is \(O(T^2 \cdot d_{\text{model}})\) for context length \(T\). KV cache grows linearly in \(T\). By \(T = 100k\), you're in serious memory and time territory; by \(T = 1M\), you're impossible without specialized hardware.

The state-space-model (SSM) family — including RWKV, S4, S5, and Mamba — sidesteps attention entirely. Each token's representation is computed as a recurrent state update, not as a soft lookup over the past. Linear time. Constant memory per layer. No KV cache.

For Phase 36 we focus on Mamba — the variant that's competitive with transformers — and the trick that makes it work: selective scan.

The starting point: linear recurrence¶

A simple linear recurrence over an input sequence \(x_1, \ldots, x_T\):

\[h_t = A h_{t-1} + B x_t, \quad y_t = C h_t\]

with \(A \in \mathbb{R}^{N \times N}, B \in \mathbb{R}^{N \times d_{\text{in}}}, C \in \mathbb{R}^{d_{\text{out}} \times N}\), \(h_t \in \mathbb{R}^N\) the hidden state.

Memory: \(h_t\) is the only state needed for \(h_{t+1}\) — constant \(O(N)\). Time: \(O(T)\) for the full sequence.

Compare: a standard LSTM is a nonlinear recurrence; an attention layer has no recurrence (every token-pair interaction is direct).

So why aren't SSMs the obvious choice? Two reasons:

Without nonlinearity in the state update, capacity is limited. Vanilla linear SSMs underperform LSTMs and attention.
Parameter sharing across timesteps means the model is input-independent in the recurrence dynamics. An LSTM's gates depend on the input; here \(A, B, C\) are fixed.

S4 (Gu et al., 2022) addressed (1) with structured \(A\) matrices (HiPPO) that have nice frequency-domain properties. S4 was competitive on long-range benchmarks but still not on language modeling.

Mamba (Gu and Dao, 2023) addressed (2) with selectivity.

Mamba's selective mechanism¶

Mamba parameterizes \(B, C\), and a discretization step size \(\Delta\) as input-dependent:

\[B = \text{Linear}_B(x_t), \quad C = \text{Linear}_C(x_t), \quad \Delta = \text{softplus}(\text{Linear}_\Delta(x_t))\]

Now the recurrence's dynamics change at every token, conditioned on the input. The model can:

"Open the gate" for a relevant token (large \(\Delta\), large \(B\)): the hidden state absorbs the new input.
"Close the gate" for an irrelevant token (small \(\Delta\), small \(B\)): the hidden state stays roughly unchanged.
Adjust the output projection \(C\) per token.

This is Mamba. The "selective" in its name is exactly this input-dependent parameterization.

Discretization¶

Mamba defines a continuous-time linear system \(\dot{h}(t) = A h(t) + B x(t)\) and discretizes via zero-order hold:

\[\bar{A} = \exp(\Delta A), \quad \bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I) \cdot \Delta B\]

\(\Delta\) controls the timescale: larger \(\Delta\) means each timestep advances the system further in continuous time, giving more weight to the new input.

In practice \(A\) is parameterized as a diagonal matrix so \(\exp(\Delta A)\) is element-wise — a tractable matrix exponential.

Parallel scan¶

The recurrence \(h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t\) looks sequential, but it's an associative operation: there's a parallel scan algorithm that computes all \(h_t\) in \(O(\log T)\) depth on GPU. The official Mamba implementation has a custom CUDA kernel for this; mamba-minimal uses a slower Python/JAX version.

This is what makes Mamba fast at training despite the recurrence: parallel scan exploits GPU parallelism in a way RNNs cannot.

Inference cost¶

At decode time, Mamba is genuinely constant per token:

Update \(h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t\): \(O(N \cdot d_{\text{in}})\) — independent of past length.
Output \(y_t = C_t h_t\): \(O(N \cdot d_{\text{out}})\).

Total: \(O(N \cdot (d_{\text{in}} + d_{\text{out}}))\), no dependence on past sequence length.

Memory per layer at inference: \(h_t \in \mathbb{R}^N\). No growing KV cache.

For Mamba's released models (~3B params, \(N = 64\)): per-token inference is \(\sim 200\)k FLOPs — comparable to a small transformer at modest context. The win shows up at long context. A 100k-token Mamba decode is the same speed as a 100-token Mamba decode. A 100k-token transformer decode is 1000× slower than a 100-token decode.

Hybrid models (Jamba)¶

Pure Mamba has its own limits: empirically, it does worse than transformers on tasks requiring exact recall (in-context lookup of a specific token). The hypothesis: the recurrence compresses information into the state; exact retrieval of a specific past token is harder than attention's direct lookup.

Jamba (AI21, 2024) interleaves Mamba blocks with attention blocks. Most layers are Mamba (cheap, fast); a few are attention (capability). Best of both: linear-ish total cost, exact recall when needed.

Mention; do not derive. The hybridization principle is the takeaway.

RWKV — a cousin¶

RWKV ("Receptance Weighted Key Value") is a different SSM family — same goal (linear-time recurrence with attention-like capability), different mechanics. Uses a "time-mix" + "channel-mix" pattern with no explicit attention. Open-source community-driven; competitive with mid-size transformers; ongoing iteration.

For Phase 36: mention RWKV exists, point to the website. Don't derive.

Would Mamba help the grammar tutor?¶

The test:

Mamba's bottleneck: "attention's quadratic time and growing KV cache at very long context."
Does the grammar tutor have it? Max context: 32 tokens. Attention cost: \(32^2 \cdot d_{\text{model}}\). With \(d_{\text{model}} = 64\), that's 131k FLOPs — negligible. KV cache for 32 tokens at this scale: \(\sim 16\) KB total. No bottleneck.
What would Mamba cost? Replacing attention with selective scan in src/minimodel/transformer.py: substantial rewrite. New tuning surface (\(N\), \(\Delta\) scaling, selectivity parameterization). New failure modes (state collapse, instability).
Verdict: Never. The grammar tutor's tasks involve short-range syntactic agreement (subject-verb-tense), exactly where attention's direct-lookup property is more useful than recurrence-and-recall.

Subtle point on Mamba and grammar¶

Grammar tasks specifically reward attention. A verb-conjugation correction needs to attend to the subject's person and number — a precise, direct lookup of a specific past token. Mamba can learn this through its state but does it less crisply than attention's pinpoint mechanism. For the grammar tutor, attention is positively the right choice — not just "fine to keep", actually better than the alternatives. That's a nicer outcome than "doesn't help"; it's "the original design was correct, and we can now defend it."

When Mamba would help (the counterfactual)¶

A 1M-token context grammar-correction agent (book-length proofreading, parsing entire novels for tense consistency): the attention KV cache becomes prohibitive. Mamba's constant-memory recurrence is then attractive. But you'd also accept some quality drop on exact-recall queries — a tradeoff requiring a hybrid (Jamba-like) design.

Out of scope, but Borja should be able to sketch this conversation by phase end.

What this phase does NOT cover¶

Implementing Mamba. Read-only via mamba-minimal. The full Mamba codebase has CUDA kernels; not in our wheelhouse.
S4 and HiPPO derivation. Mentioned; derivation is in the original paper. Phase 36 stops at "S4 was the predecessor; Mamba added selectivity."
RWKV in math depth. Mention only.
Training a Mamba on the grammar corpus. Would take weeks of CPU time and produce a clearly worse result. The math in §"Would Mamba help" is the answer; no need to spend the compute.
Hybrid attention+Mamba architectures. Vocabulary (Jamba) only.

Next: theory/04-speculative-and-reasoning.md.