Lab 02: Mamba selective-scan walkthrough (reading lab)

Goal: read mamba-minimal end-to-end. Annotate the selective-scan logic. Write a 1-page summary connecting the math (theory/03) to the code.

Estimated time: 2–3 hours.

Prereq: theory/03-state-space-models.md understood. Phase 25 (PyTorch internals) done. Borja can read a transformer reference implementation comfortably.

What you produce

A directory experiments/36-mamba-walkthrough/ containing:

mamba-sha.txt — the SHA of the mamba-minimal repo read.
walkthrough.md — ~1-page annotated reading of the selective-scan core (selective_scan_ref or equivalent).
state-update-diagram.mmd — mermaid diagram of one step of selective scan, annotated with shapes.
grammar-tutor-applicability.md — short verdict: would Mamba help the grammar tutor?

TODOs

Block A: clone mamba-minimal

git clone https://github.com/johnma2006/mamba-minimal /tmp/mamba-minimal
cd /tmp/mamba-minimal
mkdir -p /home/overdrive/claude/lynx-cortex/experiments/36-mamba-walkthrough
git rev-parse HEAD > /home/overdrive/claude/lynx-cortex/experiments/36-mamba-walkthrough/mamba-sha.txt

mamba-minimal is a deliberately pedagogical reimplementation (~300 LOC). It is much easier to read than the official Mamba repo, which relies on custom CUDA kernels, so read the educational version.

Block B: walk the file

The interesting file is model.py. Focus on these functions / classes:

MambaBlock — the building block (analogous to a transformer block).
selective_scan (or selective_scan_ref depending on version) — the core recurrence.

In walkthrough.md, write annotations covering:

The discretization step. Where in the code is the continuous-to-discrete transition (computing \(\bar{A}, \bar{B}\) from \(A, B, \Delta\))? Cite line numbers.
The selectivity. Which lines make \(B, C, \Delta\) input-dependent? (As opposed to S4, where these are fixed parameters.)
The state update. Trace \(h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t\) — find the corresponding line(s).
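The output readout. Where is \(y_t = C_t h_t\)? Note whether the code also adds a skip term \(D x_t\), and do not confuse this readout with any linear output projection at the end of MambaBlock.
The convolution. Mamba applies a short 1D convolution before the SSM. Why? (Hint: it mixes each token with its few immediate predecessors, giving the SSM local context. Check how the code keeps it causal.)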
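As orientation for points 1 to 4, here is a minimal NumPy sketch of a selective scan over a single sequence. It is not mamba-minimal's code: the batch dimension is dropped, the function name and the projection weights `W_delta`, `W_B`, `W_C` are hypothetical, and it assumes the simplified \(\bar{B}_t \approx \Delta_t B_t\) discretization. Check what your pinned version actually does.

```python
import numpy as np

def selective_scan_sketch(x, A, W_delta, W_B, W_C):
    """Illustrative selective scan for one sequence (no batch dimension).

    x: (L, D) inputs; A: (D, N) fixed parameter with negative entries.
    W_delta: (D, D), W_B: (D, N), W_C: (D, N) are hypothetical projections.
    Returns y: (L, D).
    """
    L, D = x.shape
    N = A.shape[1]

    # Selectivity: delta, B and C are recomputed from the input at every step.
    delta = np.logaddexp(0.0, x @ W_delta)  # softplus keeps step sizes positive, (L, D)
    B = x @ W_B                             # (L, N)
    C = x @ W_C                             # (L, N)

    h = np.zeros((D, N))
    y = np.empty((L, D))
    for t in range(L):
        # Discretization: A_bar = exp(delta * A); B_bar ~= delta * B (assumed simplification).
        A_bar = np.exp(delta[t][:, None] * A)       # (D, N)
        B_bar = delta[t][:, None] * B[t][None, :]   # (D, N)
        # State update: h_t = A_bar * h_{t-1} + B_bar * x_t (elementwise, A is diagonal).
        h = A_bar * h + B_bar * x[t][:, None]       # (D, N)
        # Readout: y_t = C_t h_t, contracting over the state dimension N.
        y[t] = h @ C[t]                             # (D,)
    return y
```

When you read the real code, look for the same four stages; a vectorized implementation may precompute \(\bar{A}\) and \(\bar{B} x\) for all time steps before the loop.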

Each annotation: 2-3 sentences + line citation. ~5-8 annotations total.
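For the convolution question, it can help to see what a causal depthwise 1D convolution computes in isolation. The sketch below is a plain NumPy illustration under stated assumptions: the function name is hypothetical, there is no bias or batch dimension, and the kernel width is whatever you pass in. It is not a transcription of the repo's layer.

```python
import numpy as np

def causal_depthwise_conv1d(x, w):
    """x: (L, D) sequence; w: (D, K), one length-K filter per channel (depthwise).

    The output at step t only sees x[t-K+1 .. t]; left zero-padding keeps it causal.
    """
    L, D = x.shape
    K = w.shape[1]
    padded = np.concatenate([np.zeros((K - 1, D)), x], axis=0)  # (L + K - 1, D)
    out = np.empty((L, D))
    for t in range(L):
        window = padded[t:t + K]               # (K, D), rows are x[t-K+1 .. t]
        out[t] = np.sum(window * w.T, axis=0)  # per-channel dot product, no channel mixing
    return out
```

Each output step is a weighted sum of the current token and its K-1 predecessors within the same channel, so the SSM input already carries a little local context.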

Block C: the shape diagram

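Draw the mermaid diagram for one selective-scan step. In the shape tuples, B is the batch size, D the number of channels, and N the state size; B_t is the SSM input matrix, not the batch: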

flowchart LR
    x["x_t : (B, D)"] --> deltaP["Linear -> Δ_t : (B, D)"]
    x --> Bproj["Linear -> B_t : (B, N)"]
    x --> Cproj["Linear -> C_t : (B, N)"]
    A_param["A : (D, N) param"] -. discretize .-> Abar["Ā_t : (B, D, N)"]
    deltaP --> Abar
    Bproj --> Bbar["B̄_t : (B, D, N)"]
    deltaP --> Bbar
    h_prev["h_{t-1} : (B, D, N)"] --> update["h_t = Ā_t · h_{t-1} + B̄_t · x_t"]
    x --> update
    Abar --> update
    Bbar --> update
    update --> h_t["h_t : (B, D, N)"]
    h_t --> output["y_t = C_t · h_t : (B, D)"]
    Cproj --> output

Commit as state-update-diagram.mmd. Annotate the diagram by adding a "where in the code" reference next to each box (e.g., "Δ projection: line 142").

Block D: the grammar-tutor applicability

Write grammar-tutor-applicability.md (~200 words):

Would Mamba help the grammar tutor? (Spoiler from theory/03: no.)
What specifically about the grammar-tutor's task makes attention strictly better than Mamba? (Answer: subject-verb-tense agreement requires precise lookup of a specific past token; Mamba compresses past into a state, attention reads directly.)
When would you reach for Mamba? (Answer: very long contexts, where the KV cache, which grows linearly with sequence length, no longer fits in memory; Mamba's recurrent state stays a fixed size.)
What about a hybrid (Jamba-like) approach? Could a single attention layer + multiple Mamba layers help? (Hint: probably not, at our 32-token max context. Attention's compute at this scale is negligible.)
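To put numbers behind the KV-cache argument, here is a back-of-envelope sketch. Every model dimension in it is hypothetical (it is not the grammar tutor's real configuration), and it ignores details such as multi-query attention and Mamba's small convolution state.

```python
def kv_cache_bytes(n_layers, d_model, seq_len, bytes_per_elem=4):
    # Keys and values: two (seq_len, d_model) tensors per layer.
    return 2 * n_layers * seq_len * d_model * bytes_per_elem

def ssm_state_bytes(n_layers, d_inner, d_state, bytes_per_elem=4):
    # One (d_inner, d_state) recurrent state per layer, independent of seq_len.
    return n_layers * d_inner * d_state * bytes_per_elem

# Hypothetical small model, fp32.
n_layers, d_model, d_inner, d_state = 4, 256, 512, 16

for seq_len in (32, 4096, 1_000_000):
    kv = kv_cache_bytes(n_layers, d_model, seq_len)
    ssm = ssm_state_bytes(n_layers, d_inner, d_state)
    print(f"seq_len={seq_len:>9}: KV cache {kv / 2**20:10.2f} MiB, SSM state {ssm / 2**20:.2f} MiB")
```

Under these assumptions the KV cache at 32 tokens is a fraction of a MiB, which is why the fixed-size state buys nothing at that scale; the gap only opens up as the context grows by several orders of magnitude.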

Constraints

No Mamba training. This is a reading lab. Spinning up Mamba inference is fine if you want to feel the speed; training is out of scope (and slow on CPU).
No copying source. Cite line ranges, not full code. The walkthrough should be a summary, not a transcription.
Mermaid diagrams only. No PNGs from external tools — keep things diff-able.
CPU-only, zero cloud spend.

Stop conditions

You're done when:

experiments/36-mamba-walkthrough/{mamba-sha.txt, walkthrough.md, state-update-diagram.mmd, grammar-tutor-applicability.md} all exist.
walkthrough.md has ≥5 annotated points with line citations.
The mermaid diagram is committed and renders correctly.
The grammar-tutor applicability note answers all four questions.
You can explain, from memory, "what makes Mamba 'selective'" in one sentence.

Hint of last resort

If mamba-minimal has drifted from the version this lab was written against: record the SHA you actually read (Block A) and cite line numbers against that commit. If the function names changed (selective_scan vs selective_scan_ref etc.), use whichever matches the current file.
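If you are unsure which names the current file uses, one option is to list every class and function with its line range using Python's standard `ast` module. This is a hypothetical helper, not part of the lab's tooling; it needs Python 3.8+ for `end_lineno`.

```python
import ast

def list_definitions(source):
    """Return (kind, qualified_name, first_line, last_line) for each class/function."""
    found = []

    def visit(node, prefix=""):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
                kind = "class" if isinstance(child, ast.ClassDef) else "def"
                name = prefix + child.name
                found.append((kind, name, child.lineno, child.end_lineno))
                visit(child, name + ".")  # descend to pick up methods
            else:
                visit(child, prefix)      # e.g. definitions nested under an `if`

    visit(ast.parse(source))
    return found

# Tiny stand-in source; in the lab you would pass the text of model.py instead.
sample = """class Block:
    def scan(self, u):
        return u
"""
for row in list_definitions(sample):
    print(row)
```

Running it on the file you cloned gives you a table of names and line ranges to cite from, whichever spelling of the scan function the pinned commit uses.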

If the discretization step (Block B point 1) is opaque: the formula is in theory/03-state-space-models.md §"Discretization". Compare the math to the code one line at a time. The code is doing exactly the math, in PyTorch ops.
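As a reminder of what to look for, the zero-order-hold discretization from the Mamba paper is written out below. This assumes theory/03 uses the same convention; its notation may differ. With a diagonal \(A\), every operation is elementwise:

\[
\bar{A}_t = \exp(\Delta_t A), \qquad
\bar{B}_t = (\Delta_t A)^{-1}\bigl(\exp(\Delta_t A) - I\bigr)\,\Delta_t B_t \;\approx\; \Delta_t B_t ,
\]

\[
h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t, \qquad y_t = C_t h_t .
\]

The approximation \(\bar{B}_t \approx \Delta_t B_t\) follows from \((e^{z} - 1)/z \to 1\) as \(z \to 0\). Minimal implementations often use this simplified form, so check which one your pinned version computes before annotating.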

When to consult `solutions/`

After committing. Solution lives in solutions/02-mamba-walkthrough-ref.md — written at phase open with the current mamba-minimal version pinned. The reference is a set of annotation picks with line ranges; Borja's picks may differ — the comparison is "what did I miss?", not "did I match exactly?".

Next lab: lab/03-speculative-survey.md.