Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Albert Gu (CMU), Tri Dao (Princeton University)
TL;DR
Mamba replaces the Transformer's quadratic-time attention with a selective State Space Model (S6). By making the state transitions input-dependent, it reaches Transformer-level modeling quality while scaling linearly in sequence length, with up to 5× higher generation throughput.
Why This Matters
Modern deep learning is almost entirely built on the Transformer architecture and its core self-attention mechanism. Attention allows models to route information densely across any context window, but it comes with a fatal flaw: computational and memory requirements scale quadratically ($O(L^2)$) with the sequence length $L$.
To process long contexts (like entire books, high-resolution audio, or genomic sequences), researchers have long sought sub-quadratic alternatives. However, previous attempts—such as linear attention, gated convolutions, and classical Structured State Space Models (SSMs)—struggle on information-dense, discrete modalities like natural language.
The Core Weakness: LTI Bottleneck
Prior SSMs were restricted to be Linear Time-Invariant (LTI). Because their dynamics ($A, B, C$) are the same at every timestep regardless of the input tokens, they cannot perform content-based routing. They compress context the same way for every input, and so fail on tasks such as "Selective Copying" and "Induction Heads" that require selective memorization.
The Big Idea
Mamba introduces selectivity to State Space Models. By letting the transition parameters ($\Delta, B, C$) be dynamic functions of the input $x_t$ at each timestep, the model can actively choose what to remember and what to discard.
However, making these parameters input-dependent breaks the mathematical equivalence to global convolution, meaning we can no longer use the Fast Fourier Transform (FFT) for parallel training. To solve this, Mamba introduces a hardware-aware parallel associative scan. Instead of materializing the massive hidden state in slow GPU memory (HBM), it fuses the discretization and scan steps into fast GPU SRAM, making training even faster than traditional convolutions.
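The associativity that makes a parallel scan possible can be shown in a few lines. The following is a minimal NumPy sketch, not Mamba's fused CUDA kernel: it treats one scalar state channel with the recurrence $h_t = a_t h_{t-1} + b_t$, where `a` and `b` stand in for the per-step discretized coefficients (hypothetical names). It uses a Hillis-Steele style scan, chosen here for brevity, and says nothing about the SRAM/HBM placement described above.

```python
import numpy as np

def combine(left, right):
    """Compose two steps of h -> a*h + b, with `left` applied first."""
    a_l, b_l = left
    a_r, b_r = right
    return a_l * a_r, a_r * b_l + b_r

def scan_sequential(a, b):
    """Reference: h_t = a_t * h_{t-1} + b_t with zero initial state, one step at a time."""
    h = np.zeros_like(b)
    prev = 0.0
    for t in range(len(a)):
        prev = a[t] * prev + b[t]
        h[t] = prev
    return h

def scan_parallel(a, b):
    """Inclusive scan in ceil(log2(L)) rounds; each round is one vectorized combine."""
    a = a.copy()
    b = b.copy()
    length = len(a)
    shift = 1
    while shift < length:
        # Position t absorbs the partial result that ends at position t - shift.
        a_new, b_new = combine((a[:-shift], b[:-shift]), (a[shift:], b[shift:]))
        a[shift:], b[shift:] = a_new, b_new
        shift *= 2
    return b  # with a zero initial state, the accumulated offset is h_t
```

Because `combine` is associative, the rounds can group steps in any order, which is what lets time-varying coefficients be handled without falling back to a strictly sequential loop.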
How It Works
1. From Continuous to Discrete SSMs
A continuous-time State Space Model maps a 1D input sequence $x(t)$ to a 1D output $y(t)$ through an $N$-dimensional latent state $h(t)$:
$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t)$$
To use this in deep learning, we must discretize it using a step size parameter $\Delta$. Using the Zero-Order Hold (ZOH) rule, the continuous parameters $(A, B)$ are transformed into discrete parameters $(\bar{A}, \bar{B})$:
$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right) \cdot \Delta B$$
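As a concrete illustration of the ZOH rule, here is a small NumPy sketch. It assumes a diagonal state matrix, so the matrix exponential and inverse reduce to elementwise operations; the function name `zoh_discretize` and its argument names are hypothetical.

```python
import numpy as np

def zoh_discretize(A_diag, B, delta):
    """Zero-order hold for a diagonal state matrix (an assumption of this sketch).

    A_diag: (N,) nonzero diagonal entries of A
    B:      (N,) input vector
    delta:  positive scalar step size
    """
    dA = delta * A_diag
    A_bar = np.exp(dA)
    # (dA)^{-1} (exp(dA) - 1) * (delta * B), computed elementwise;
    # expm1 avoids cancellation when dA is close to zero.
    B_bar = np.expm1(dA) / dA * (delta * B)
    return A_bar, B_bar
```

For very small `delta`, `B_bar` approaches `delta * B`, which is the first-order (Euler) approximation of the same rule.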
This yields the discrete recurrence relation:
$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t$$
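When the discrete parameters are fixed over time (the LTI case), unrolling the recurrence gives a causal convolution with kernel $K_k = C \bar{A}^k \bar{B}$. The sketch below checks this numerically for a diagonal $\bar{A}$ and zero initial state; both are assumptions of the example, and the function names are hypothetical.

```python
import numpy as np

def ssm_recurrent(A_bar, B_bar, C, x):
    """Run h_t = A_bar * h_{t-1} + B_bar * x_t and y_t = C . h_t with zero initial state."""
    h = np.zeros_like(A_bar)
    y = np.empty(len(x))
    for t, x_t in enumerate(x):
        h = A_bar * h + B_bar * x_t
        y[t] = C @ h
    return y

def ssm_convolutional(A_bar, B_bar, C, x):
    """Same LTI model as a causal convolution with kernel K_k = C . (A_bar**k * B_bar)."""
    length = len(x)
    powers = A_bar[None, :] ** np.arange(length)[:, None]  # (L, N): A_bar**k per state
    kernel = powers @ (B_bar * C)                           # (L,)
    return np.convolve(x, kernel)[:length]                  # keep the causal part
```

The two functions agree only because `A_bar`, `B_bar`, and `C` are the same at every step; once they depend on $x_t$, no single kernel exists and only the recurrent form remains valid.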
2. The Selection Mechanism (S6)
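In standard S4, $(\Delta, B, C)$ are static parameters learned during training. In Mamba (S6), they are computed dynamically from the input $x_t$:
$$B_t = \mathrm{Linear}_N(x_t), \qquad C_t = \mathrm{Linear}_N(x_t), \qquad \Delta_t = \mathrm{softplus}\left(\mathrm{Parameter} + \mathrm{Linear}_1(x_t)\right)$$
Here $\mathrm{Linear}_d$ denotes a learned projection to dimension $d$, and the scalar output of $\mathrm{Linear}_1$ is broadcast across channels.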
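A sequential reference version of this selection mechanism might look as follows. This is a minimal sketch under stated assumptions, not the paper's optimized kernel: $A$ is diagonal per channel with nonzero entries, ZOH is used for both $\bar{A}$ and $\bar{B}$, and the weight names `W_B`, `W_C`, `w_delta`, and `delta_bias` are hypothetical.

```python
import numpy as np

def softplus(z):
    """Numerically stable log(1 + exp(z)); keeps the step size positive."""
    return np.logaddexp(0.0, z)

def selective_scan(x, A, W_B, W_C, w_delta, delta_bias):
    """x: (L, D) inputs, A: (D, N) nonzero diagonal entries. Returns y: (L, D)."""
    length, d_model = x.shape
    B = x @ W_B                                             # (L, N), depends on x_t
    C = x @ W_C                                             # (L, N), depends on x_t
    delta = softplus(delta_bias + (x @ w_delta)[:, None])   # (L, D), scalar projection broadcast
    h = np.zeros_like(A)                                    # (D, N) state per channel
    y = np.empty((length, d_model))
    for t in range(length):
        d_t = delta[t][:, None]                             # (D, 1)
        A_bar = np.exp(d_t * A)                             # ZOH, elementwise for diagonal A
        B_bar = (A_bar - 1.0) / A * B[t]                    # ZOH for B, shape (D, N)
        h = A_bar * h + B_bar * x[t][:, None]
        y[t] = h @ C[t]
    return y
```

Unlike the LTI case, `A_bar` and `B_bar` are recomputed at every step from the current token, so the loop cannot be replaced by a fixed convolution kernel.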
Under this formulation, the model acts as a gated recurrent network. For instance, when $\Delta_t$ is large, the model resets its state $h_t$ to focus on the current input $x_t$. When $\Delta_t$ is small, it ignores $x_t$ and preserves the historical state.
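The gating behaviour can be made explicit in a simple special case. As an illustrative assumption, take $N = 1$, $A = -1$, $B = 1$, and $\Delta_t = \mathrm{softplus}(z_t)$, where $z_t$ is a linear function of $x_t$. Substituting into the ZOH rule, with $\sigma$ denoting the logistic sigmoid, gives:

```latex
\begin{aligned}
\bar{A}_t &= \exp(\Delta_t A) = \exp(-\Delta_t) = \frac{1}{1 + e^{z_t}} = 1 - \sigma(z_t), \\
\bar{B}_t &= (\Delta_t A)^{-1}\left(\exp(\Delta_t A) - 1\right)\Delta_t B = 1 - \exp(-\Delta_t) = \sigma(z_t), \\
h_t &= \bar{A}_t\,h_{t-1} + \bar{B}_t\,x_t = (1 - g_t)\,h_{t-1} + g_t\,x_t, \qquad g_t = \sigma(z_t).
\end{aligned}
```

A large $\Delta_t$ pushes $g_t$ toward 1, so the state is overwritten by $x_t$; a small $\Delta_t$ pushes $g_t$ toward 0, so the previous state passes through almost unchanged.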
Interactive Demos
The Selection Mechanism (S6) Simulator
[Interactive demo: shows how Mamba adjusts its step size $\Delta_t$ and gate $g_t$ per token type to filter out noise and remember key context. Panels: Dynamic Step Size ($\Delta_t$), Latent State Memory ($h_t$).]
LTI (S4) vs. Selective (S6) Processing
[Interactive demo: contrasts the static processing of traditional linear time-invariant models with Mamba's input-dependent routing.]
Hardware-Aware Memory IO Calculator
[Interactive demo: varies the sequence length $L$ and state dimension $N$ to show how Mamba's fused kernel avoids the memory bottleneck of standard implementations. Panel: Memory IO Footprint (HBM Read/Write).]
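A back-of-envelope version of this comparison can be written directly. The sketch below counts array elements moved to and from HBM rather than bytes or time, and the accounting is a simplifying assumption: the unfused path writes and re-reads two expanded tensors of shape (batch, length, d_model, n_state), while the fused path touches only the inputs, the SSM parameters, and the output. The function names are hypothetical and the numbers are not measurements.

```python
def hbm_elements_naive(batch, length, d_model, n_state):
    """Rough HBM traffic (in elements) when the expanded tensors are materialized."""
    # Assumption: the discretized A_bar and B_bar * x, each of shape
    # (batch, length, d_model, n_state), are written once and read once.
    expanded = batch * length * d_model * n_state
    # Plus reading x and writing y, each of shape (batch, length, d_model).
    return 2 * 2 * expanded + 2 * batch * length * d_model

def hbm_elements_fused(batch, length, d_model, n_state):
    """Rough HBM traffic when discretization and scan both stay in SRAM."""
    # Read x and delta (batch, length, d_model), B and C (batch, length, n_state),
    # and A (d_model, n_state); write y (batch, length, d_model).
    return (3 * batch * length * d_model
            + 2 * batch * length * n_state
            + d_model * n_state)

def io_reduction(batch, length, d_model, n_state):
    """Ratio of the two estimates; grows roughly in proportion to n_state."""
    return (hbm_elements_naive(batch, length, d_model, n_state)
            / hbm_elements_fused(batch, length, d_model, n_state))
```

Under these assumptions the ratio is on the order of the state dimension $N$, because the fused path never writes anything with an $N$-sized axis per token and channel.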
Empirical Results
Mamba was evaluated across language modeling, genomics, and audio. The authors report that it is the first linear-time sequence backbone to match or exceed optimized Transformers.
Language Modeling Perplexity (Lower is Better)
Pile Dataset, 300B Tokens
Mamba models consistently outperform Transformers of comparable size and often match the quality of Transformers twice their size.
Limitations & Open Questions
While Mamba represents a massive leap forward for linear sequence models, the authors note several challenges and areas for future work:
- Continuous-Discrete Spectrum: While input-dependent selectivity drastically improves performance on discrete data (like text and DNA), it can occasionally degrade performance on highly continuous physical signals (like raw audio waveforms) where LTI architectures naturally excel.
- Ecosystem Integration: Transformers have a mature ecosystem of hardware optimizations, quantization recipes, and specialized fine-tuning techniques (LoRA, etc.). Porting these robustly to state-space architectures remains an active research direction.
- Scaling Limits: Empirical evaluations were limited to models of up to about 3B parameters. Whether Mamba's advantages hold at much larger scales (70B+ parameters) remains an open question.
Glossary
LTI (Linear Time-Invariance)
A system whose underlying physical properties do not change over time. In deep learning, LTI sequence models use static weights at every step, which enables fast parallel training via convolutions but prevents dynamic, content-aware filtering.
SSM (State Space Model)
A mathematical framework that models a system's behavior using hidden internal states. In deep learning, structured SSMs are used to map 1D sequences efficiently by combining properties of RNNs (recurrent step) and CNNs (parallelizable training).
Parallel Associative Scan
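An algorithmic primitive that computes prefix sums, and more generally any running combination under an associative operator, in parallel. Because composing the steps of a linear state-space recurrence is an associative operation, all $L$ states can be computed in $O(\log L)$ parallel depth on a GPU instead of $O(L)$ sequential steps.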
Citation
@article{gu2023mamba,
title={Mamba: Linear-Time Sequence Modeling with Selective State Spaces},
author={Gu, Albert and Dao, Tri},
journal={arXiv preprint arXiv:2312.00752},
year={2023}
}