ArXiv 2024 Sequence Modeling

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Albert Gu¹, Tri Dao² • ¹CMU, ²Princeton University

TL;DR

Mamba replaces the Transformer's quadratic-time attention with a Selective State Space Model (S6). By making the SSM parameters input-dependent, it reaches Transformer-level quality on language modeling while scaling linearly in sequence length, with up to 5× higher generation throughput than Transformers.

Linear Scaling O(L)

Hardware-Aware Scan

No KV-Cache Bottleneck

Why This Matters

Modern deep learning is built largely on the Transformer architecture and its core self-attention mechanism. Attention lets a model route information densely across its whole context window, but this comes at a steep cost: compute and memory requirements scale quadratically ($O(L^2)$) with the sequence length $L$.

To process long contexts (like entire books, high-resolution audio, or genomic sequences), researchers have long sought sub-quadratic alternatives. However, previous attempts—such as linear attention, gated convolutions, and classical Structured State Space Models (SSMs)—struggle on information-dense, discrete modalities like natural language.

The Core Weakness: LTI Bottleneck

Prior SSMs were restricted to be Linear Time-Invariant (LTI). Because their dynamics ($A, B, C$) did not change based on the input tokens, they could not perform content-based routing. They compressed context statically, failing tasks like "Selective Copying" or "Induction Heads" that require selective memorization.

The Big Idea

Mamba introduces selectivity to State Space Models. By letting the transition parameters ($\Delta, B, C$) be dynamic functions of the input $x_t$ at each timestep, the model can actively choose what to remember and what to discard.

However, making these parameters input-dependent breaks the mathematical equivalence to a global convolution, so the model can no longer be trained in parallel with an FFT-based convolution. To solve this, Mamba introduces a hardware-aware parallel associative scan. Instead of materializing the expanded hidden state in slow GPU memory (HBM), it fuses discretization and the scan into a single kernel that keeps the state in fast on-chip SRAM. The authors report that this implementation is faster than earlier convolution-based SSM training.
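The scan works because composing two steps of a linear recurrence is itself associative. The sketch below is a minimal pure-Python illustration for a scalar state, not the fused GPU kernel: `combine`, `sequential_scan`, and `doubling_scan` are illustrative names, and the recursive-doubling scheme is chosen for brevity (it does more total work than the work-efficient scans used in practice).

```python
def combine(left, right):
    """Compose two affine steps h -> a*h + b; `left` is applied first."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def sequential_scan(steps):
    """Reference: run h_t = a_t * h_{t-1} + b_t from h_0 = 0, one step at a time."""
    h, out = 0.0, []
    for a, b in steps:
        h = a * h + b
        out.append(h)
    return out


def doubling_scan(steps):
    """Inclusive scan by recursive doubling.

    Every position in a pass is independent of the others, so a pass could
    run in parallel; only ceil(log2(L)) passes are needed. Here each pass is
    simulated with a list comprehension.
    """
    acc = list(steps)
    offset = 1
    while offset < len(acc):
        acc = [
            combine(acc[i - offset], acc[i]) if i >= offset else acc[i]
            for i in range(len(acc))
        ]
        offset *= 2
    # With h_0 = 0, the state equals the offset term of each composed step.
    return [b for _, b in acc]


steps = [(0.5, 1.0), (0.9, 0.0), (0.1, 2.0), (0.8, -1.0), (0.7, 0.5)]
reference = sequential_scan(steps)
parallel = doubling_scan(steps)
```

Here each pair `(a, b)` stands for $(\bar{A}_t, \bar{B}_t x_t)$ at one timestep. Because the pairs may differ at every position, the same scan covers the input-dependent case where a single convolution kernel no longer exists.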

How It Works

1. From Continuous to Discrete SSMs

A continuous-time State Space Model maps a 1D input sequence $x(t)$ to a 1D output $y(t)$ through an $N$-dimensional latent state $h(t)$:

$$h'(t) = Ah(t) + Bx(t)$$
$$y(t) = Ch(t)$$

To use this in deep learning, we must discretize it using a step size parameter $\Delta$. Using the Zero-Order Hold (ZOH) rule, the continuous parameters $(A, B)$ are transformed into discrete parameters $(\bar{A}, \bar{B})$:

$$\bar{A} = \exp(\Delta A)$$
$$\bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I) \cdot \Delta B$$
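As a concrete illustration of the ZOH rule, the sketch below applies it elementwise, assuming $A$ is diagonal so that the matrix exponential and inverse reduce to scalar operations. The function name `zoh_discretize` and the example values are hypothetical.

```python
import math


def zoh_discretize(delta, a_diag, b_vec):
    """Zero-Order Hold for a diagonal A, applied elementwise.

    Returns (A_bar, B_bar) as lists with one entry per state dimension.
    """
    a_bar, b_bar = [], []
    for a, b in zip(a_diag, b_vec):
        da = delta * a
        a_bar.append(math.exp(da))
        if da == 0.0:
            # Limit of (exp(da) - 1) / da as da -> 0 is 1, so B_bar -> delta * b.
            b_bar.append(delta * b)
        else:
            # (delta*A)^{-1} (exp(delta*A) - 1) * delta*B; expm1 keeps precision for small da.
            b_bar.append(math.expm1(da) / da * delta * b)
    return a_bar, b_bar


a_bar, b_bar = zoh_discretize(delta=0.1, a_diag=[-1.0, -2.0], b_vec=[1.0, 1.0])
```

For small $\Delta$ the result approaches $\bar{A} \approx 1 + \Delta A$ and $\bar{B} \approx \Delta B$, which is why $\Delta$ behaves like a step size.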

This yields the discrete recurrence relation:

$$h_t = \bar{A}h_{t-1} + \bar{B}x_t$$
$$y_t = Ch_t$$
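The recurrence can be stepped through directly. This is a minimal sketch assuming a diagonal $\bar{A}$ stored as a list and a zero initial state; `ssm_recurrence` is an illustrative name, not an API from the paper's code.

```python
def ssm_recurrence(a_bar, b_bar, c, xs):
    """Run h_t = A_bar * h_{t-1} + B_bar * x_t and y_t = C . h_t.

    a_bar, b_bar, c are lists of length N (diagonal A_bar); xs is a list of scalars.
    """
    h = [0.0] * len(a_bar)  # the state has fixed size N regardless of len(xs)
    ys = []
    for x in xs:
        h = [a * hi + b * x for a, b, hi in zip(a_bar, b_bar, h)]
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))
    return ys


# Feeding a single impulse shows how the state decays step by step.
ys = ssm_recurrence(a_bar=[0.5], b_bar=[1.0], c=[2.0], xs=[1.0, 0.0, 0.0, 0.0])
# ys == [2.0, 1.0, 0.5, 0.25]
```

Each step touches only the $N$ state entries, so generation cost per token does not grow with the length of the history.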

2. The Selection Mechanism (S6)

In standard S4, $(\Delta, B, C)$ are static parameters learned during training. In Mamba (S6), they are computed dynamically from the input $x_t$:

$$B_t = \text{Linear}_N(x_t)$$
$$C_t = \text{Linear}_N(x_t)$$
$$\Delta_t = \text{softplus}(\text{Parameter} + \text{Linear}_1(x_t))$$

Under this formulation, the model acts as a gated recurrent network. For instance, when $\Delta_t$ is large, the model resets its state $h_t$ to focus on the current input $x_t$. When $\Delta_t$ is small, it ignores $x_t$ and preserves the historical state.
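The gating behavior can be checked numerically in the simplest special case. The sketch below assumes $N = 1$, $A = -1$, and $B = 1$, and writes $z$ for the pre-activation $\text{Parameter} + \text{Linear}_1(x_t)$; under those assumptions the ZOH update reduces to $h_t = (1 - \sigma(z))\,h_{t-1} + \sigma(z)\,x_t$. The function names are illustrative.

```python
import math


def softplus(z):
    return math.log1p(math.exp(z))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def selective_step(h_prev, x, z):
    """One selective update for N = 1, A = -1, B = 1 (assumed special case)."""
    delta = softplus(z)
    a_bar = math.exp(-delta)        # exp(delta * A) with A = -1
    b_bar = 1.0 - math.exp(-delta)  # ZOH formula evaluated at A = -1, B = 1
    return a_bar * h_prev + b_bar * x


h_reset = selective_step(h_prev=3.0, x=7.0, z=20.0)   # large Delta: close to x
h_keep = selective_step(h_prev=3.0, x=7.0, z=-20.0)   # small Delta: close to h_prev
```

In this setting $\Delta_t$ plays the role of a learned gate: the larger it is, the more of the old state is overwritten by the current input.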

Interactive Demos

Interactive Demo 1

The Selection Mechanism (S6) Simulator

Toggle token types to see how Mamba dynamically adjusts its step size $\Delta_t$ and gate $g_t$ to filter out noise and remember key context.

Controls: selectivity strength (shown at 1.5) and base memory decay (shown at 0.8). Outputs: dynamic step size ($\Delta_t$) and latent state memory ($h_t$).

Interactive Demo 2

LTI (S4) vs. Selective (S6) Processing

Contrast how traditional linear time-invariant models process data statically compared to Mamba's input-dependent routing.

Panels: Input Stream, Latent Compression, Output. Example stream: Key Info, Noise, Key Info.

S6 selectively routes inputs based on content, ignoring the "noise" token.

Interactive Demo 3

Hardware-Aware Memory IO Calculator

Adjust the sequence length $L$ and state dimension $N$ to see how Mamba's fused kernel avoids the memory bottleneck of standard implementations.

Example setting: sequence length $L$ = 32,768, state dimension $N$ = 16, batch size $B$ = 8.

Memory IO Footprint (HBM Read/Write)

Naive Implementation: 12.5 GB

Mamba Fused Kernel: 0.8 GB

This is roughly a 15.6× reduction in memory IO. Mamba loads the SSM parameters directly from slow HBM into fast SRAM and runs discretization and the scan on-chip.
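The source of the saving can be estimated by counting tensor elements that cross the HBM boundary. The sketch below is a rough model under stated assumptions, not the demo's formula: it assumes a channel dimension `d_model` (the demo does not state one, so 1024 is hypothetical), counts only the materialized $\bar{A}$ and $\bar{B}$ tensors for the naive path, and counts only inputs and outputs for the fused path. It is not expected to reproduce the GB figures above.

```python
def naive_hbm_elements(batch, length, d_model, n_state):
    """Elements written when A_bar and B_bar, each of shape
    (batch, length, d_model, n_state), are materialized in HBM."""
    return 2 * batch * length * d_model * n_state


def fused_hbm_elements(batch, length, d_model, n_state):
    """Elements moved when only kernel inputs and outputs touch HBM."""
    x_delta_y = 3 * batch * length * d_model  # read x and Delta, write y
    b_and_c = 2 * batch * length * n_state    # read input-dependent B and C
    a = d_model * n_state                     # read A once
    return x_delta_y + b_and_c + a


# d_model=1024 is an assumed value for illustration only.
naive = naive_hbm_elements(batch=8, length=32768, d_model=1024, n_state=16)
fused = fused_hbm_elements(batch=8, length=32768, d_model=1024, n_state=16)
ratio = naive / fused
```

In this model the ratio grows with the state dimension $N$, since the naive path stores an extra factor of $N$ per token and channel while the fused path keeps that expansion on-chip.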

Empirical Results

Mamba was evaluated across language modeling, genomics, and audio. The results establish S6 as the first linear-time sequence backbone to match or exceed optimized Transformers.

Zero-Shot LAMBADA Perplexity (Lower is Better)

Models trained on the Pile, 300B tokens

Pythia-160M (Transformer) 38.10 PPL

Mamba-130M (Selective SSM) 16.07 PPL

Pythia-1B (Transformer) 7.92 PPL

Mamba-790M (Selective SSM) 6.02 PPL

Mamba models consistently outperform Transformers of comparable size and often match the quality of Transformers twice their size.

Limitations & Open Questions

While Mamba represents a massive leap forward for linear sequence models, the authors note several challenges and areas for future work:

Continuous-Discrete Spectrum: While input-dependent selectivity drastically improves performance on discrete data (like text and DNA), it can occasionally degrade performance on highly continuous physical signals (like raw audio waveforms) where LTI architectures naturally excel.
Ecosystem Integration: Transformers have a mature ecosystem of hardware optimizations, quantization recipes, and specialized fine-tuning techniques (LoRA, etc.). Porting these robustly to state-space architectures remains an active research direction.
Scaling Limits: Empirical evaluations were limited to models of up to roughly 3B parameters. Whether Mamba remains competitive with Transformers at much larger scales (70B+ parameters) is still an open question.

Glossary

LTI (Linear Time-Invariance)

A linear system whose parameters do not change over time. In deep learning, LTI sequence models apply the same fixed dynamics at every step, which enables fast parallel training via convolutions but prevents dynamic, content-aware filtering.
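The link between time-invariance and convolution can be made concrete. The sketch below assumes a scalar state and fixed $(\bar{A}, \bar{B}, C)$, and checks that the step-by-step recurrence and a causal convolution with kernel $K_k = C\bar{A}^k\bar{B}$ give the same outputs. All function names are illustrative.

```python
def lti_kernel(a_bar, b_bar, c, length):
    """Kernel K_k = C * A_bar**k * B_bar for a scalar-state LTI SSM."""
    return [c * (a_bar ** k) * b_bar for k in range(length)]


def causal_conv(kernel, xs):
    """y_t = sum over k = 0..t of K_k * x_{t-k}."""
    return [
        sum(kernel[k] * xs[t - k] for k in range(t + 1))
        for t in range(len(xs))
    ]


def lti_recurrence(a_bar, b_bar, c, xs):
    """Same model evaluated one step at a time from h_0 = 0."""
    h, ys = 0.0, []
    for x in xs:
        h = a_bar * h + b_bar * x
        ys.append(c * h)
    return ys


xs = [1.0, -2.0, 0.5, 3.0]
y_conv = causal_conv(lti_kernel(0.5, 1.0, 2.0, len(xs)), xs)
y_rec = lti_recurrence(0.5, 1.0, 2.0, xs)
```

The kernel can be precomputed only because $\bar{A}$ and $\bar{B}$ are the same at every position. Once they depend on the input token, no single kernel describes the model.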

SSM (State Space Model)

A mathematical framework that models a system's behavior using hidden internal states. In deep learning, structured SSMs are used to map 1D sequences efficiently by combining properties of RNNs (recurrent step) and CNNs (parallelizable training).

Parallel Associative Scan

An algorithmic primitive that parallelizes prefix computations such as prefix sums. Because composing the steps of a linear recurrence $h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t$ is an associative operation, all states can be computed in $O(\log L)$ parallel steps on a GPU instead of $O(L)$ sequential steps.

Citation

@article{gu2023mamba,
  title={Mamba: Linear-Time Sequence Modeling with Selective State Spaces},
  author={Gu, Albert and Dao, Tri},
  journal={arXiv preprint arXiv:2312.00752},
  year={2023}
}