Stanford CS25: Transformers United V6 I On the Tradeoffs of State Space Models and Transformers
Key Concepts
- State Space Models (SSMs): A class of sequence models (e.g., Mamba, Mamba 2, Gated DeltaNet) that process sequences in linear time by compressing them into a fixed-size hidden state.
- Transformers: Models built on quadratic-time attention, which caches every past token (the KV cache) and performs pairwise comparisons across them (a back-of-the-envelope memory comparison follows this list).
- Autoregressive Modeling: A paradigm where models predict the next token in a sequence based on previous tokens, serving as the primary benchmark for comparing SSMs and Transformers.
- Implicit Compression: The process by which SSMs "squish" sequence information into a fixed-size state, contrasting with the "database-like" storage of Transformers.
- Tokenization: The process of breaking raw data (text, bytes) into discrete chunks. The speaker argues that tokenization is a form of feature engineering that Transformers rely on, whereas SSMs can operate more effectively on raw, un-tokenized data.
- HNet (Hierarchical Network): A model architecture that performs end-to-end dynamic chunking of raw data, effectively learning its own "tokenization" process.
- Inductive Bias: The set of assumptions a model uses to predict outputs for inputs it has not encountered; SSMs and Transformers possess fundamentally different biases regarding data resolution and abstraction.
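To ground these definitions, here is a back-of-the-envelope comparison of inference-time memory (a minimal sketch; every dimension below is an assumed, illustrative value, not a figure from the talk):

```python
# Illustrative only: all model dimensions below are assumptions.
# A Transformer's KV cache grows linearly with context length, while an SSM
# carries a fixed-size state regardless of how long the sequence gets.

def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128, bytes_per_elem=2):
    # Keys and values for every past token, per layer and per head.
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_elem

def ssm_state_bytes(n_layers=32, d_model=4096, state_dim=16, bytes_per_elem=2):
    # One fixed (d_model x state_dim) state per layer, independent of seq_len.
    return n_layers * d_model * state_dim * bytes_per_elem

for seq_len in (1_000, 100_000, 1_000_000):
    kv_gb = kv_cache_bytes(seq_len) / 1e9
    ssm_gb = ssm_state_bytes() / 1e9
    print(f"{seq_len:>9} tokens: KV cache ~ {kv_gb:8.2f} GB, SSM state ~ {ssm_gb:.4f} GB")
```

At this assumed scale, the cache grows from roughly 0.5 GB at 1K tokens to over 500 GB at 1M tokens, while the SSM state stays near 4 MB throughout.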
1. Trade-offs: SSMs vs. Transformers
The speaker posits that the debate between SSMs and Transformers is often framed incorrectly around efficiency. While SSMs are linear and Transformers are quadratic, the core difference lies in their inductive biases and how they handle information:
- Transformers (The Database Analogy): They store a representation of every past token in a KV cache. This makes them exceptionally strong at retrieval and recall tasks where precise, fine-grained access to past data is required. However, they are "beholden" to the resolution of the tokens provided.
- SSMs (The Brain Analogy): They compress information into a fixed-size state. This makes them efficient and capable of online, real-time processing. Their weakness is a lack of fine-grained recall, similar to human memory, which struggles with exact string retrieval but excels at abstraction.
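A minimal NumPy sketch of the two readout styles (illustrative only; real selective SSMs use learned, input-dependent gates rather than the constant decay assumed here):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 16                      # model width, sequence length
x = rng.standard_normal((T, d))   # a toy sequence of token embeddings

# "Brain": SSM-style gated linear recurrence with a fixed-size state.
# h_t = a_t * h_{t-1} + (1 - a_t) * x_t ; memory stays O(d) for any T.
h = np.zeros(d)
for t in range(T):
    a_t = 0.9                     # forget gate (input-dependent in real SSMs)
    h = a_t * h + (1 - a_t) * x[t]
ssm_readout = h                   # the whole history, compressed into one vector

# "Database": attention over a KV cache that stores every past token.
keys, values = x.copy(), x.copy() # the cache grows linearly with T
q = x[-1]                         # query from the latest token
scores = keys @ q / np.sqrt(d)
weights = np.exp(scores - scores.max())
weights /= weights.sum()
attn_readout = weights @ values   # exact, token-level recall

print(ssm_readout.shape, attn_readout.shape, keys.shape)
```

The recurrence can answer "what has the sequence been about" but cannot reliably reproduce an arbitrary earlier token; the attention readout can, at the cost of storing all of `keys` and `values`.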
2. The Role of Tokenization and Data Resolution
A central argument is that Transformers are highly dependent on the quality of tokenization.
- The "Bitter Lesson": Models that learn end-to-end from raw data (without manual feature engineering like BPE) tend to scale better.
- Evidence: In experiments with byte-level language modeling and DNA sequence modeling, SSMs significantly outperform Transformers. When the data lacks semantically "clean" units (e.g., individual characters or DNA base pairs), the Transformer's quadratic attention captures meaningful patterns less effectively than the compressive, stateful approach of SSMs.
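A toy illustration of this resolution gap (not the speaker's code; the merge vocabulary is hypothetical): byte-level modeling exposes every raw symbol, while BPE-style tokenization pre-chunks the text offline.

```python
# Byte-level vs. BPE-style views of the same text. The merge vocabulary
# below is a hand-picked, hypothetical example, not a trained BPE table.

text = "state space"

# Byte-level view: the model must handle every raw symbol, spaces included.
byte_ids = list(text.encode("utf-8"))
print(byte_ids)                       # 11 fine-grained steps

# BPE-style view: an offline vocabulary pre-chunks the text into coarse tokens.
vocab = ["state", " space", " ", "s", "t", "a", "e", "p", "c"]

def greedy_tokenize(s, vocab):
    tokens, i = [], 0
    while i < len(s):
        # Take the longest vocabulary entry matching at position i.
        match = max((v for v in vocab if s.startswith(v, i)), key=len)
        tokens.append(match)
        i += len(match)
    return tokens

print(greedy_tokenize(text, vocab))   # ['state', ' space'] -- 2 coarse steps
```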
3. HNet: End-to-End Dynamic Chunking
The speaker introduces HNet, a hierarchical network designed to eliminate the need for offline tokenization.
- Methodology: HNet uses a routing mechanism to predict "boundaries" within raw data (e.g., bytes). It compresses these chunks into higher-level representations, which are then processed by a main sequence model.
- Key Finding: Even when operating on BPE tokens, incorporating SSM layers into the encoder/decoder stages improves performance compared to pure Transformer architectures. This suggests that the "compressive" nature of SSMs is beneficial for creating abstractions, regardless of the input resolution.
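A rough sketch of the dynamic-chunking idea described above; the cosine-similarity boundary score and mean-pooling used here are illustrative assumptions, not necessarily the exact HNet formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 12, 8
h = rng.standard_normal((T, d))   # byte-level representations from an encoder

# Routing: score a boundary between each pair of adjacent positions.
# (1 - cosine similarity) of projected neighbours is assumed here for
# illustration; the exact scoring function in HNet may differ.
Wq, Wk = rng.standard_normal((d, d)), rng.standard_normal((d, d))
q, k = h @ Wq, h @ Wk
cos = np.sum(q[:-1] * k[1:], axis=-1) / (
    np.linalg.norm(q[:-1], axis=-1) * np.linalg.norm(k[1:], axis=-1))
p_boundary = 0.5 * (1.0 - cos)    # in [0, 1]; high => start a new chunk here

# Dynamic chunking: cut wherever the boundary score crosses a threshold, then
# pool each chunk into one higher-level vector for the main sequence model.
cuts = [0] + [t + 1 for t, p in enumerate(p_boundary) if p > 0.5] + [T]
chunks = [h[a:b].mean(axis=0) for a, b in zip(cuts[:-1], cuts[1:]) if b > a]
print(len(chunks), "chunks from", T, "bytes")
```

In the full architecture the boundary decisions are trained end-to-end with the rest of the network, so the model learns where to chunk rather than relying on a fixed tokenizer.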
4. Hybrid Models
The industry has moved toward hybrid architectures (e.g., Jamba, Zamba, Samba, Qwen, Olmo) that interleave linear SSM layers with quadratic attention layers.
- Optimal Ratio: Empirical ablations suggest a ratio of roughly 3:1 or 4:1 (SSM layers to attention layers) is optimal.
- Perspective: This mirrors the human cognitive model where the "brain" (SSM) performs the bulk of the processing, while "external tools/databases" (Attention) are used selectively for retrieval.
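A minimal sketch of what such interleaving looks like at the roughly 3:1 ratio mentioned above (the depth and placement rule are assumptions for illustration, not any specific model's configuration):

```python
# Build a hybrid layer stack: mostly linear-time SSM layers, with a full
# attention layer inserted every (ratio + 1) layers.

def build_hybrid_stack(n_layers=24, ssm_per_attention=3):
    layers = []
    for i in range(n_layers):
        if (i + 1) % (ssm_per_attention + 1) == 0:
            layers.append("attention")   # occasional precise retrieval
        else:
            layers.append("ssm")         # bulk of the processing
    return layers

stack = build_hybrid_stack()
print(stack[:8])                          # ['ssm', 'ssm', 'ssm', 'attention', ...]
print(stack.count("ssm"), "ssm layers :", stack.count("attention"), "attention layers")
```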
5. Synthesis and Conclusion
The speaker concludes that the efficiency argument is a "red herring." The future of AI architecture design should focus on creating "better black boxes" that convert compute into intelligence more effectively.
- Actionable Insight: When designing models, one must consider if the data requires the "database" capabilities of a Transformer or the "compressive/abstractive" capabilities of an SSM.
- Future Outlook: There is significant room for improvement in architecture design, particularly in moving toward end-to-end models that learn to chunk and abstract data without relying on brittle, human-engineered tokenization.
Additional Industry Application: Vision RAG
The presentation concluded with a practical application of these concepts via MongoDB Atlas:
- Problem: Traditional RAG (Retrieval-Augmented Generation) pipelines often lose information during OCR (Optical Character Recognition) of complex documents (e.g., insurance claims with photos and tables).
- Solution (Vision RAG): By using a single multimodal encoder (e.g., Voyage Multimodal 3.5), both text and images are embedded into a unified vector space. This allows for direct visual similarity search without the need for OCR, keeping the vector data, metadata, and original source in a single, unified document store.
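A schematic sketch of the pipeline shape described above (the real system uses a hosted multimodal embedding model and MongoDB Atlas vector search, whose APIs are not reproduced here; the embedding function below is a random placeholder, so the ranking it returns is not meaningful):

```python
import numpy as np

def pseudo_embed(item, dim=128):
    # Placeholder: a real Vision RAG stack would call one multimodal encoder
    # here for both page images and text queries.
    seed = abs(hash(item)) % (2**32)
    return np.random.default_rng(seed).standard_normal(dim)

# Index each page *image* directly (no OCR), keeping source and vector together.
corpus = ["claim_form_p1.png", "damage_photo.jpg", "repair_invoice_table.png"]
index = [{"source": doc, "vector": pseudo_embed(doc)} for doc in corpus]

def search(query, index, k=2):
    q = pseudo_embed(query)
    def cosine(v):
        return float(q @ v) / (np.linalg.norm(q) * np.linalg.norm(v))
    ranked = sorted(index, key=lambda rec: cosine(rec["vector"]), reverse=True)
    return [rec["source"] for rec in ranked[:k]]

print(search("photo of the dented bumper", index))
```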