Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 4: Attention Alternatives

Key Concepts

Linear Time Attention: Architectural modifications to attention mechanisms to achieve linear (rather than quadratic) time complexity relative to sequence length.
Mixture of Experts (MoE): A sparse model architecture where only a subset of parameters (experts) is activated per token, allowing for high parameter counts with lower computational (FLOPs) costs.
Associativity of Matrix Multiplication: The mathematical property that lets attention be reordered as $Q(K^T V)$ instead of $(QK^T)V$ once the softmax is removed, reducing the cost from quadratic to linear in sequence length.
State Space Models (SSMs): Models like Mamba-2 that use recurrent formulations to maintain a fixed-size state, enabling efficient inference.
Gated DeltaNet: A state space model variant using dual gating (input-dependent gates) to control information flow and state updates.
Top-K Routing: A mechanism where a router selects the $K$ most relevant experts for a given token.
Expert/Load Balancing: Heuristic losses added during training to prevent "expert collapse" (where only a few experts are used) and ensure even distribution across devices.
Multi-Head Latent Attention (MLA): A technique to compress the KV cache by storing latent vectors ($C$) instead of full $K$ and $V$ matrices.
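The MLA idea above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not DeepSeek's implementation: the weight names (`W_down`, `W_up_k`, `W_up_v`), the sizes, and the random initialization are all hypothetical, and details of real MLA such as the handling of positional encodings are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, n_heads, d_head = 64, 8, 4, 16  # illustrative sizes only
seq_len = 10

# Hypothetical projection weights (random here; learned in a real model).
W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)
W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)

x = rng.standard_normal((seq_len, d_model))  # token representations

# Only the low-dimensional latent C is stored in the cache.
c_cache = x @ W_down  # shape (seq_len, d_latent)

# Keys and values are reconstructed from the latent when attention needs them.
k = (c_cache @ W_up_k).reshape(seq_len, n_heads, d_head)
v = (c_cache @ W_up_v).reshape(seq_len, n_heads, d_head)

# Cache size in floats: full K and V versus the latent alone.
full_cache_floats = 2 * seq_len * n_heads * d_head
latent_cache_floats = seq_len * d_latent
print(full_cache_floats, latent_cache_floats)
```

With these made-up sizes the latent cache is 16 times smaller than storing $K$ and $V$ directly; the actual ratio depends on the chosen latent width.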

1. Advanced Attention Architectures

The lecture addresses the "quadratic bottleneck" of standard Transformer attention, which becomes prohibitive as context lengths reach millions of tokens.

Linear Attention: By dropping the softmax and reordering the matrix multiplications, attention can be computed in time linear in sequence length. This allows for a dual formulation: a parallel dense matrix multiply for training and a serial RNN-like form for efficient inference.
Hybrid Architectures: Current state-of-the-art models (e.g., MiniMax-M1, Nemotron 3, Qwen 3.5) use hybrid designs, interleaving linear/RNN-based layers with full quadratic softmax attention layers to balance quality and cost.
Sparse Attention (DSA): DeepSeek’s approach uses a lightweight "indexer" to select a subset of tokens for full attention. While the indexer itself is quadratic, it operates on lower-dimensional projections, significantly reducing the constant factors of the overall computation.
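The dual formulation of linear attention described above can be checked numerically. The following is a minimal sketch, assuming unnormalized attention with no softmax and no feature map; the variable names and tiny sizes are illustrative choices, not taken from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4  # sequence length and head dimension (illustrative)
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

# Non-causal case: associativity gives the same result at different cost.
out_quadratic = (Q @ K.T) @ V  # builds an n x n matrix: O(n^2 d)
out_linear = Q @ (K.T @ V)     # builds a d x d matrix:  O(n d^2)

# Causal case, parallel form (training): mask out future positions.
mask = np.tril(np.ones((n, n)))
out_parallel = ((Q @ K.T) * mask) @ V

# Causal case, recurrent form (inference): carry a fixed-size d x d state.
S = np.zeros((d, d))
out_recurrent = np.zeros((n, d))
for t in range(n):
    S = S + np.outer(K[t], V[t])  # S_t = S_{t-1} + k_t v_t^T
    out_recurrent[t] = Q[t] @ S   # o_t = q_t S_t

print(np.allclose(out_quadratic, out_linear))
print(np.allclose(out_parallel, out_recurrent))
```

The recurrent state `S` has a fixed size regardless of how many tokens have been processed, which is what makes the serial form attractive for long-context inference.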

2. Mixture of Experts (MoE)

MoEs decouple the number of parameters from the compute cost per forward pass.

Core Mechanism: The MLP (Feed-Forward Network) is replaced by multiple smaller FFNs ("experts"). A router determines which experts process a specific token.
Benefits: A higher parameter count improves model quality without a matching increase in per-token FLOPs, since only the selected experts run for each token. MoEs also enable "Expert Parallelism," in which different experts reside on different hardware devices.
Shared Experts: A design pioneered by DeepSeek where specific experts are always active for every token, while others are conditionally routed. This allows the model to handle common processing tasks efficiently while reserving specialized experts for complex inputs.
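The routing mechanism above (a router, top-$K$ selection, and an always-on shared expert) can be sketched as follows. This is a minimal illustration under assumptions: the ReLU FFN, the use of raw softmax probabilities as gate weights, the sizes, and all names (`moe_layer`, `W_router`, and so on) are hypothetical choices rather than any particular model's design.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_ff = 5, 8, 16  # illustrative sizes
n_experts, top_k = 4, 2

def make_ffn():
    # One small two-layer FFN; weights are random stand-ins for learned ones.
    W1 = rng.standard_normal((d_model, d_ff)) / np.sqrt(d_model)
    W2 = rng.standard_normal((d_ff, d_model)) / np.sqrt(d_ff)
    return W1, W2

def ffn(x, params):
    W1, W2 = params
    return np.maximum(x @ W1, 0.0) @ W2  # ReLU FFN (an assumption)

routed_experts = [make_ffn() for _ in range(n_experts)]
shared_expert = make_ffn()
W_router = rng.standard_normal((d_model, n_experts)) / np.sqrt(d_model)

def moe_layer(x):
    # Router: softmax over expert logits for each token.
    logits = x @ W_router
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    # Top-K routing: indices of the K highest-probability experts per token.
    top_idx = np.argsort(-probs, axis=-1)[:, :top_k]
    # The shared expert processes every token unconditionally.
    out = ffn(x, shared_expert)
    # Each routed expert processes only the tokens assigned to it.
    for e in range(n_experts):
        token_ids, _ = np.nonzero(top_idx == e)
        if token_ids.size == 0:
            continue
        gate = probs[token_ids, e][:, None]
        out[token_ids] += gate * ffn(x[token_ids], routed_experts[e])
    return out, top_idx

x = rng.standard_normal((n_tokens, d_model))
y, chosen = moe_layer(x)
print(y.shape, chosen.shape)
```

Each token touches only `top_k` routed experts plus the shared one, so per-token compute stays roughly constant as `n_experts` grows, while the total parameter count grows with it.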

3. Training Methodologies & Heuristics

Training sparse models is non-trivial due to the non-differentiable nature of routing decisions.

Load Balancing Loss: To prevent "expert starvation" (where the model ignores most experts), researchers add an auxiliary loss that penalizes popular experts and encourages uniform distribution of tokens.
Stochasticity: Early methods used noise injection to encourage exploration during routing, though modern practices rely more heavily on balancing heuristics.
Upcycling: A technique (now less common) where a pre-trained dense model is converted into an MoE by duplicating its MLP layers and initializing a router, allowing for a "free" transition to a larger sparse model.
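One common form of the load balancing loss mentioned above multiplies, for each expert, the fraction of routed token slots it receives by its mean router probability. The sketch below assumes that form (in the style of the Switch Transformer auxiliary loss); the function name and toy inputs are hypothetical, and the lecture may have presented a different variant.

```python
import numpy as np

def load_balancing_loss(router_probs, top_idx):
    """Auxiliary loss: n_experts * sum_i f_i * P_i (an assumed, common form).

    router_probs: (n_tokens, n_experts) softmax outputs of the router.
    top_idx:      (n_tokens, top_k) integer indices of the selected experts.
    """
    n_tokens, n_experts = router_probs.shape
    counts = np.bincount(top_idx.ravel(), minlength=n_experts)
    f = counts / top_idx.size     # fraction of routed slots per expert
    P = router_probs.mean(axis=0)  # mean router probability per expert
    return n_experts * float(np.sum(f * P))

# Perfectly balanced routing: every expert gets an equal share.
balanced_probs = np.full((8, 4), 0.25)
balanced_idx = np.arange(8).reshape(8, 1) % 4

# Collapsed routing: every token goes to expert 0.
collapsed_probs = np.tile([0.97, 0.01, 0.01, 0.01], (8, 1))
collapsed_idx = np.zeros((8, 1), dtype=int)

loss_balanced = load_balancing_loss(balanced_probs, balanced_idx)
loss_collapsed = load_balancing_loss(collapsed_probs, collapsed_idx)
print(loss_balanced, loss_collapsed)
```

The loss is smallest when tokens are spread uniformly and grows as routing concentrates on a few experts. In a real training setup the token fractions are not differentiable, so the gradient would flow through the router probabilities; that detail is outside this NumPy sketch.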

4. Systems Engineering & Optimization

The lecture emphasizes that "constant factors" and hardware-aware design are as critical as theoretical complexity.

FlashAttention: A systems-level optimization that computes attention in tiles to minimize data movement between GPU memory levels, providing significant speedups without changing the underlying math (the output is exact, not an approximation).
Communication Bottlenecks: In MoE training, shipping activations between devices creates communication overhead. Techniques like down-projecting the residual stream before communication help mitigate this.
Dropless Architectures: Modern frameworks (e.g., MegaBlocks) have eliminated the need to silently drop tokens when an expert's capacity is exceeded, improving training stability.
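The claim that attention can be rearranged without changing the math can be illustrated with a blockwise "online softmax" computation, the algorithmic idea commonly associated with FlashAttention. The sketch below is a NumPy illustration under assumptions: it shows only the numerics, not the GPU memory behavior that produces the actual speedup, and the function names and block size are hypothetical.

```python
import numpy as np

def naive_attention(Q, K, V):
    # Reference: materializes the full n x n score matrix.
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ V

def blockwise_attention(Q, K, V, block=4):
    # Processes keys/values one block at a time, keeping running statistics
    # so the full score matrix is never stored.
    n, d = Q.shape
    m = np.full((n, 1), -np.inf)      # running row-wise max of the scores
    l = np.zeros((n, 1))              # running softmax denominator
    acc = np.zeros((n, V.shape[-1]))  # running unnormalized output
    for start in range(0, K.shape[0], block):
        Kb = K[start:start + block]
        Vb = V[start:start + block]
        s = Q @ Kb.T / np.sqrt(d)
        m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
        scale = np.exp(m - m_new)     # rescales earlier partial results
        p = np.exp(s - m_new)
        l = l * scale + p.sum(axis=-1, keepdims=True)
        acc = acc * scale + p @ Vb
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
Q = rng.standard_normal((10, 8))
K = rng.standard_normal((10, 8))
V = rng.standard_normal((10, 8))
print(np.allclose(naive_attention(Q, K, V), blockwise_attention(Q, K, V)))
```

Both functions produce the same output up to floating-point error; the blockwise version simply never holds more than one block of scores at a time, which is the property a fused GPU kernel exploits.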

5. Notable Quotes & Perspectives

On the "Bitter Lesson" of Scaling: "It seems to be the case that for whatever reason, if you keep the total compute the same, but you just increase the number of sparse parameters... the models are generally getting better."
On Architecture Evolution: "I don't know if I have good predictions for what the future attention architecture looks like... I think a lot of this will look like we throw all of these tricks in."
On the Role of Heuristics: "It is not a thing where you would derive it from first principles... it's a collection of really interesting heuristics that end up working."

Synthesis/Conclusion

The evolution of modern language models is moving toward hybridization and sparsity. By combining linear-time attention (for long-context efficiency) with MoE architectures (for parameter scaling), researchers are building models that are both computationally efficient and highly expressive. The field has shifted from purely theoretical designs to a "systems-first" approach, where auxiliary losses, hardware-aware routing, and latent attention mechanisms are essential to maintaining stability and performance at scale.