Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 3: Architectures

By Stanford Online

Key Concepts

  • Architecture Design: The structural choices in transformer models (e.g., normalization, activation functions, position embeddings).
  • Pre-norm vs. Post-norm: The placement of Layer Normalization relative to the residual stream.
  • RMS Norm: A computationally efficient normalization technique that omits mean subtraction and bias.
  • Gated Linear Units (GLU): Activation functions (e.g., SwiGLU, GeGLU) that use gating mechanisms to improve performance.
  • RoPE (Rotary Positional Embeddings): A method for encoding relative position via rotation of vector pairs.
  • GQA (Grouped Query Attention): An attention variant that balances inference efficiency and expressive power.
  • KV Cache: A mechanism that stores past keys and values during autoregressive generation, trading memory for speed by avoiding recomputation at every step.
  • Stability Interventions: Techniques like Z-loss, QK-norm, and Logit Soft-capping to prevent training divergence.

1. Core Architecture Variations

The lecture emphasizes that while the original Transformer (Vaswani et al.) established the foundation, modern models have converged on specific modifications to improve stability and efficiency.

  • Layer Normalization Placement: Modern models almost universally use Pre-norm, which places the normalization inside each residual branch (before the sublayer) rather than on the residual stream itself. This keeps the residual stream "clean," facilitating better gradient propagation and stability.
  • Normalization Type: RMS Norm has replaced standard Layer Norm in most architectures. It matches Layer Norm's performance while running faster, since it drops mean subtraction and the bias term: operations that cost memory bandwidth while offering little expressive gain.
  • Activation Functions: The industry has shifted toward Gated Linear Units (GLUs), specifically SwiGLU, which provide consistent performance gains. Because the gate adds a third weight matrix, a common heuristic is to scale the feed-forward hidden dimension to 2/3 of its usual value to keep the parameter count constant. A minimal pre-norm block combining RMS Norm and SwiGLU is sketched after this list.
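
The following is a minimal PyTorch sketch of these three choices together. Class and variable names are illustrative, not taken from the lecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """RMS Norm: rescale by the root-mean-square; no mean subtraction, no bias."""
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.gain = nn.Parameter(torch.ones(d_model))

    def forward(self, x):
        # x / sqrt(mean(x^2) + eps), then a learned per-dimension gain
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.gain

class SwiGLU(nn.Module):
    """Gated feed-forward: W_down(SiLU(W_gate x) * W_up x), biases dropped."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class PreNormFFNBlock(nn.Module):
    """Pre-norm: normalize on the branch; the residual stream itself is untouched."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.norm = RMSNorm(d_model)
        self.ffn = SwiGLU(d_model, d_ff)

    def forward(self, x):
        return x + self.ffn(self.norm(x))  # post-norm would instead be norm(x + ffn(x))
```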

2. Position Dependence

Because standard attention is position-agnostic, models require explicit encoding:

  • RoPE (Rotary Positional Embeddings): The current standard. It encodes relative position by rotating pairs of dimensions in query and key vectors by position-dependent angles. As a result, the query-key inner product depends only on the offset between two positions, not on their absolute positions, which is exactly the property relative position modeling requires.
  • Implementation: RoPE applies sine/cosine rotations to queries and keys, rotating the vectors in place without introducing the "cross-terms" that arise when absolute position embeddings are added to the input (sketched below).
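
A minimal sketch of the rotation, assuming the interleaved (even/odd) pairing of dimensions; production implementations typically precompute the cos/sin tables:

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate each (even, odd) dimension pair of x by a position-dependent angle.

    x: (..., seq_len, d) with d even. Applying the same rotation to queries and
    keys makes the dot product q_i . k_j depend only on the offset i - j.
    """
    seq_len, d = x.shape[-2], x.shape[-1]
    # One frequency per dimension pair, as in sinusoidal embeddings
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=x.dtype) / d)
    angles = torch.arange(seq_len, dtype=x.dtype)[:, None] * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()   # each of shape (seq_len, d/2)
    x1, x2 = x[..., 0::2], x[..., 1::2]     # the two halves of each pair
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin    # standard 2-D rotation
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Applying `rope` to both queries and keys before the attention dot product yields the relative-position property described above.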

3. Hyperparameters and Scaling

The speaker notes that many hyperparameters exist in a "forgiving basin," where small variations do not significantly impact performance.

  • Feed-Forward Ratio: The standard rule of thumb is a hidden dimension 4x the model dimension. For GLU variants, this is adjusted to ~2.67x (i.e., 8/3) to keep parameter counts consistent; a quick check follows this list.
  • Aspect Ratio: The ratio of model dimension to the number of layers is typically kept around 100. This balances the trade-off between depth (expressiveness) and width (hardware parallelization).
  • Vocabulary Size: A clear divide exists: monolingual models (English-only) often use ~30k tokens, while modern multilingual/production models (e.g., GPT-4, Llama) use 100k–200k tokens to cover broader linguistic spaces.
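
A quick sanity check of the 2/3 heuristic (the value of d_model is arbitrary): a standard FFN has two d_model x d_ff weight matrices, while a GLU variant has three, so scaling d_ff by 2/3 restores the original parameter count.

```python
d_model = 4096  # arbitrary example width

# Standard FFN: up-projection + down-projection = 2 weight matrices
d_ff_std = 4 * d_model
params_std = 2 * d_model * d_ff_std        # 134,217,728

# GLU FFN: gate + up + down = 3 matrices; scale d_ff by 2/3 to compensate
d_ff_glu = int(8 / 3 * d_model)            # ~2.67x the model dimension
params_glu = 3 * d_model * d_ff_glu        # 134,209,536 -- nearly identical

print(params_std, params_glu)
```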

4. Stability Interventions

As training costs rise, preventing "gradient spikes" and divergence is critical:

  • Z-loss: An auxiliary regularization term, typically of the form λ(log Z)^2, that penalizes the log-normalizer of the output softmax, keeping it near zero to prevent numerical overflow.
  • QK-norm: Applying normalization to queries and keys before the attention dot product, which keeps the inputs to the softmax bounded and stable.
  • Logit Soft-capping: Used in models like Gemma, this smoothly caps the logits before the softmax to prevent extreme values, though it may slightly degrade performance if the cap is too aggressive. All three interventions are sketched below.
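
Minimal sketches of the three interventions; the coefficient and cap values are illustrative defaults, not prescriptions from the lecture:

```python
import torch
import torch.nn.functional as F

def z_loss(logits: torch.Tensor, coeff: float = 1e-4) -> torch.Tensor:
    """Auxiliary loss coeff * (log Z)^2, pushing the softmax normalizer toward 1."""
    log_z = torch.logsumexp(logits, dim=-1)
    return coeff * (log_z ** 2).mean()

def qk_norm(q: torch.Tensor, k: torch.Tensor):
    """Normalize queries and keys before the dot product so the softmax inputs
    stay bounded. Plain L2 normalization shown for brevity; many models use a
    learned RMSNorm per head instead."""
    return F.normalize(q, dim=-1), F.normalize(k, dim=-1)

def soft_cap(logits: torch.Tensor, cap: float = 50.0) -> torch.Tensor:
    """Gemma-style soft-capping: squash values smoothly into (-cap, cap)."""
    return cap * torch.tanh(logits / cap)
```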

5. Inference Efficiency: GQA and Sliding Window

  • Grouped Query Attention (GQA): A middle ground between Multi-Head Attention (high quality, high cost) and Multi-Query Attention (lower quality, low cost). By letting each group of query heads share a single key/value head, GQA shrinks the KV cache and sharply reduces memory-bandwidth requirements during inference while maintaining near-MHA quality (see the sketch after this list).
  • Sliding Window Attention: Used to manage long context windows. Models often alternate between "full attention" layers and "local/sliding window" layers, aggregating global information in a few layers without paying the quadratic cost of full attention everywhere.
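
A minimal sketch of GQA, omitting the output projection; setting n_kv_heads = 1 recovers Multi-Query Attention, and n_kv_heads = n_heads recovers full Multi-Head Attention:

```python
import torch

def grouped_query_attention(q, k, v, causal_mask=None):
    """q: (batch, n_heads, seq, d_head); k, v: (batch, n_kv_heads, seq, d_head).

    Only k and v are cached during generation, so the KV cache shrinks by a
    factor of n_heads / n_kv_heads relative to Multi-Head Attention.
    """
    group = q.shape[1] // k.shape[1]
    # Broadcast each K/V head across its group of query heads
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    if causal_mask is not None:
        # A sliding-window layer would use a banded mask here instead
        scores = scores.masked_fill(causal_mask, float("-inf"))
    return scores.softmax(dim=-1) @ v
```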

Synthesis and Conclusion

The evolution of Transformer architectures is driven by a constant tension between expressive power and systems efficiency. The "vanilla" Transformer has proven remarkably robust, with most modern innovations focusing on:

  1. Systems-friendly operations: Moving to RMS Norm, dropping biases, and using GQA to keep GPUs "hot."
  2. Stability: Implementing QK-norm and Z-loss to ensure long-duration training runs do not collapse.
  3. Context Management: Using hybrid attention patterns (sliding window + full attention) to handle long sequences.

The speaker concludes that while theoretical tools are useful, the best way to understand these architectures is through empirical experimentation and observing the patterns established by successful large-scale models.
