Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 11: Scaling Laws

Key Concepts

  • Scaling Laws: Empirical relationships predicting model performance (loss/accuracy) based on compute, data, and model size.
  • μP (Maximal Update Parameterization, also written muP or MUP): A technique that keeps the optimal learning rate stable across model widths by scaling initializations and per-parameter learning rates with width.
  • WSD (Warmup-Stable-Decay): A learning rate schedule consisting of a short warmup, a long constant "stable" phase, and a rapid final decay, which allows checkpoints from the stable phase to be reused efficiently.
  • Muon: An optimizer for matrix-valued parameters that uses Newton-Schulz iterations to approximately orthogonalize updates. It showed significant gains in small-scale experiments and has been used in recent large-scale models (e.g., Kimi K2).
  • Chinchilla Scaling: The methodology of balancing model size and training data to achieve optimal performance for a given compute budget.
  • Hyperparameter Drift: The phenomenon where optimal hyperparameters (learning rate, batch size) change as a function of model scale.

1. Scaling Strategies at the Frontier

The lecture contrasts two primary philosophies for managing hyperparameter sensitivity during scaling:

  • The μP Approach (e.g., MiniCPM): Focuses on reparameterizing the model (initializations, residual connections, and learning rates) so that the optimal learning rate stays approximately invariant as the model scales. This reduces the need for extensive grid searches at large scale.
  • The Scaling Law Fitting Approach (e.g., DeepSeek, Qwen, StepFun): Involves running extensive grid searches at smaller scales to identify optimal batch sizes and learning rates, then fitting power-law curves to extrapolate these values to larger compute regimes.
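
The fitting approach can be sketched as a straight-line fit in log-log space. The snippet below is a minimal illustration, not any lab's actual pipeline: the compute budgets, the synthetic "best learning rate" values, and the helper names `fit_power_law` and `predict` are all assumptions made for the example.

```python
import numpy as np

def fit_power_law(x, y):
    """Fit y = a * x**b by least squares in log-log space and return (a, b)."""
    slope, intercept = np.polyfit(np.log(x), np.log(y), deg=1)
    return float(np.exp(intercept)), float(slope)

def predict(a, b, x):
    """Evaluate the fitted power law at x."""
    return a * x ** b

# Hypothetical small-scale sweep: compute budgets (FLOPs) and the best
# learning rate found by grid search at each one. The values are synthetic
# and noise-free, generated from a known power law so the fit can be checked.
compute = np.array([1e17, 3e17, 1e18, 3e18, 1e19])
best_lr = 0.6 * compute ** -0.125

a, b = fit_power_law(compute, best_lr)

# Extrapolate two orders of magnitude beyond the largest measured budget.
lr_at_1e21 = predict(a, b, 1e21)
print(f"fitted exponent: {b:.3f}, extrapolated LR at 1e21 FLOPs: {lr_at_1e21:.2e}")
```

With real, noisy sweep results the fitted exponent carries uncertainty, and the extrapolation is only as trustworthy as the assumption that the same power law continues to hold at the larger scale.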

2. Optimization and Hyperparameter Tuning

  • Batch Size Scaling: The optimal batch size has been found to follow a power law in the target loss or, alternatively, in the amount of training data.
  • Learning Rate Dynamics: While μP aims to keep the optimal learning rate stable, empirical studies (e.g., StepFun) show that under standard parameterization the optimal learning rate tends to decrease as model size increases and to increase with larger data volumes.
  • WSD Schedule: This "trapezoidal" schedule is highly favored for scaling experiments. By keeping the learning rate constant for the majority of training, researchers can "rewind" to the stable phase and continue training with different decay schedules, avoiding the need to restart from scratch.
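
A WSD schedule can be written as a small function of the step count. This is a minimal sketch: the linear warmup, the linear decay shape, and the parameter names are assumptions for illustration, since the summary does not specify the exact decay function used.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps, min_lr=0.0):
    """Warmup-Stable-Decay learning rate at a given step.

    Assumed shape: linear warmup to peak_lr, constant plateau, then a
    linear decay to min_lr over the final decay_steps steps.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return peak_lr
    frac = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * frac

# Two hypothetical runs that differ only in total length.
short_run = [wsd_lr(s, 1000, 3e-4, 100, 100) for s in range(1001)]
long_run = [wsd_lr(s, 2000, 3e-4, 100, 100) for s in range(2001)]

# Before the short run starts decaying (step 900), both runs use the same
# learning rates, so a checkpoint from that point can seed the longer run.
schedules_match = short_run[:900] == long_run[:900]
print(schedules_match)
```

Because the two schedules agree until the decay begins, only the decay portion has to be rerun to obtain a finished model at a different token budget.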

3. Advanced Optimizers: The Muon Case Study

  • Mechanism: Muon treats matrix-valued parameters differently from vector-valued ones. It uses Newton-Schulz iterations to approximately orthogonalize the update matrix, which corresponds to taking steps measured in the spectral norm rather than coordinate-wise (as Adam does).
  • Scaling Challenges: Small-scale benchmarks (e.g., NanoGPT) showed Muon significantly outperforming Adam. However, scaling it to large models introduced instabilities. The successful deployment of Muon in Kimi K2 demonstrates that while it is a powerful tool, it requires careful tuning and additional "bells and whistles" to prevent divergence at scale.
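
The orthogonalization step can be illustrated with the classical cubic Newton-Schulz iteration. This is a sketch under stated assumptions, not Muon's actual implementation: practical Muon code typically uses a tuned higher-order polynomial and only a handful of iterations, and the function name and iteration count here are chosen for clarity.

```python
import numpy as np

def newton_schulz_orthogonalize(G, num_iters=30, eps=1e-7):
    """Approximate the orthogonal polar factor U @ Vt of G = U @ diag(S) @ Vt.

    Uses the cubic iteration X <- 1.5 X - 0.5 (X X^T) X, which pushes every
    nonzero singular value of X toward 1 while leaving the singular vectors
    unchanged.
    """
    # Dividing by the Frobenius norm bounds all singular values by 1,
    # which is inside the convergence region of the iteration.
    X = G / (np.linalg.norm(G) + eps)
    for _ in range(num_iters):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 6))  # stand-in for a gradient or momentum matrix
Q = newton_schulz_orthogonalize(G)

# Reference answer from an explicit SVD, for comparison only.
U, S, Vt = np.linalg.svd(G, full_matrices=False)
print(np.allclose(Q, U @ Vt, atol=1e-6))
```

The iteration needs only matrix multiplications, which is why this family of methods is attractive on accelerators compared with computing an SVD at every step.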

4. Methodological Frameworks

  • IsoFLOP Analysis: A standard method where the compute budget is held fixed and the split between model size and token count is varied to find the configuration with the lowest loss.
  • Physics-Inspired Derivations: μP is derived by asserting two invariants:
    1. Activation Invariance: Activations at initialization should remain $O(1)$ regardless of width.
    2. Feature Learning: The change in activations after a gradient step should be $O(1)$ to ensure the model actually learns.
  • Grid Search vs. Extrapolation: While scaling laws provide a scientific framework, the lecturer emphasizes that they are often "vibes-based." Because small differences in architecture (e.g., SwiGLU, RMSNorm, weight decay) can break theoretical assumptions, practitioners often perform univariate sweeps to ensure local optimality.
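
The two μP invariants listed above can be checked numerically on a toy example. The sketch below uses a single square linear layer trained with one plain SGD step on a squared-error loss, which is an assumption made for illustration and far simpler than the full μP recipe for transformers; the function names are hypothetical.

```python
import numpy as np

def activation_rms(width, init_std, rng, batch=64):
    """RMS of h = W x at initialization, for inputs with O(1) coordinates."""
    x = rng.standard_normal((batch, width))
    W = rng.standard_normal((width, width)) * init_std
    h = x @ W.T
    return float(np.sqrt(np.mean(h ** 2)))

def relative_update(width, lr, rng):
    """Take one SGD step on 0.5 * ||W x - y||^2 and return
    ||change in W x|| / ||W x - y|| for the same input x."""
    x = rng.standard_normal(width)
    y = rng.standard_normal(width)
    W = rng.standard_normal((width, width)) / np.sqrt(width)
    err = W @ x - y
    W_new = W - lr * np.outer(err, x)  # gradient w.r.t. W is err x^T
    delta_h = W_new @ x - W @ x        # equals -lr * ||x||^2 * err
    return float(np.linalg.norm(delta_h) / np.linalg.norm(err))

rng = np.random.default_rng(0)
for width in (256, 1024):
    scaled = activation_rms(width, 1.0 / np.sqrt(width), rng)  # stays near 1
    unscaled = activation_rms(width, 1.0, rng)                 # grows like sqrt(width)
    fixed_lr = relative_update(width, 1e-3, rng)               # grows like width
    scaled_lr = relative_update(width, 0.1 / width, rng)       # stays near 0.1
    print(width, round(scaled, 2), round(unscaled, 1),
          round(fixed_lr, 2), round(scaled_lr, 3))
```

In this toy setting, an initialization standard deviation of 1/sqrt(width) keeps activations at a constant scale, and a learning rate proportional to 1/width keeps the size of the activation change constant. The exact exponents in a real μP setup depend on the layer type and the optimizer.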

5. Notable Quotes and Perspectives

  • "Scaling laws kind of have this very scientific feel to them... but ultimately, a big part of scaling laws is still vibes." — Tatsu Hashimoto, highlighting the uncertainty in extrapolating small-scale experiments to frontier models.
  • "If you get the MUP right... you get exactly the learning rate optimality invariance." — Noting the success of MUP in stabilizing training.
  • "It is really hard to know whether something works at scale. It is very, very hard." — A recurring theme regarding the gap between small-scale algorithmic breakthroughs and large-scale production stability.

Synthesis/Conclusion

Scaling language models is an iterative process of balancing theoretical rigor with empirical "art." While techniques like μP and WSD provide robust ways to manage hyperparameter drift and compute efficiency, there is no universal "silver bullet." Current industry practice is a hybrid: scaling laws are used to predict optimal batch sizes and learning rates, while practitioners stay alert to architectural details that may cause these laws to deviate or fail at extreme scales. The transition toward sparse models (MoE) and advanced optimizers like Muon represents the next frontier in scaling research.
