Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 9: Scaling Laws

By Stanford Online

Key Concepts

  • Scaling Laws: Empirical predictive rules that describe how model performance (typically test loss) improves as a function of resources (compute, data, parameters).
  • Power Law Relationship: The observation that on a log-log plot, the relationship between resources and error is linear, implying polynomial decay of error.
  • Critical Batch Size: The batch size at which training transitions from being variance-limited (where increasing the batch size yields near-perfect scaling, i.e. proportionally fewer optimization steps) to bias-limited (where larger batches bring diminishing returns).
  • IsoFLOPs: A methodology for optimizing hyperparameters by fixing a compute budget and sweeping across the other variables (e.g., model size vs. data size) to find the configuration with the lowest loss at that budget.
  • Chinchilla Scaling: A landmark study (Hoffmann et al.) that corrected earlier assumptions, establishing that for a given compute budget, models should be trained on significantly more data than previously thought (approx. 20 tokens per parameter).
  • Upstream vs. Downstream Performance: The distinction between model performance on pre-training objectives (perplexity/loss) and performance on specific downstream tasks, which may not always correlate linearly.
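
The IsoFLOP idea above can be sketched in a few lines. This is a minimal illustration, not the procedure from any particular paper: it assumes the loss at a fixed compute budget is roughly quadratic in log(model size), and the function name `isoflop_optimum` and all numbers are hypothetical.

```python
import numpy as np

def isoflop_optimum(model_sizes, losses):
    """Estimate the loss-minimizing model size at one fixed compute budget.

    Fits a parabola to loss as a function of log(model size) and returns
    the model size at the parabola's vertex.
    """
    log_n = np.log(np.asarray(model_sizes, dtype=float))
    center = log_n.mean()  # center the inputs for a better-conditioned fit
    a, b, _ = np.polyfit(log_n - center, np.asarray(losses, dtype=float), deg=2)
    if a <= 0:
        raise ValueError("fitted curve has no minimum; widen the sweep")
    return float(np.exp(center - b / (2.0 * a)))

# Made-up sweep at a single compute budget (synthetic losses, not real data).
sizes = np.array([2.5e8, 5e8, 1e9, 2e9, 4e9])
synthetic_losses = 2.0 + 0.05 * (np.log(sizes) - np.log(1e9)) ** 2

best_size = isoflop_optimum(sizes, synthetic_losses)
```

Repeating this for several compute budgets gives one optimal model size per budget, and those points can then be fit with a power law to extrapolate to larger budgets.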

1. The Engineering Paradigm of Scaling Laws

Scaling laws serve as a powerful tool for large-scale model development, allowing engineers to optimize hyperparameters and architectures at a small scale and extrapolate results to massive, expensive training runs. This avoids the "wasteful" approach of tuning on multi-million dollar runs.

  • Predictability: By fitting curves to small-scale experiments, researchers can predict the performance of models orders of magnitude larger.
  • Historical Context: Scaling laws are not new; they have roots in 1990s machine learning theory (e.g., Vapnik, Cortes) regarding sample complexity and error decay.
  • The "Belief" System: In frontier labs, scaling laws are treated as a paradigm—a way of life—where interventions (like architecture changes) are only considered valid if they demonstrate improved scaling trends.
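
As a rough illustration of the predictability point, the sketch below fits a power law to a handful of small-scale runs and extrapolates to a much larger budget. It assumes a pure power law with no irreducible loss floor, and the compute values, exponent, and function names are hypothetical, chosen only to make the example run.

```python
import numpy as np

def fit_power_law(x, loss):
    """Fit loss ~ c * x**(-alpha) by linear least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(x), np.log(loss), deg=1)
    return -slope, float(np.exp(intercept))

def predict_loss(x, alpha, c):
    """Evaluate the fitted power law at new resource values."""
    return c * np.asarray(x, dtype=float) ** (-alpha)

# Hypothetical small-scale runs: training compute (FLOPs) and final loss.
# The losses are generated from a known power law, not measured.
compute = np.array([1e17, 1e18, 1e19, 1e20])
loss = 20.0 * compute ** (-0.05)

alpha, c = fit_power_law(compute, loss)
extrapolated = float(predict_loss(1e23, alpha, c))  # three orders of magnitude beyond the data
```

With real measurements the points will not lie exactly on a line, so the fitted exponent carries uncertainty that grows with the extrapolation distance.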

2. Data Scaling Laws

Data scaling laws are the most fundamental, univariate relationships in deep learning.

  • Functional Form: Error typically decays polynomially ($1/n^\alpha$).
  • Non-parametric Analogy: Neural networks often behave like non-parametric regressors in high-dimensional spaces, with the rate at which error decays (the exponent $\alpha$, not the optimizer's learning rate) determined by the intrinsic dimension of the task.
  • Data Mixture & Repetition: While slopes of scaling laws are often determined by the model class, intercepts are determined by data quality. Repeating data (epochs > 1) eventually leads to a "dark curve" where performance degrades compared to fresh data.
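
A short worked derivation may help connect the functional form to the log-log picture. The second line uses the textbook case of estimating a mean $\mu$ from $n$ i.i.d. samples with variance $\sigma^2$; that example is an illustrative assumption, not something specific to language models.

```latex
\begin{align*}
\mathrm{Error}(n) = \frac{C}{n^{\alpha}}
  \quad &\Longrightarrow \quad
  \log \mathrm{Error}(n) = \log C - \alpha \log n, \\
\mathbb{E}\big[(\bar{x}_n - \mu)^2\big] = \frac{\sigma^2}{n}
  \quad &\Longrightarrow \quad
  \log \mathbb{E}\big[(\bar{x}_n - \mu)^2\big] = \log \sigma^2 - \log n .
\end{align*}
```

In both cases the log-log plot is a straight line with slope $-\alpha$ (here $\alpha = 1$ for the mean) and intercept $\log C$, which is why a change in slope and a change in intercept can be read off separately.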

3. Model Engineering and Hyperparameters

Scaling laws provide a quantitative framework for making architectural and optimization trade-offs:

  • Architecture Selection: By training smaller versions of different architectures (e.g., Transformers vs. LSTMs) across compute ranges, one can identify which architecture scales better. If an architecture has a worse slope, it will inevitably perform worse at larger scales.
  • Batch Size & Learning Rate:
    • Critical Batch Size: A rule of thumb to balance convergence speed and parallelization efficiency. It scales as a power law with respect to the target loss.
    • Learning Rate: Generally, as model width increases, the optimal learning rate decreases. Techniques like µP (Maximal Update Parameterization) attempt to reparameterize models so that the optimal learning rate remains constant across scales.
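
The batch-size trade-off can be made concrete with the simple model commonly attributed to McCandlish et al. (2018), in which the number of steps $S$ and the number of examples $E$ needed to reach a target loss satisfy $(S/S_{\min} - 1)(E/E_{\min} - 1) = 1$, with critical batch size $B_{\text{crit}} = E_{\min}/S_{\min}$. The sketch below assumes that model holds; the values of `s_min` and `e_min` are made up.

```python
def steps_and_examples(batch_size, s_min, e_min):
    """Steps and examples needed to reach a fixed target loss at a given batch size.

    s_min: minimum number of steps (reached as batch size grows without bound)
    e_min: minimum number of examples (reached as batch size shrinks toward zero)
    """
    b_crit = e_min / s_min
    steps = s_min * (1.0 + b_crit / batch_size)
    examples = e_min * (1.0 + batch_size / b_crit)
    return steps, examples

# Hypothetical target loss with s_min = 1,000 steps and e_min = 1,000,000 examples,
# so the critical batch size is 1,000.
small = steps_and_examples(10, s_min=1_000, e_min=1_000_000)
critical = steps_and_examples(1_000, s_min=1_000, e_min=1_000_000)
large = steps_and_examples(100_000, s_min=1_000, e_min=1_000_000)
```

Well below the critical batch size, a larger batch cuts the step count almost proportionally at little extra data cost; well above it, the step count barely moves while the number of examples consumed grows almost linearly with the batch size.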

4. The Chinchilla vs. Kaplan Discrepancy

The transition from the Kaplan (OpenAI) era to the Chinchilla (DeepMind) era highlights the sensitivity of scaling laws to implementation details:

  • The Disagreement: Kaplan suggested training massive models with fewer tokens; Chinchilla argued for smaller models trained on much more data.
  • Root Causes: The discrepancy was traced to:
    1. Parameter Counting: Excluding embedding/softmax layers skewed the results.
    2. Convergence: Kaplan’s smaller models were not trained to convergence, in part because the warm-up phase took up too large a share of their short training runs.
    3. Batch Size: Using fixed, sub-optimal batch sizes for smaller models.
  • Lesson: Scaling laws are "engineered" rather than magical; they are highly sensitive to the specific experimental setup and hyperparameter tuning.
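
For a sense of what the roughly 20 tokens-per-parameter rule implies, here is a back-of-the-envelope sketch. It assumes the common approximation that training costs about 6 FLOPs per parameter per token (C ≈ 6ND), which is not stated in the summary above; the function name and default values are illustrative.

```python
import math

def chinchilla_allocation(compute_flops, tokens_per_param=20.0, flops_per_param_token=6.0):
    """Split a compute budget into model size N and token count D.

    Assumes C = flops_per_param_token * N * D and D = tokens_per_param * N,
    which gives N = sqrt(C / (flops_per_param_token * tokens_per_param)).
    """
    n_params = math.sqrt(compute_flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example budget of 5.76e23 FLOPs (an illustrative figure).
n_params, n_tokens = chinchilla_allocation(5.76e23)
```

Under these assumptions the budget works out to roughly 70 billion parameters and 1.4 trillion tokens. Both constants are rules of thumb, so the result is best read as an order-of-magnitude guide rather than a precise prescription.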

5. Synthesis and Conclusion

Scaling laws are essential for evidence-driven engineering at scale. While they provide a robust way to predict performance and optimize resource allocation, they are not infallible.

  • Actionable Insight: When designing a production model, prioritize "overtraining" (training on more data than the Chinchilla-optimal ratio) to optimize for inference efficiency, as serving costs often outweigh training costs in production environments.
  • Final Warning: Always be skeptical of extrapolations. If the compute range is too narrow, it is difficult to distinguish between polynomial and exponential scaling. Furthermore, always validate that upstream perplexity improvements actually transfer to downstream task performance.
