Stanford CS336 Language Modeling from Scratch | Spring 2025 | Scaling laws 2

Scaling Laws Lecture 2: Case Studies and MUP - Detailed Summary

Key Concepts:

  • Scaling Laws
  • Maximal Update Parametrization (MUP)
  • Warm-up Stable Decay (WSD) Learning Rate
  • IsoFLOP Analysis
  • Chinchilla Scaling
  • Model Parameterization
  • Learning Rate Scaling
  • Batch Size Optimization
  • Activation Stability
  • Gradient Stability

1. Introduction

This lecture focuses on practical applications of scaling laws in large language model (LLM) development. It covers case studies of Cerebras-GPT, MiniCPM, and DeepSeek LLM, highlighting their scaling strategies, and then delves into the Maximal Update Parametrization (MUP) method, explaining its mathematical foundations and empirical validation.

2. Motivation and Skepticism

The lecture addresses skepticism surrounding scaling laws, questioning their reliability and applicability in real-world scenarios. Key questions include:

  • Does Chinchilla's approach to scaling laws actually work?
  • Can isoFLOP fitting accurately predict the token-to-parameter trade-off?
  • Can scaling laws optimize learning rates?
  • Should specific architectures or parameterizations be chosen for better scaling?

3. The Post-Chinchilla Landscape

Following the DeepMind Chinchilla paper, the LLM landscape became more secretive, with leading labs hesitant to share scaling strategies. This necessitates relying on publicly available information from competently executed large-scale models.

4. Case Studies of Scaling Strategies

The lecture examines three models: Cerebras-GPT, MiniCPM, and DeepSeek LLM. Each employs a different mix of scaling strategies, offering valuable insights into effective scaling.

4.1 Cerebras-GPT

  • Model Family: 0.1 to 13 billion parameter models.
  • Training Recipe: Chinchilla recipe (optimal token-to-parameter ratio).
  • Core Finding: MUP stabilizes scaling and improves performance.
  • MUP Validation: Cerebras-GPT provides one of the first public validations of MUP.
  • Standard Parameterization (SP) vs. MUP: SP exhibits oscillations around the predicted scaling curve as learning rates are adjusted at each size, while MUP tracks the predicted law more closely.
  • Implementation Details: The appendix of the Cerebras-GPT paper provides a table contrasting SP and MUP initialization and parameterization (a rough sketch of the scaling rules appears after this list).
    • Non-embedding weights are initialized with variance scaled down by one over the width.
    • Per-layer learning rates are scaled down by one over the width.
  • Aggressive Scaling: Cerebras-GPT pairs MUP with an aggressive transfer strategy, conducting an extensive hyperparameter search on a 40-million-parameter proxy model and then scaling up using MUP.
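
As a rough illustration of these scaling rules (not the paper's actual code; the base width, standard deviation, and learning rate below are hypothetical), hyperparameters tuned on a small proxy can be re-scaled for a wider target model roughly as follows:

```python
import math

def mup_scaled_settings(width, base_width=256, base_lr=6e-3, base_std=0.02):
    """Illustrative MUP-style re-scaling of init and per-layer LR by the
    width ratio m = width / base_width (all base values are hypothetical)."""
    m = width / base_width
    return {
        # non-embedding (hidden) weights: init variance shrinks like 1/m,
        # so the std shrinks like 1/sqrt(m); the Adam LR shrinks like 1/m
        "hidden": {"init_std": base_std / math.sqrt(m), "lr": base_lr / m},
        # embedding weights keep their base init and learning rate
        "embedding": {"init_std": base_std, "lr": base_lr},
    }

# Hyperparameters tuned on a small (e.g. ~40M-parameter) proxy transfer to a
# wider target model by applying the re-scaled settings per parameter group.
print(mup_scaled_settings(width=4096))
```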

4.2 MiniCPM

  • Goal: Train high-quality small language models using significant compute.
  • Scaling Strategy: Employs MUP to stabilize scaling and simplify hyperparameter tuning.
  • Model Size: 1.2 to 2.4 billion parameters.
  • Performance: Outperforms most 2B models and matches many modern 7B models (as of mid-2024).
  • MUP Implementation: Similar to Cerebras-GPT, MiniCPM scales embedding outputs by a constant, scales residual connections with the square root of the number of layers, and sets weight initialization relative to a base width. Learning rates are likewise scaled with the width of the model.
  • Batch Size Optimization: MiniCPM aims to determine the critical batch size, the point beyond which larger batches yield diminishing returns.
  • Kaplan Paper Replication: MiniCPM reproduces a similar plot to the Kaplan paper, identifying a log-log linear relationship between target loss and critical batch size.
  • Learning Rate Stability: MUP ensures that the optimal learning rate remains stable across different model sizes.
  • WSD Learning Rate: MiniCPM popularized the Warm-up Stable Decay (WSD) learning rate schedule, which consists of three phases: warm-up, a stable plateau, and a short decay (a minimal sketch of the schedule appears after this list).
    • Advantage: Enables data-scaling experiments within a single training run by rewinding to checkpoints from the stable phase and applying a cool-down.
    • Comparison to Cosine Learning Rate: WSD is comparable to cosine learning rate schedules but offers the advantage of flexible termination points.
  • Chinchilla Analysis: MiniCPM performs a Chinchilla analysis using WSD, varying the number of tokens and model sizes.
    • Methods: Employs method one (overlaying learning curves and taking the lower envelope) and method three (jointly fitting a two-variable scaling law).
    • Token-to-Parameter Ratio: MiniCPM estimates a very high token-to-parameter ratio of 192 tokens per parameter, which is significantly higher than most other literature.
    • Conclusion: The Chinchilla analysis is not a tight constraint, and the rule of thumb of roughly 20 tokens per parameter (20× model size) is only a starting point.
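
A minimal sketch of such a schedule as a function of the training step, assuming a linear warm-up and linear cool-down with hypothetical phase boundaries (MiniCPM's exact decay shape and fraction may differ):

```python
def wsd_lr(step, max_steps, peak_lr=1e-3, min_lr=1e-4,
           warmup_steps=1_000, decay_frac=0.1):
    """Warm-up Stable Decay: linear warm-up, a constant plateau, then a short
    cool-down over the final `decay_frac` of training (shapes are illustrative)."""
    decay_start = int(max_steps * (1 - decay_frac))
    if step < warmup_steps:                    # phase 1: warm-up
        return peak_lr * step / warmup_steps
    if step < decay_start:                     # phase 2: stable plateau
        return peak_lr
    # phase 3: decay; checkpoints saved during the plateau can be rewound and
    # cooled down at different points to get several data-scale measurements
    # from a single long run
    progress = (step - decay_start) / (max_steps - decay_start)
    return peak_lr + (min_lr - peak_lr) * progress

lrs = [wsd_lr(s, max_steps=100_000) for s in range(0, 100_000, 10_000)]
```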

4.3 DeepSeek LLM

  • Model Size: 7 and 67 billion parameter models.
  • Performance: Matches the performance of Llama (as of 2024).
  • Scaling Strategy: Does not use MUP. Directly estimates optimal batch size and learning rate.
  • Hyperparameter Optimization: Trains models with different batch sizes and learning rates, identifying optimal values through grid search.
  • Scaling Law Fitting: Fits power laws for optimal batch size and learning rate as functions of compute, extrapolating them to larger models (see the fitting sketch after this list).
  • WSD Learning Rate: Employs a WSD learning rate with two decay phases.
  • Chinchilla Analysis: Replicates the Chinchilla analysis to determine the optimal trade-off between token count and model size.
  • Scaling Law Validation: Extrapolates from smaller models to predict the performance of 7B and 67B models, achieving accurate predictions.
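
The sketch below illustrates the generic procedure (DeepSeek's actual fitted constants are not reproduced; the grid-search results are invented for illustration): fit a straight line in log-log space, i.e. a power law, to the grid-searched optima and extrapolate it to a larger compute budget. The same recipe applies to the optimal learning rate.

```python
import numpy as np

# Hypothetical grid-search results: compute budget (FLOPs) -> best batch size
compute    = np.array([1e17, 1e18, 1e19, 1e20])
best_batch = np.array([3e5, 7e5, 1.5e6, 3.3e6])   # tokens per batch (made up)

# Fit log(B_opt) = b * log(C) + log(a), i.e. the power law B_opt = a * C^b
b, log_a = np.polyfit(np.log(compute), np.log(best_batch), deg=1)

def predict_optimal_batch(c):
    """Extrapolate the fitted power law to a larger compute budget."""
    return np.exp(log_a) * c ** b

print(f"fitted exponent b = {b:.3f}")
print(f"predicted optimal batch at 1e22 FLOPs: {predict_optimal_batch(1e22):.2e} tokens")
```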

5. Recent Models and Scaling Insights

  • Llama 3: Replicates isoFLOP scaling and finds an optimal token-to-parameter ratio of approximately 39 to 1. Correlates compute with NLL (negative log-likelihood) and then NLL with downstream accuracies (a two-stage sketch appears after this list).
  • Hunyuan-Large: Replicates the Chinchilla analysis and obtains a data-to-active-parameter ratio of 96 to 1.
  • MiniMax-01: Justifies its architecture choices (linear attention) through the lens of scaling laws.
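
Llama 3's two-stage prediction can be sketched as a power law from compute to NLL followed by a sigmoidal map from NLL to downstream accuracy. The functional forms mirror the general idea only; all constants below are placeholders, not the report's fitted values.

```python
import numpy as np

def nll_from_compute(c, e=0.6, a=2.0, alpha=0.07):
    """Stage 1: normalized NLL on a benchmark as a saturating power law in
    compute (constants are placeholders)."""
    return e + a * c ** (-alpha)

def accuracy_from_nll(nll, floor=0.25, ceil=0.95, mid=1.0, scale=0.15):
    """Stage 2: map NLL to downstream accuracy with a sigmoid between a
    chance-level floor and a ceiling (again, illustrative constants)."""
    return floor + (ceil - floor) / (1.0 + np.exp((nll - mid) / scale))

for c in (1e22, 1e24, 1e26):
    nll = nll_from_compute(c)
    print(f"C={c:.0e}  NLL={nll:.3f}  predicted accuracy={accuracy_from_nll(nll):.3f}")
```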

6. Common Ingredients in Scaling Recipes

  • Cerebras-GPT and MiniCPM use MUP to stabilize hyperparameters.
  • MiniCPM uses WSD for efficient Chinchilla-style scaling.
  • DeepSeek directly fits scaling laws to batch size and learning rate.
  • IsoFLOP analysis is used to determine model sizing (a minimal sketch follows).
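
A minimal sketch of the isoFLOP idea: fix a compute budget C ≈ 6·N·D, sweep model size N, infer the token count D = C/(6N), and pick the N that minimizes a fitted Chinchilla-style loss L(N, D) = E + A/N^α + B/D^β. The constants below loosely follow the Chinchilla paper's published parametric fit and are used only to make the example run.

```python
import numpy as np

# Chinchilla-style parametric loss L(N, D) = E + A/N**alpha + B/D**beta.
# Constants loosely follow the published Chinchilla fit (illustration only).
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params, n_tokens):
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def isoflop_optimum(compute_flops):
    """Sweep model sizes at a fixed budget (C ~ 6*N*D) and return the
    loss-minimizing (N, D) pair on the grid."""
    n_grid = np.logspace(8, 12, 1000)           # 100M to 1T parameters
    d_grid = compute_flops / (6 * n_grid)       # tokens implied by the budget
    i = loss(n_grid, d_grid).argmin()
    return n_grid[i], d_grid[i]

n_opt, d_opt = isoflop_optimum(1e24)
print(f"N* = {n_opt:.2e} params, D* = {d_opt:.2e} tokens, "
      f"tokens/param ~ {d_opt / n_opt:.0f}")
```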

7. Understanding MUP: Mathematical Derivation

MUP aims to make optimal hyperparameters invariant to model width by adjusting initialization scales and per-layer learning rates.

7.1 Core Conceptual Objects

  • Activation Stability: Activations at initialization should remain roughly constant as model width increases.
  • Gradient Stability: The change in activations after one gradient step should also remain roughly constant (both conditions are written out below).
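
Written out (the notation here is one common convention, not necessarily the lecture's): for a hidden layer of width n with activations h_ℓ, the two conditions ask that each coordinate stays order one as n grows, both at initialization and after an optimizer step.

```latex
\begin{align*}
  \text{Activation stability:}\quad & \|h_\ell\|_2 = \Theta(\sqrt{n})
    \;\Longleftrightarrow\; h_{\ell,i} = \Theta(1) \quad \text{at initialization,}\\
  \text{Gradient stability:}\quad & \|\Delta h_\ell\|_2 = \Theta(\sqrt{n})
    \;\Longleftrightarrow\; \Delta h_{\ell,i} = \Theta(1) \quad \text{after one gradient step.}
\end{align*}
```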

7.2 Derivation Steps

  1. Deep Linear Network: Consider a deep linear network with Gaussian initialization.
  2. Activation Size: Determine the size of activations at initialization.
  3. Matrix Concentration Limit: Apply random matrix theory to approximate the operator norm of the weight matrix.
  4. Initialization Choice: Pick a specific initialization scale (sigma) based on fan-in and aspect ratio.
  5. Inductive Proof: Inductively prove that every layer has the correct activation size.
  6. Learning Rate Derivation: Analyze the change in activation after one gradient step.
  7. Loss Stability Assumption: Assume that the change in loss after one gradient step is also Θ(1).
  8. Learning Rate Formula: Derive the learning rate scaling for SGD (η ∝ fan-out/fan-in) and Adam (η ∝ 1/fan-in).
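
Condensing the quantitative core of these steps (fan-in/fan-out notation, constants suppressed): for a Gaussian weight matrix W with entrywise standard deviation σ, matrix concentration gives ‖W‖_op ≈ σ(√fan-in + √fan-out), which fixes the initialization scale; requiring ΔL = Θ(1) after one step then yields the per-layer learning rates.

```latex
\begin{align*}
  \|W\|_{\mathrm{op}} &\approx \sigma\left(\sqrt{\text{fan-in}} + \sqrt{\text{fan-out}}\right)
    && \text{(matrix concentration)}\\
  \sigma &= \frac{1}{\sqrt{\text{fan-in}}}\cdot\frac{1}{1 + \sqrt{\text{fan-out}/\text{fan-in}}}
    && \text{(keeps } \|Wx\|\approx\|x\| \text{ at initialization)}\\
  \eta_{\mathrm{SGD}} \propto \frac{\text{fan-out}}{\text{fan-in}},
  &\qquad \eta_{\mathrm{Adam}} \propto \frac{1}{\text{fan-in}}
    && \text{(keeps } \Delta h = \Theta(1) \text{ after one step)}
\end{align*}
```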

7.3 MUP Implementation Summary

  • Initialization: Initialize Gaussian weights with standard deviation proportional to 1/√fan-in, with a correction factor for the aspect ratio (fan-out/fan-in).
  • Learning Rate: For SGD, scale the learning rate as fan-out/fan-in; for Adam, as 1/fan-in (a PyTorch-style sketch follows).
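
A minimal PyTorch-style sketch of these two rules applied to linear layers with Adam. This is an illustrative implementation under the stated assumptions, not the official `mup` package; embedding/output layers and biases would need their own rules, and the base learning rate and base fan-in are hypothetical.

```python
import math
import torch
import torch.nn as nn

def mup_param_groups(model: nn.Module, base_lr: float = 1e-3, base_fan_in: int = 256):
    """Apply MUP-style init (std ~ 1/sqrt(fan-in) with an aspect-ratio
    correction) and per-layer Adam LRs (~ 1/fan-in) to every Linear layer.
    Illustrative sketch only; bias/embedding/output handling is simplified."""
    groups = []
    for module in model.modules():
        if isinstance(module, nn.Linear):
            fan_out, fan_in = module.weight.shape
            # sigma = 1 / (sqrt(fan_in) + sqrt(fan_out)), i.e. 1/sqrt(fan_in)
            # times an aspect-ratio correction
            std = 1.0 / (math.sqrt(fan_in) + math.sqrt(fan_out))
            nn.init.normal_(module.weight, mean=0.0, std=std)
            # Adam rule: lr ~ 1/fan-in (the SGD rule would be fan-out/fan-in)
            groups.append({"params": module.parameters(),
                           "lr": base_lr * base_fan_in / fan_in})
    return groups

model = nn.Sequential(nn.Linear(256, 1024), nn.ReLU(), nn.Linear(1024, 256))
optimizer = torch.optim.Adam(mup_param_groups(model))
```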

8. Empirical Validation of MUP

A large-scale exploration of MUP transfer investigates its robustness to various architectural modifications.

8.1 Experimental Setup

  • Width Scaling: Primarily focuses on width scaling while keeping depth fixed.
  • Attention Scaling: Uses 1/d scaling for attention logits instead of the standard 1/√d (a one-line sketch follows).
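
The attention change amounts to one line in the pre-softmax logits; a hedged sketch (assuming standard scaled dot-product notation, with q and k of head dimension d):

```python
import torch

def attention_logits(q: torch.Tensor, k: torch.Tensor, mup_scaling: bool = True):
    """Pre-softmax attention logits. Standard parameterization divides by
    sqrt(d); the MUP variant used in this study divides by d instead."""
    d = q.shape[-1]
    scale = d if mup_scaling else d ** 0.5
    return q @ k.transpose(-2, -1) / scale
```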

8.2 Results

  • Nonlinearities: MUP works well with different nonlinearities (SwiGLU, squared ReLU, ReLU).
  • Batch Sizes: MUP is robust to variations in batch sizes.
  • Initializations: MUP is robust to different initialization schemes.
  • Limitations: MUP breaks down with learnable gains, exotic optimizers (Lion), and strong weight decay.
  • Large-Scale Validation: A large-scale experiment confirms that the optimal learning rate remains stable at 2e-6.

9. Conclusion

Scaling in the wild involves setting hyperparameters using scaling laws, employing MUP (or assuming hyperparameter stability) to avoid extensive search, and using alternative learning rate schedules such as WSD to reduce compute requirements. While MUP shows promise, it has not become a universally adopted consensus.
