Stanford CS25: Transformers United V6 I The Ultra-Scale Talk: Scaling Training to Thousands of GPUs

By Unknown Author

Key Concepts

  • Data Parallelism (DP): Sharding data across GPUs while replicating the model.
  • Zero Redundancy Optimizer (ZeRO): Techniques (ZeRO-1, 2, 3) to shard optimizer states, gradients, and parameters to reduce memory footprint.
  • Tensor Parallelism (TP): Sharding individual model layers (matrix multiplications) across GPUs.
  • Pipeline Parallelism (PP): Splitting the model by depth, so that consecutive groups of layers run on different GPUs.
  • Sequence Parallelism (SP): Sharding activations along the sequence dimension to handle long-context training.
  • Expert Parallelism (EP): Distributing Mixture-of-Experts (MoE) layers across GPUs using all-to-all communication.
  • Communication Overlap: The practice of scheduling communication (all-reduce, all-gather) concurrently with computation to minimize GPU idle time.

1. Scaling Training to Thousands of GPUs

The primary motivation for scaling is that model capability tends to improve as parameter count and the number of training tokens grow. Training trillion-parameter models means processing very large datasets (on the order of 15 trillion tokens) while staying within per-GPU memory limits. The goal is to keep the GPU computation stream saturated and to minimize idle time caused by communication overhead.

2. Data Parallelism and ZeRO Optimizations

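  • Vanilla DP: Replicates the model on every GPU and gives each replica a different slice of the batch. It is model-agnostic, but every GPU holds a full copy of the parameters, gradients, and optimizer states, so memory becomes the bottleneck.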
  • ZeRO-1: Shards optimizer states across GPUs, reducing memory usage without increasing communication overhead significantly.
  • ZeRO-2: Shards both optimizer states and gradients.
  • ZeRO-3 (FSDP): Shards parameters, gradients, and optimizer states. Each layer's parameters are gathered (via all-gather) only when needed for that layer's forward or backward pass and are freed afterwards. Prefetching the next layer's parameters while the current layer computes hides part of that cost. The net effect is to trade extra communication for memory efficiency.
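The memory effect of each ZeRO stage can be estimated with a few lines of arithmetic. The sketch below is not from the talk. It assumes mixed-precision Adam with 2 bytes per parameter for weights, 2 bytes for gradients, and 12 bytes for optimizer state, and it ignores activations and temporary buffers. The function name and its arguments are illustrative.

```python
def zero_bytes_per_gpu(num_params, num_gpus, stage, k=12):
    """Approximate per-GPU bytes for parameters, gradients, and optimizer states.

    Assumptions (not from the talk): half-precision weights and gradients
    (2 bytes each per parameter) and k bytes per parameter of optimizer
    state (k=12 for Adam with fp32 master weights, momentum, and variance).
    Activations and temporary buffers are not counted.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0 (vanilla DP), 1, 2, or 3")
    params = 2 * num_params
    grads = 2 * num_params
    optim = k * num_params
    if stage >= 1:   # ZeRO-1: shard optimizer states
        optim /= num_gpus
    if stage >= 2:   # ZeRO-2: also shard gradients
        grads /= num_gpus
    if stage >= 3:   # ZeRO-3: also shard parameters
        params /= num_gpus
    return params + grads + optim


if __name__ == "__main__":
    # Hypothetical setup: a 7.5B-parameter model on 64 GPUs.
    for stage in range(4):
        gb = zero_bytes_per_gpu(7.5e9, 64, stage) / 1e9
        print(f"stage {stage}: {gb:.1f} GB per GPU")
```

Under these assumptions the optimizer states dominate, which is why ZeRO-1 alone already removes most of the redundancy.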

3. Tensor and Sequence Parallelism

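  • Tensor Parallelism (TP): Splits matrix multiplications (e.g., MLP blocks, QKV projections) across GPUs. Each GPU computes a partial result, and an all-reduce combines them so that the output matches the unsharded computation.
  • Sequence Parallelism (SP): Addresses the memory explosion of long-context training by sharding activations along the sequence dimension, covering operations such as layer norm that TP alone leaves replicated. Because an all-reduce is equivalent to a reduce-scatter followed by an all-gather, the TP all-reduce can be split into those two operations at the boundaries between TP and SP regions. The two operations swap roles in the backward pass, so gradients stay synchronized without an extra all-reduce for the layer norms.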
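Both points above can be checked on a toy example. The following is a single-process simulation in plain Python, not a distributed implementation. It assumes a two-layer MLP with a ReLU, with the first weight matrix split by columns and the second by rows. All helper names are hypothetical. Summing the per-rank partial outputs (the all-reduce) reproduces the unsharded result, and a reduce-scatter along the sequence dimension followed by an all-gather gives the same answer.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    return [[max(0, v) for v in row] for row in m]

def split_columns(m, parts):
    """Split a matrix into `parts` equal column blocks."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]

def split_rows(m, parts):
    """Split a matrix into `parts` equal row blocks."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]

def all_reduce(partials):
    """Element-wise sum of one matrix per rank; every rank would get the result."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

def reduce_scatter(partials):
    """Sum across ranks, then rank i keeps only the i-th block of rows.

    Rows are sequence positions here, so each rank ends up with a
    sequence shard, as in sequence parallelism.
    """
    return split_rows(all_reduce(partials), len(partials))

def all_gather(shards):
    """Concatenate every rank's row block back into the full matrix."""
    return [row for shard in shards for row in shard]

def tp_mlp_partials(x, w1, w2, tp):
    """Each simulated rank holds a column block of w1 and the matching row block of w2."""
    w1_shards = split_columns(w1, tp)
    w2_shards = split_rows(w2, tp)
    return [matmul(relu(matmul(x, a)), b) for a, b in zip(w1_shards, w2_shards)]


if __name__ == "__main__":
    x = [[1, -2], [3, 4], [0, 1], [2, 2]]      # 4 sequence positions, hidden size 2
    w1 = [[1, 0, -1, 2], [2, 1, 0, -1]]        # 2 x 4
    w2 = [[1, 2], [0, 1], [3, -1], [1, 1]]     # 4 x 2
    reference = matmul(relu(matmul(x, w1)), w2)
    partials = tp_mlp_partials(x, w1, w2, tp=2)
    assert all_reduce(partials) == reference
    assert all_gather(reduce_scatter(partials)) == reference
    print("sharded result matches the unsharded MLP")
```

The column-then-row split works because ReLU is applied element-wise, so each rank can apply it to its own column block without any communication in between.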

4. Pipeline Parallelism (PP)

PP splits the model by depth: each GPU (a pipeline stage) holds a consecutive group of layers.

  • The "Bubble" Problem: GPUs often sit idle waiting for activations from previous stages.
  • Solutions: Advanced schedulers like "1F1B" (one-forward, one-backward) or DeepSeek’s "DualPipe" help interleave operations to minimize idle time.
  • Trade-off: Requires saving activations for multiple micro-batches, necessitating activation checkpointing or CPU offloading.
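The size of the bubble and the activation-memory trade-off can be estimated with simple formulas. The sketch below is not from the talk. It assumes every stage takes the same time per micro-batch and a non-interleaved schedule, in which case an all-forward-then-all-backward schedule (often called GPipe) and 1F1B have the same bubble but different peak activation memory. Function names are illustrative.

```python
def bubble_fraction(num_stages, num_microbatches):
    """Fraction of each GPU's time spent idle in a non-interleaved pipeline.

    Assumes equal per-stage compute time: the schedule lasts
    (m + p - 1) time slots, of which (p - 1) are idle on each stage.
    """
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("both arguments must be positive")
    return (num_stages - 1) / (num_microbatches + num_stages - 1)

def peak_inflight_microbatches(num_stages, num_microbatches, schedule):
    """Most micro-batches whose activations the first stage holds at once."""
    if schedule == "gpipe":   # all forwards, then all backwards
        return num_microbatches
    if schedule == "1f1b":    # backward starts as soon as the pipeline is full
        return min(num_stages, num_microbatches)
    raise ValueError("schedule must be 'gpipe' or '1f1b'")


if __name__ == "__main__":
    stages = 4
    for microbatches in (4, 8, 32):
        frac = bubble_fraction(stages, microbatches)
        held = peak_inflight_microbatches(stages, microbatches, "1f1b")
        print(f"m={microbatches}: bubble={frac:.2%}, 1F1B holds {held} micro-batches")
```

Raising the number of micro-batches shrinks the bubble, but without 1F1B it also raises the number of activations that must be kept, which is where checkpointing or offloading comes in.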

5. Expert Parallelism (EP)

Used for Mixture-of-Experts (MoE) models.

  • Mechanism: Experts are sharded across GPUs. Tokens are routed to specific experts using an all-to-all communication operation.
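  • Challenges: High communication complexity (with $n$ GPUs in the expert-parallel group, every GPU may exchange tokens with every other, giving $O(n^2)$ transfers) and the need for CPU-GPU synchronization, because the number of tokens sent to each expert is known only after the router has run.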
  • Hardware Dependency: Efficient EP often requires high-speed interconnects like InfiniBand (IB) with RDMA to avoid the bottleneck of CPU-side routing calculations.
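The token dispatch can be illustrated with a single-process simulation. The sketch below is not from the talk. It assumes the router's expert choice for each token is already known and that expert e lives on rank e // experts_per_rank, and all names are hypothetical. It shows why the per-destination counts depend on the data: they can only be computed after routing.

```python
def build_send_buffers(tokens_per_rank, num_ranks, experts_per_rank):
    """Group each rank's tokens by the rank that hosts the chosen expert.

    tokens_per_rank[src] is a list of (token, expert_id) pairs produced
    by the router on rank `src`.
    """
    send = [[[] for _ in range(num_ranks)] for _ in range(num_ranks)]
    for src, tokens in enumerate(tokens_per_rank):
        for token, expert_id in tokens:
            dst = expert_id // experts_per_rank
            send[src][dst].append((token, expert_id))
    return send

def all_to_all(send):
    """Simulated all-to-all: rank dst receives send[src][dst] from every src."""
    n = len(send)
    return [[send[src][dst] for src in range(n)] for dst in range(n)]

def tokens_per_destination(send):
    """How many tokens each rank receives; imbalance here means idle GPUs."""
    n = len(send)
    return [sum(len(send[src][dst]) for src in range(n)) for dst in range(n)]


if __name__ == "__main__":
    # Two ranks, two experts per rank (experts 0-1 on rank 0, 2-3 on rank 1).
    routed = [[("a", 0), ("b", 3)], [("c", 3), ("d", 2)]]
    send = build_send_buffers(routed, num_ranks=2, experts_per_rank=2)
    received = all_to_all(send)
    print("received per rank:", received)
    print("load per rank:", tokens_per_destination(send))
    # A second all-to-all returns expert outputs to the originating ranks.
    assert all_to_all(received) == send
```

In a real system the buffer sizes computed here must be known before the transfer is launched, which is the source of the synchronization cost noted above.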

6. Synthesis and Best Practices

  • Orthogonality: These five parallelism strategies are orthogonal and can be combined (e.g., using TP within a node and DP across nodes).
  • Communication Strategy: Keep communication-heavy operations (like TP) within a single node to leverage high-bandwidth interconnects (e.g., NVLink).
  • Load Balancing: In MoE models, use load-balancing loss functions to prevent token-routing imbalances that leave some GPUs idle.
  • Actionable Insight: Do not use the most complex parallelism (like ZeRO-3 or EP) if simpler methods (like ZeRO-1) suffice, as unnecessary communication will degrade training speed.
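One way to combine the strategies while keeping TP inside a node is to make TP the fastest-varying dimension when assigning ranks. The sketch below is not from the talk. It assumes global ranks are numbered consecutively within each node and uses a hypothetical DP x PP x TP layout. The helper names are illustrative and do not belong to any particular framework.

```python
from typing import NamedTuple

class Coord(NamedTuple):
    dp: int
    pp: int
    tp: int

def rank_to_coord(rank, dp_size, pp_size, tp_size):
    """Map a global rank to (dp, pp, tp) with TP as the fastest-varying index."""
    world_size = dp_size * pp_size * tp_size
    if not 0 <= rank < world_size:
        raise ValueError(f"rank must be in [0, {world_size})")
    tp = rank % tp_size
    pp = (rank // tp_size) % pp_size
    dp = rank // (tp_size * pp_size)
    return Coord(dp, pp, tp)

def tp_group(rank, tp_size):
    """Ranks that share this rank's DP and PP coordinates (its TP peers)."""
    start = rank - rank % tp_size
    return list(range(start, start + tp_size))

def group_fits_in_node(group, gpus_per_node):
    """True if every rank in the group lives on the same node.

    Assumes ranks are numbered consecutively within each node.
    """
    return len({r // gpus_per_node for r in group}) == 1


if __name__ == "__main__":
    # Hypothetical cluster: 32 GPUs, 8 per node, DP=2, PP=2, TP=8.
    print(rank_to_coord(13, dp_size=2, pp_size=2, tp_size=8))
    print(group_fits_in_node(tp_group(13, tp_size=8), gpus_per_node=8))
    # A TP degree larger than the node size would spill onto slower links.
    print(group_fits_in_node(tp_group(13, tp_size=16), gpus_per_node=8))
```

With this ordering, a TP degree no larger than the number of GPUs per node (and dividing it evenly) keeps every TP group on the fast intra-node interconnect, leaving the less frequent DP and PP traffic to cross nodes.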

"The biggest thing that we want is we don't want GPU to stay idle... we definitely want the GPU computation stream to be always full." - Nouamane Tazi
