Stanford CS231N | Spring 2025 | Lecture 11: Large Scale Distributed Training
CS231n Lecture 11: Large-Scale Distributed Training - Summary
Key Concepts:
- GPU Hardware (H100, Tensor Cores, Memory Hierarchy)
- GPU Clusters (Server Racks, GPU Pods)
- Degrees of Parallelism (Data Parallelism, Fully Sharded Data Parallelism, Hybrid Sharded Data Parallelism, Tensor Parallelism, Pipeline Parallelism, Context Parallelism)
- Activation Checkpointing
- Model Flops Utilization (MFU)
- Hardware Flops Utilization (HFU)
1. GPU Hardware
- GPU (Graphics Processing Unit): Specialized coprocessors initially for computer graphics, now generalizable parallel processors. NVIDIA is the leading manufacturer.
- NVIDIA H100: Current mainstay for deep learning training.
- Contains compute cores and 80 GB of HBM (High-Bandwidth Memory).
- Memory bandwidth of ~3 terabytes per second.
- 50 MB of L2 cache close to compute elements for faster access.
- 132 Streaming Multiprocessors (SMs): Independent parallel cores, roughly akin to CPU cores but with different strengths and weaknesses.
- Binning: GPU hardware manufacturing process where imperfect chips are sold with only a guaranteed minimum number of functional cores (e.g., 132 out of 144).
- Streaming Multiprocessor (SM) Details:
- 256 KB of L1 cache and register files.
- Memory Hierarchy: L1 cache (256 KB), L2 cache (50 MB), HBM (80 GB).
- 128 FP32 Cores: Each can compute a fused multiply-add (a*x + b) in one clock cycle (256 floating point operations per SM per clock cycle).
- 4 Tensor Cores: Specialized circuits for matrix multiply (A[16x4] * B[4x8] + bias[16x8]); 1,024 floating point operations per tensor core per clock cycle (4,096 floating point operations per SM per clock cycle). See the back-of-the-envelope check after this list.
- Tensor cores operate in mixed precision (16-bit inputs, 32-bit accumulations).
- Performance Increase:
- K40 (2013): 5 teraflops of FP32 compute.
- B200 (current): 83.3 teraflops of FP32 compute, 5,000 teraflops of mixed-precision compute (tensor cores).
- ~1,000x increase in computation in 12 years.
- Importance of Tensor Cores: The introduction of tensor cores in the V100 GPU significantly increased the throughput of these devices.
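As a sanity check on these figures, the per-SM numbers above roughly reproduce the H100's advertised ~1,000 teraflops of mixed-precision throughput. A minimal sketch (the ~1.8 GHz boost clock is an assumed value, not from the lecture):

```python
# Back-of-the-envelope peak throughput for an H100 from the per-SM numbers above.
sms = 132                      # streaming multiprocessors
flops_per_sm_per_cycle = 4096  # 4 tensor cores x 1,024 FLOPs each
clock_hz = 1.8e9               # assumed boost clock (not from the lecture)

peak_flops = sms * flops_per_sm_per_cycle * clock_hz
print(f"{peak_flops / 1e12:.0f} TFLOPs mixed precision")  # ~973 TFLOPs, near the ~1,000 TFLOPs spec
```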
2. GPU Clusters
- Hierarchy:
- Single H100 GPU: 3 terabytes/second memory bandwidth.
- GPU Server: Typically 8 GPUs, ~900 gigabytes/second GPU-to-GPU communication.
- Server Rack: Two GPU servers (16 GPUs total).
- GPU Pod: 192 racks (3,072 GPUs), ~50 gigabytes/second GPU-to-GPU communication.
- GPU Cluster: Multiple GPU pods; Meta's Llama3 training cluster combines eight GPU pods for a total of 24,576 GPUs.
- Data Center as a Computer: Thinking of the entire data center as one giant computer with combined resources (e.g., 24 exaflops of compute).
- Training Time: Large models typically train for months.
- Physical Space: A single server rack is about the size of a podium. A pod contains 192 racks.
- Cooling: Significant cooling requirements due to the heat generated by tens of thousands of GPUs.
- Alternative Hardware:
- Google TPUs (Tensor Processing Units): Competitive hardware, used for Google's Gemini models. Accessible via Google Cloud.
- AMD MI325X: Training accelerator with comparable specs to H100.
- AWS Trainium: Training chip used by Anthropic.
3. Degrees of Parallelism
- Goal: Split up computation and communication to maximize GPU utilization.
- Six Degrees of Parallelism (for Transformers):
- Data Parallelism (DP): Split the minibatch across GPUs.
- Fully Sharded Data Parallelism (FSDP): Split the model weights across GPUs.
- Hybrid Sharded Data Parallelism (HSDP): Combine FSDP and DP.
- Context Parallelism: Split the sequence dimension across GPUs.
- Pipeline Parallelism: Split the layers across GPUs.
- Tensor Parallelism: Split the weight matrices across GPUs.
3.1 Data Parallelism (DP)
- Each GPU maintains its own copy of the model weights, optimizer state, and gradients.
- Each GPU loads a different minibatch of data.
- Independent forward and backward passes on each GPU.
- All-Reduce Operation: After the backward pass, gradients are averaged across all GPUs using an all-reduce operation.
- Weight update on each GPU with the averaged gradients.
- Gradient communication can overlap with the backward pass: gradients for later layers are all-reduced while gradients for earlier layers are still being computed (see the sketch after this list).
- Limitation: Model size is limited by the memory of a single GPU.
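A minimal sketch of this loop using PyTorch's DistributedDataParallel. The model, data, and hyperparameters are placeholders; it assumes a torchrun launch with one process per GPU:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")               # one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda()    # stand-in for a real model
model = DDP(model, device_ids=[local_rank])   # each rank keeps a full weight copy
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Placeholder data; in practice each rank loads a *different* minibatch.
loader = [(torch.randn(32, 1024), torch.randn(32, 1024)) for _ in range(10)]

for x, y in loader:
    loss = torch.nn.functional.mse_loss(model(x.cuda()), y.cuda())
    loss.backward()   # DDP all-reduces (averages) gradients here,
                      # overlapping communication with the backward pass
    opt.step()
    opt.zero_grad()
```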
3.2 Fully Sharded Data Parallelism (FSDP)
- Each model weight is assigned to an "owner" GPU.
- The owner GPU manages the global gradients and optimizer state for that weight.
- At the beginning of the forward pass, the owner GPU broadcasts the weight to all other GPUs.
- After the forward pass, non-owner GPUs delete their local copy of the weight.
- During the backward pass, GPUs compute local gradients and send them to the owner GPU.
- The owner GPU aggregates the gradients and updates the weight.
- Weight update is only performed by the owner of the weight matrix.
- Communication of weights and computation of backward pass can happen in parallel.
- Optimization: Keep the weights for the last layer in memory to avoid re-broadcasting during the backward pass.
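PyTorch exposes this scheme as FullyShardedDataParallel. A minimal sketch (the stacked linear layers stand in for a real transformer):

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Sequential(  # stand-in for a transformer
    *[torch.nn.Linear(1024, 1024) for _ in range(8)]
).cuda()

# Parameters are sharded across ranks; FSDP all-gathers a unit's weights
# just before use, frees the non-owned shards afterwards, and
# reduce-scatters gradients back to their owners.
model = FSDP(model)

opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss = model(torch.randn(32, 1024, device="cuda")).sum()
loss.backward()
opt.step()
```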
3.3 Hybrid Sharded Data Parallelism (HSDP)
- Combines FSDP and DP.
- GPUs are divided into a two-dimensional grid.
- FSDP is used within groups of GPUs.
- DP is used across the groups.
- Takes advantage of different communication speeds within and across servers.
- Example: FSDP within a server (8 GPUs), DP across servers.
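In recent PyTorch this corresponds to FSDP's HYBRID_SHARD strategy over a 2-D device mesh. A hedged sketch, assuming 4 servers of 8 GPUs each (32 processes total):

```python
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

dist.init_process_group("nccl")

# 2-D grid: shard within a server (fast NVLink), replicate across servers.
mesh = init_device_mesh("cuda", (4, 8), mesh_dim_names=("replicate", "shard"))

model = torch.nn.Sequential(*[torch.nn.Linear(1024, 1024) for _ in range(8)]).cuda()
model = FSDP(
    model,
    device_mesh=mesh,
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,  # FSDP inside, DP across
)
```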
3.4 Activation Checkpointing
- Addresses the memory bottleneck caused by storing activations.
- Recomputes activations during the backward pass instead of storing them.
- Trade-off between computation and memory.
- Full Recomputation: O(N^2) compute, O(1) memory.
- Keeping C evenly spaced checkpoints: O(N^2/C) compute, O(C) memory.
- Common setting: C = sqrt(N), resulting in O(N*sqrt(N)) compute and O(sqrt(N)) memory.
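In PyTorch, torch.utils.checkpoint implements the recompute-instead-of-store trade. A minimal sketch that checkpoints every block of a toy sequential model:

```python
import torch
from torch.utils.checkpoint import checkpoint

blocks = torch.nn.ModuleList([torch.nn.Linear(512, 512) for _ in range(16)])

def forward(x):
    for block in blocks:
        # Inside checkpoint(), activations are not stored; the block is
        # re-run during backward to regenerate them (compute for memory).
        x = checkpoint(block, x, use_reentrant=False)
    return x

x = torch.randn(8, 512, requires_grad=True)
forward(x).sum().backward()
```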
3.5 Context Parallelism
- Splits the sequence dimension across GPUs.
- Most of a transformer block (LayerNorm, the MLP/FFN, residual connections) operates position-wise, so it parallelizes trivially across the sequence.
- Attention mechanism is more complex to parallelize.
- Ring Attention: Chunks queries, keys, and values into blocks and passes key/value blocks around a ring of GPUs so that every query block eventually attends to every key/value block.
- Ulysses Attention: Parallelizes the computation of the attention operator over the heads.
- Used when sequence lengths are very large (e.g., Llama3 with sequence length of 130,000).
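A toy single-process illustration of why the position-wise parts parallelize trivially: applying an MLP to sequence chunks and concatenating matches applying it to the whole sequence. Attention breaks this independence, which is what ring/Ulysses attention address.

```python
import torch

mlp = torch.nn.Linear(64, 64)
x = torch.randn(1, 1024, 64)  # (batch, sequence, hidden)

# Context parallelism for position-wise ops: each chunk could live on a
# different GPU, since the MLP treats every sequence position independently.
chunks = x.chunk(4, dim=1)
sharded = torch.cat([mlp(c) for c in chunks], dim=1)

assert torch.allclose(sharded, mlp(x), atol=1e-6)
```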
3.6 Pipeline Parallelism
- Splits the layers across GPUs.
- Sequential dependencies between layers create "bubbles" of idle GPUs.
- Microbatching: Run multiple microbatches simultaneously to reduce the bubble size.
- Trade-off between MFU and memory usage: more microbatches shrink the bubble (higher MFU) but keep more activations in flight (more memory); see the calculation below.
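A quick calculation using the standard GPipe-style bubble fraction, (p - 1) / (m + p - 1) for p pipeline stages and m microbatches (a textbook formula, not stated in the lecture), shows how more microbatches shrink the bubble:

```python
def bubble_fraction(stages: int, microbatches: int) -> float:
    # Fraction of idle time slots in a GPipe-style pipeline schedule.
    return (stages - 1) / (microbatches + stages - 1)

for m in (1, 4, 16, 64):
    print(f"{m:>3} microbatches, 8 stages -> bubble {bubble_fraction(8, m):.0%}")
# 1 -> 88%, 4 -> 64%, 16 -> 30%, 64 -> 10%
```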
3.7 Tensor Parallelism
- Splits weight matrices across GPUs.
- Each GPU computes a slice of the matrix multiply.
- Requires communication to gather activations.
- Two-Layer Tensor Parallelism: Splits the first weight matrix into column-shaped chunks and the second weight matrix into row-shaped chunks to avoid communication between layers.
- Commonly used on the MLP in a transformer.
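A toy numerical check of the column/row split described above: "GPU" k holds column block A_k of the first matrix and row block B_k of the second, computes act(x @ A_k) @ B_k locally, and a single sum (standing in for the all-reduce) recovers the full MLP output with no communication between the two layers.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64))    # activations
A = rng.standard_normal((64, 256))  # first MLP weight
B = rng.standard_normal((256, 64))  # second MLP weight
relu = lambda t: np.maximum(t, 0)

full = relu(x @ A) @ B              # unsharded reference

# 4-way tensor parallelism: split A by columns, B by rows.
parts = [
    relu(x @ A_k) @ B_k             # each "GPU" computes its slice locally
    for A_k, B_k in zip(np.split(A, 4, axis=1), np.split(B, 4, axis=0))
]
assert np.allclose(sum(parts), full)  # the sum is the all-reduce
```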
3.8 N-D Parallelism
- Combines multiple parallelism techniques (e.g., tensor parallelism, context parallelism, pipeline parallelism, data parallelism).
- Llama3 uses 8-way tensor parallelism, 16-way context parallelism, 16-way pipeline parallelism, and 8-way data parallelism (8 x 16 x 16 x 8 = 16,384 GPUs).
4. Scaling Recipe
- Up to 128 GPUs, 1 Billion Parameters: Data Parallelism. Maximize local batch size per GPU.
- More than 1 Billion Parameters: Fully Sharded Data Parallelism.
- Memory Bottleneck for Activations: Activation Checkpointing.
- More than 256-512 GPUs: Hybrid Sharded Data Parallelism.
- More than 1000 GPUs, 50 Billion Parameters, Sequence Lengths > 10,000: Context Parallelism, Pipeline Parallelism, Tensor Parallelism.
5. Model Flops Utilization (MFU)
- Guiding Light: Optimize MFU to improve training throughput.
- Hardware Flops Utilization (HFU): Fraction of the device's theoretical peak throughput achieved, counting every flop actually executed (including recomputation from activation checkpointing).
- MFU Definition: Fraction of the device's theoretical peak throughput accounted for by the flops the model's forward and backward passes actually require; recomputed flops do not count toward MFU.
- Calculation:
- Compute the number of flops for a full forward and backward pass.
- Look up the peak theoretical throughput of the device.
- Divide the number of flops by the peak throughput to get the theoretical minimum time.
- Time an actual forward and backward pass.
- Divide the theoretical minimum time by the actual time.
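Putting the recipe into numbers, a sketch using the common ~6 x parameters x tokens approximation for forward-plus-backward flops (the model size, batch, and measured step time below are made-up illustrative values):

```python
n_params = 70e9                # made-up model size (70B parameters)
tokens_per_step = 4096 * 2048  # made-up: 4,096 sequences x 2,048 tokens
peak_flops = 989e12            # H100 BF16 dense peak, per GPU
n_gpus = 1024
measured_step_time = 10.0      # made-up: seconds per optimizer step

model_flops = 6 * n_params * tokens_per_step  # ~6 FLOPs/param/token, fwd + bwd
t_min = model_flops / (peak_flops * n_gpus)   # theoretical minimum step time
mfu = t_min / measured_step_time
print(f"MFU = {mfu:.1%}")                     # ~35%: "pretty good" by the guidelines below
```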
- MFU Guidelines:
- Above 30%: Pretty good.
- Above 40%: Excellent.
- Llama3 MFU: High 30s, low 40s.
- Paradox: More recent devices sometimes get worse MFUs due to the growing gap between compute and communication speeds.
6. Conclusion
Individual GPUs are generalizable parallel computing machines. GPU clusters are giant massively parallel machines. Effective large-scale distributed training requires careful consideration of hardware, parallelism strategies, and optimization of Model Flops Utilization.