Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 7: Parallelism

Key Concepts

  • Collective Operations: Standardized communication primitives (Broadcast, Scatter, Gather, Reduce, All-Gather, Reduce-Scatter, All-Reduce, All-to-All) used to coordinate data/compute across multiple devices.
  • Parallelism Strategies:
    • Data Parallelism (DDP): Splitting data across GPUs while replicating the model.
    • Tensor Parallelism: Splitting individual model layers (e.g., matrix multiplication) across GPUs.
    • Pipeline Parallelism: Splitting model layers sequentially across GPUs.
  • Hardware Interconnects: NVLink/NVSwitch (high-speed, intra-node), InfiniBand (inter-node), and Ethernet (standard networking).
  • RDMA (Remote Direct Memory Access): Technology allowing GPUs to read/write directly to other GPU memories, bypassing the CPU.
  • NCCL (NVIDIA Collective Communications Library): The low-level library that translates collective operations into hardware-specific kernels.
  • Pipeline Bubbles: Inefficiency in pipeline parallelism where GPUs sit idle waiting for data from previous stages.

1. Collective Operations: Building Blocks

Collective operations are standardized communication primitives executed jointly by a group of "ranks" (devices); a minimal code sketch of the core operations follows the list below.

  • Warm-up Operations:
    • Broadcast: Sends data from one rank to all others.
    • Scatter: Splits a tensor at one rank and distributes pieces to others.
    • Gather: The inverse of scatter; collects pieces from all ranks to one.
    • Reduce: Performs an associative operation (e.g., sum, min, max) on data across ranks and stores the result on one rank.
  • Core Training Operations:
    • All-Gather: Gathers data from all ranks and distributes the full result to all ranks.
    • Reduce-Scatter: Performs a reduction and then distributes the resulting shards across ranks.
    • All-Reduce: Performs a reduction and replicates the result on all ranks. This is the backbone of Data Parallelism.
    • All-to-All: A general communication pattern where each rank sends specific data to every other rank; essential for Mixture of Experts (MoE) models.
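
A minimal sketch of the core operations, assuming a PyTorch/NCCL setup launched with `torchrun --nproc_per_node=N`; the tensor contents are purely illustrative:

```python
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")        # NCCL backs the GPU collectives
    rank = dist.get_rank()
    world = dist.get_world_size()
    device = torch.device("cuda", rank % torch.cuda.device_count())
    torch.cuda.set_device(device)

    # All-Reduce: every rank ends up with the sum over all ranks.
    x = torch.ones(4, device=device) * rank
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    # All-Gather: every rank receives every rank's shard.
    shard = torch.full((2,), float(rank), device=device)
    gathered = [torch.empty_like(shard) for _ in range(world)]
    dist.all_gather(gathered, shard)

    # Reduce-Scatter: reduce across ranks, then each rank keeps one shard.
    inputs = [torch.ones(2, device=device) * rank for _ in range(world)]
    out = torch.empty(2, device=device)
    dist.reduce_scatter(out, inputs, op=dist.ReduceOp.SUM)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```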

2. Hardware Hierarchy and Networking

The efficiency of distributed training is dictated by the physical distance and bandwidth between compute units:

  • Intra-node: GPUs are connected via NVLink and NVSwitch, providing massive bandwidth (e.g., 1.8 TB/s for NVLink 5).
  • Inter-node: Clusters use InfiniBand or Ethernet. Standard Ethernet is slower and typically requires CPU intervention, whereas RoCE (RDMA over Converged Ethernet) allows Ethernet to bypass the CPU, mimicking InfiniBand performance.
  • Scaling: As the number of GPUs increases, communication overhead grows. The goal is to orchestrate computation so that data transfer never becomes the bottleneck, since HBM (High Bandwidth Memory) is significantly faster than any network interconnect; a back-of-envelope estimate follows this list.
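
A back-of-envelope sketch of why link bandwidth dominates: a ring all-reduce moves roughly 2(N-1)/N of the payload over each rank's slowest link. The model size and bandwidth figures below are illustrative assumptions, not numbers from the lecture:

```python
def allreduce_seconds(num_params, bytes_per_param, num_gpus, link_gb_per_s):
    """Estimate ring all-reduce time for one full gradient synchronization."""
    payload = num_params * bytes_per_param                # bytes of gradients
    traffic = 2 * (num_gpus - 1) / num_gpus * payload     # ring all-reduce volume per rank
    return traffic / (link_gb_per_s * 1e9)                # seconds at the given bandwidth

# 7e9 bf16 gradients on 8 GPUs: a fast intra-node link vs. a slower inter-node link.
print(allreduce_seconds(7e9, 2, 8, 900))   # ~0.03 s at an assumed ~900 GB/s
print(allreduce_seconds(7e9, 2, 8, 50))    # ~0.5 s at an assumed ~50 GB/s
```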

3. Parallelism Methodologies

Data Parallelism (DDP)

  • Mechanism: The batch is divided into $N$ pieces (where $N$ is the number of GPUs). Each GPU processes its local batch and computes gradients.
  • Synchronization: After the backward pass, an All-Reduce operation averages the gradients across all GPUs (sketched after this list).
  • Result: Every GPU updates its local copy of the model parameters identically.
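
A minimal sketch of the synchronization step, assuming a generic PyTorch model, loss, and optimizer; in practice `torch.nn.parallel.DistributedDataParallel` issues these all-reduces automatically and overlaps them with the backward pass:

```python
import torch.distributed as dist

def ddp_step(model, loss_fn, batch, optimizer):
    world = dist.get_world_size()
    loss = loss_fn(model(batch["x"]), batch["y"])     # forward on the local micro-batch
    loss.backward()                                   # local gradients only
    for p in model.parameters():                      # average gradients across all ranks
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= world
    optimizer.step()                                  # identical update on every rank
    optimizer.zero_grad(set_to_none=True)
```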

Tensor Parallelism

  • Mechanism: Individual layers (e.g., weight matrices) are split across GPUs.
  • Communication: Requires frequent communication (All-Gather/Reduce-Scatter) during the forward and backward passes to synchronize activations (see the sketch after this list).
  • Constraint: Because of the high communication volume, this is typically restricted to intra-node setups using high-speed NVLink.
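
A minimal forward-pass sketch of a column-parallel linear layer (Megatron-style), assuming the weight's output dimension is sharded evenly across ranks; the function name and shapes are illustrative, not the lecture's code:

```python
import torch
import torch.distributed as dist

def column_parallel_linear(x, w_shard):
    # x:       (batch, d_in), replicated on every rank
    # w_shard: (d_in, d_out // world_size), this rank's column slice of W
    world = dist.get_world_size()
    local_out = x @ w_shard                            # (batch, d_out // world_size)
    shards = [torch.empty_like(local_out) for _ in range(world)]
    dist.all_gather(shards, local_out)                 # collect every rank's columns
    return torch.cat(shards, dim=-1)                   # full (batch, d_out) activation
```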

Pipeline Parallelism

  • Mechanism: The model is split by layers. GPU 0 handles layers 1–10, GPU 1 handles 11–20, etc.
  • Optimization: Uses micro-batches to keep all GPUs busy and reduce "pipeline bubbles" (idle time).
  • Communication: Uses point-to-point send and receive operations between adjacent stages (sketched below).
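
A minimal sketch of one stage's forward step using point-to-point `send`/`recv`, assuming a single micro-batch of fixed shape; real schedules (e.g., GPipe or 1F1B) interleave many micro-batches to shrink the bubble:

```python
import torch
import torch.distributed as dist

def pipeline_stage_forward(stage_module, micro_batch_shape, device):
    rank, world = dist.get_rank(), dist.get_world_size()
    if rank == 0:
        x = torch.randn(micro_batch_shape, device=device)   # first stage reads the data
    else:
        x = torch.empty(micro_batch_shape, device=device)
        dist.recv(x, src=rank - 1)                           # activations from the previous stage
    y = stage_module(x)                                      # this stage's slice of layers
    if rank < world - 1:
        dist.send(y, dst=rank + 1)                           # hand off to the next stage
    return y
```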

4. Key Arguments and Insights

  • Communication vs. Computation: The primary challenge in multi-GPU training is overlapping communication with computation; a GPU that is waiting for data from another rank is wasting cycles (see the asynchronous-collective sketch after this list).
  • Hardware Dependency: The choice of parallelism is hardware-bound. Tensor parallelism is only viable with high-bandwidth interconnects (NVLink), while pipeline parallelism can tolerate slower networks.
  • Efficiency: "It is very easy to use a ton of GPUs, but it is hard to use them effectively." Effective scaling requires balancing the memory footprint of parameters/gradients against the communication cost of synchronization.
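
A minimal sketch of overlap using asynchronous collectives: `async_op=True` returns a handle, so the GPU can keep computing while the gradient is in flight (the helper names are illustrative):

```python
import torch.distributed as dist

def overlapped_allreduce(grad, compute_next_layer):
    handle = dist.all_reduce(grad, op=dist.ReduceOp.SUM, async_op=True)
    out = compute_next_layer()     # keep the GPU busy while the gradient moves
    handle.wait()                  # block only when the reduced gradient is needed
    return out, grad
```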

5. Synthesis/Conclusion

Distributed training is a balancing act between memory capacity and communication bandwidth. While Data Parallelism (DDP) is the most straightforward approach for scaling, it is limited by the memory capacity of a single GPU. Advanced techniques like Tensor and Pipeline parallelism allow for the training of models that exceed the memory of a single device, but they introduce significant complexity in bookkeeping and communication management. The future of scaling lies in hierarchical strategies—combining these methods to match the specific topology of the underlying hardware cluster.
