Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 2: PyTorch (einops)

By Stanford Online

Key Concepts

  • Resource Accounting: The practice of managing compute, memory, and data constraints to maximize training efficiency.
  • Tensors: The fundamental data structures (storing parameters, gradients, optimizer states, activations).
  • Precision Formats: FP32 (Single Precision), FP16, BF16 (Brain Floating Point), FP8, and FP4.
  • Arithmetic Intensity: The ratio of floating-point operations (FLOPs) to bytes transferred; determines if a process is compute-bound or memory-bound.
  • Roofline Model: A visual framework to analyze hardware performance limits based on arithmetic intensity.
  • MFU (Model FLOPs Utilization): The ratio of actual achieved FLOPs/s to the theoretical peak FLOPs/s of the hardware.
  • Optimization Techniques: Gradient Accumulation and Activation Checkpointing (Rematerialization).

1. Resource Accounting and Hardware Metrics

The primary goal in training large models is maximizing computational efficiency under finite resources.

  • FLOPs vs. FLOPs/s: The lecturer distinguishes between FLOPs (the total count of floating-point operations performed) and FLOPs/s (floating-point operations per second, a hardware throughput metric).
  • MFU Calculation: MFU is defined as $\frac{\text{Actual FLOPs/s}}{\text{Promised FLOPs/s}}$. A typical target for modern LLM training is ~0.5 (50%).
  • Hardware Specs: H100 GPUs have 80GB of HBM (High Bandwidth Memory). When estimating training time, note that the "promised" peak FLOPs/s on spec sheets typically assumes structured sparsity; dense matrix multiplications achieve roughly half that figure, so the promised speed is often halved in practice.
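
A back-of-envelope sketch of this accounting (the model size, token count, and cluster size below are illustrative assumptions, not figures from the lecture):

```python
# Back-of-envelope training-time estimate with hypothetical numbers.
peak_bf16_sparse = 1979e12          # H100 spec-sheet BF16 FLOP/s (sparse)
dense_peak = peak_bf16_sparse / 2   # halve for dense matmuls: ~989 TFLOP/s

params = 70e9                       # hypothetical parameter count
tokens = 1e12                       # hypothetical number of training tokens
total_flops = 6 * params * tokens   # the 6x rule (Section 4)

mfu = 0.5                           # typical target from the lecture
num_gpus = 1024                     # hypothetical cluster size
achieved_flops_per_s = mfu * dense_peak * num_gpus

days = total_flops / achieved_flops_per_s / 86400
print(f"~{days:.1f} days on {num_gpus} H100s at MFU = {mfu}")
```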

2. Tensor Precision and Memory

Memory usage is a function of the number of elements and the size of the data type.

  • FP32: 4 bytes per element. Standard but memory-intensive.
  • BF16: Developed to solve the instability (overflow/underflow) of FP16. It maintains the same dynamic range as FP32 but with lower precision, making it the "sweet spot" for deep learning.
  • Mixed Precision: A common practice where parameters, activations, and gradients are stored in BF16, while optimizer states are kept in FP32 for numerical stability.
  • FP4: A newer, highly compressed 4-bit format that uses block scaling (a shared scale factor per small block of values) to maintain dynamic range, used in models like NeMo-3 Super.
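
A short sketch of these trade-offs in PyTorch (the 7B model size in the memory estimate is a hypothetical example):

```python
import torch

# Storage cost and dynamic range of the common precisions.
for dtype in [torch.float32, torch.float16, torch.bfloat16]:
    info = torch.finfo(dtype)
    bytes_per_elt = torch.tensor([], dtype=dtype).element_size()
    print(f"{str(dtype):>15}: {bytes_per_elt} bytes/element, "
          f"max={info.max:.2e}, smallest normal={info.tiny:.2e}")
# FP32 and BF16 share an 8-bit exponent, so both max out near 3.4e38;
# FP16's max is only 65504, which is why it over/underflows more easily.

# Mixed-precision memory per the scheme above (Adam keeps two FP32
# moment estimates per parameter); hypothetical 7B-parameter model:
N = 7e9
bytes_per_param = 2 + 2 + 4 * 2   # BF16 params + BF16 grads + FP32 m, v
print(f"~{N * bytes_per_param / 1e9:.0f} GB before activations")
```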

3. Computational Efficiency: Arithmetic Intensity

  • Memory Bound vs. Compute Bound:
    • Memory Bound: Communication time > Computation time (e.g., ReLU, GELU, dot products, matrix-vector products).
    • Compute Bound: Computation time > Communication time (e.g., large matrix-matrix multiplications).
  • Accelerator Intensity: For H100s, the ratio of peak compute to memory bandwidth is ~300, meaning the hardware can perform about 300 floating-point operations per byte transferred. If an algorithm's arithmetic intensity falls below this threshold, it is memory-bound.
  • Strategy: To improve efficiency, one must increase arithmetic intensity, typically by using larger batch sizes or larger matrices to ensure the GPU is saturated with compute tasks.
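
A minimal sketch of this diagnosis for matrix multiplication (shapes are arbitrary examples; the ~300 threshold is the lecture's approximate H100 figure):

```python
# Arithmetic intensity of C = A @ B with A (m x k), B (k x n), in BF16.
def intensity(m: int, k: int, n: int, bytes_per_elt: int = 2) -> float:
    flops = 2 * m * k * n                                  # multiply-add = 2 FLOPs
    bytes_moved = bytes_per_elt * (m * k + k * n + m * n)  # read A, B; write C
    return flops / bytes_moved

H100_INTENSITY = 300  # approximate compute/bandwidth ratio from the lecture

print(intensity(4096, 4096, 4096))  # ~1365 -> well above 300: compute-bound
print(intensity(1, 4096, 4096))     # ~1    -> matrix-vector: memory-bound
```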

4. Training Mechanics: FLOPs and Backprop

  • The 6x Rule: The total FLOPs for a training step is approximately $6 \times \text{Parameters} \times \text{Tokens}$ (sanity-checked in the first sketch after this list).
    • Forward Pass: $2 \times \text{Params} \times \text{Tokens}$.
    • Backward Pass: $4 \times \text{Params} \times \text{Tokens}$ (calculating gradients for both parameters and inputs).
  • Einsum/Einops: The lecturer advocates for einops over standard PyTorch indexing to avoid confusion with transposes and dimension manipulation. It allows for named dimensions and cleaner, more modular code.
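
A toy sanity check of the 2x/4x split on a single linear layer (arbitrary shapes, counting matmul FLOPs by hand rather than measuring them):

```python
import torch

B, D_in, D_out = 512, 1024, 1024    # tokens, input dim, output dim
params = D_in * D_out

x = torch.randn(B, D_in, requires_grad=True)
w = torch.randn(D_in, D_out, requires_grad=True)

y = x @ w             # forward: 2 * B * params FLOPs
y.sum().backward()    # backward: grad_w = x.T @ grad_y   (2 * B * params)
                      #           grad_x = grad_y @ w.T   (2 * B * params)

forward_flops = 2 * B * params
backward_flops = 4 * B * params     # two matmuls, one per gradient
assert forward_flops + backward_flops == 6 * B * params
```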
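And a short example of the named-dimension style the lecturer advocates, using einsum and rearrange from the einops package (illustrative shapes):

```python
import torch
from einops import einsum, rearrange

# Attention scores with named dimensions instead of transposes
# (shapes: batch, heads, sequence positions, head dim).
q = torch.randn(2, 8, 128, 64)
k = torch.randn(2, 8, 128, 64)

scores = einsum(q, k, "b h q d, b h k d -> b h q k")  # no .transpose(-1, -2)
merged = rearrange(q, "b h s d -> b s (h d)")          # named reshape, no .view
print(scores.shape, merged.shape)  # (2, 8, 128, 128) and (2, 128, 512)
```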

5. Memory Optimization Frameworks

  • Gradient Accumulation: Running multiple micro-batches and accumulating gradients before performing an optimizer step. This allows for larger effective batch sizes without exceeding memory limits (see the first sketch after this list).
  • Activation Checkpointing (Rematerialization): Trading compute for memory by discarding intermediate activations during the forward pass and recomputing them during the backward pass.
    • Trade-off: Checkpointing every $\sqrt{L}$-th layer of an $L$-layer network reduces activation memory from $O(L)$ to $O(\sqrt{L})$ while adding roughly one extra forward pass of recomputation.
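
A minimal sketch of gradient accumulation (toy model, data, and hyperparameters assumed for illustration):

```python
import torch

# Four micro-batches of 8 emulate one effective batch of 32.
model = torch.nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
accum_steps = 4

optimizer.zero_grad()
for _ in range(accum_steps):
    x, y = torch.randn(8, 16), torch.randn(8, 1)      # one micro-batch
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()   # scale so accumulated grads average
optimizer.step()                      # one optimizer step per accum_steps
optimizer.zero_grad()
```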
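And a sketch of activation checkpointing using PyTorch's torch.utils.checkpoint (the block shown is an arbitrary toy example):

```python
import torch
from torch.utils.checkpoint import checkpoint

# The block's intermediate activations are not saved during the forward
# pass; they are recomputed when backward reaches this block.
block = torch.nn.Sequential(
    torch.nn.Linear(256, 256), torch.nn.GELU(),
    torch.nn.Linear(256, 256), torch.nn.GELU(),
)
x = torch.randn(32, 256, requires_grad=True)

y = checkpoint(block, x, use_reentrant=False)  # forward, internals discarded
y.sum().backward()                             # block re-runs here
```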

6. Notable Quotes

  • "The point is not to precisely calculate every single thing, but just get the rough shape of things."
  • "If you see that number [0.25 arithmetic intensity], you should say, 'Oh, this is really bad.'"
  • "Transformers are essentially big matrix multiplications with some things sprinkled in between."

Synthesis

The lecture establishes that training efficiency is governed by the interplay between memory bandwidth and compute throughput. By understanding the "arithmetic intensity" of operations, practitioners can diagnose bottlenecks. While matrix multiplications are naturally compute-bound and efficient, most other operations are memory-bound. Techniques like mixed precision, gradient accumulation, and activation checkpointing are essential tools to navigate these constraints, allowing for the training of massive models on finite hardware.
