Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 5: GPUs, TPUs
Key Concepts
- GPU Architecture: Streaming Multiprocessors (SM), Warps, Blocks, and Threads.
- Memory Hierarchy: Global Memory (HBM), L2 Cache, L1/Shared Memory, and Registers.
- Roofline Model: A framework that plots attainable throughput against arithmetic intensity, showing where a kernel crosses from memory-bound to compute-bound performance.
- SIMT (Single Instruction, Multiple Threads): The execution model where threads execute the same instruction on different data.
- Tiling: A technique to partition data into smaller chunks that fit into fast shared memory to minimize global memory access.
- Operator Fusion: Combining multiple operations into a single kernel to reduce memory round-trips.
- Flash Attention: An I/O-aware exact attention algorithm that uses tiling and recomputation to achieve sub-quadratic memory access.
1. GPU Hardware Model vs. CPU
- Philosophy: CPUs are designed for low-latency serial execution with complex branching. GPUs are designed for high-throughput parallel execution using thousands of lightweight cores.
- Streaming Multiprocessor (SM): The fundamental compute unit of a GPU. An NVIDIA A100 contains 108 SMs (the full GA100 die has 128).
- Memory Latency: Accessing Global Memory (HBM) is significantly slower (10x+) than accessing L1/Shared memory. Efficient programming requires respecting this hierarchy.
- TPUs vs. GPUs: TPUs are specialized for machine learning, with a few large matrix multiply units (MXUs) instead of the many smaller tensor cores spread across a GPU's SMs. Both rely on a similar memory hierarchy, though the MXUs are built as systolic arrays.
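A minimal sketch of why the hierarchy dominates: the micro-benchmark below (assuming a CUDA device and PyTorch; the tensor size and the 12-bytes-per-element traffic estimate are illustrative) times a single elementwise add and backs out an effective bandwidth. The result sits near the HBM bandwidth limit rather than anywhere near the FLOP limit, which is what "memory-bound" means in practice.

```python
import torch

# Elementwise add: ~1 FLOP per element, but ~12 bytes of HBM traffic per
# element in FP32 (read x, read y, write z), so it is memory-bound.
x = torch.randn(1 << 26, device="cuda")
y = torch.randn(1 << 26, device="cuda")

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

torch.cuda.synchronize()
start.record()
z = x + y
end.record()
torch.cuda.synchronize()

seconds = start.elapsed_time(end) / 1e3          # elapsed_time returns ms
gbytes = 3 * x.numel() * 4 / 1e9                 # read x, read y, write z
print(f"effective bandwidth ≈ {gbytes / seconds:.0f} GB/s")
```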
2. Six Tricks for GPU Optimization
- Avoid Control Divergence: Because GPUs use the SIMT model, if/else statements cause threads within a warp to be masked off while each branch executes, leaving compute lanes idle. Use mathematical masks (e.g., multiplying by 0) instead; see the masking sketch after this list.
- Low-Precision Computation: Moving from FP32 to BF16, FP8, or even FP4 reduces memory-bandwidth requirements. Modern formats like MXFP8 use a scaling factor per small block of elements to maintain precision, though this requires extra bookkeeping (e.g., storing transposed copies).
- Operator Fusion: Combining multiple operations (e.g., sin²x + cos²x) into one kernel prevents redundant reads/writes to global memory; see the fusion sketch after this list.
- Recomputation: In backpropagation, instead of storing massive intermediate activations, recompute them on the fly during the backward pass. This trades extra compute for significant memory savings (sketch below).
- Memory Coalescing: DRAM reads occur in "bursts." If threads in a warp access contiguous memory addresses, they can be serviced in a single burst, maximizing bandwidth.
- Tiling: Breaking large matrices into smaller tiles that fit into shared memory. This allows the GPU to perform multiple operations on the same data without returning to slow global memory; see the tiled matmul sketch after this list.
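A minimal PyTorch sketch of the masking idea, using a leaky-ReLU as a stand-in example (the function names are hypothetical). In a real CUDA or Triton kernel the branchy per-element if/else would force a warp to execute both paths with lanes masked off; the branch-free version keeps every lane doing the same arithmetic.

```python
import torch

# Branchy style: the two sides of the condition are handled separately.
def leaky_branchy(x, slope=0.01):
    out = torch.empty_like(x)
    pos = x > 0
    out[pos] = x[pos]            # "if" side
    out[~pos] = slope * x[~pos]  # "else" side
    return out

# Branch-free style: the condition becomes a 0/1 mask multiplied into the
# arithmetic, so no compute lanes sit idle.
def leaky_masked(x, slope=0.01):
    mask = (x > 0).to(x.dtype)
    return mask * x + (1.0 - mask) * slope * x

x = torch.randn(1024)
assert torch.allclose(leaky_branchy(x), leaky_masked(x))
```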
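A sketch of operator fusion via torch.compile, mirroring the sin² + cos² example above (whether the chain actually fuses into one kernel depends on the compile backend, and a CUDA device is assumed):

```python
import torch

def pointwise(x):
    # Eager mode runs sin, cos, the squares, and the add as separate kernels,
    # each reading and writing the full tensor in global memory.
    return torch.sin(x) ** 2 + torch.cos(x) ** 2

# torch.compile can fuse the elementwise chain into a single generated kernel,
# so x is read from HBM once and the result is written once.
fused = torch.compile(pointwise)

x = torch.randn(10_000_000, device="cuda")
print(fused(x).mean())  # ≈ 1.0, since sin²(x) + cos²(x) = 1
```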
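A sketch of recomputation using torch.utils.checkpoint (the MLP block and its sizes are arbitrary): the activations inside the checkpointed block are discarded after the forward pass and recomputed from the block's input when the backward pass reaches it.

```python
import torch
from torch.utils.checkpoint import checkpoint

# A throwaway MLP block; the 4096 hidden width is arbitrary.
block = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).cuda()

x = torch.randn(32, 1024, device="cuda", requires_grad=True)

# The 32x4096 intermediate activations are not stored for backward;
# they are recomputed on the fly when backward() reaches this block.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```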
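A NumPy sketch of the tiling schedule (the small tile arrays stand in for shared memory, and the tile size of 64 is arbitrary): each loaded tile is reused across a whole block of the output before the loop returns to slow global memory.

```python
import numpy as np

def tiled_matmul(a, b, tile=64):
    """C = A @ B computed tile by tile; each a/b tile would live in shared
    memory on a GPU and be reused `tile` times before being evicted."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    c = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            acc = np.zeros((min(tile, m - i), min(tile, n - j)), dtype=a.dtype)
            for p in range(0, k, tile):
                a_tile = a[i:i + tile, p:p + tile]   # "shared memory" tile of A
                b_tile = b[p:p + tile, j:j + tile]   # "shared memory" tile of B
                acc += a_tile @ b_tile
            c[i:i + tile, j:j + tile] = acc
    return c

a = np.random.randn(256, 256).astype(np.float32)
b = np.random.randn(256, 256).astype(np.float32)
assert np.allclose(tiled_matmul(a, b), a @ b, atol=1e-3)
```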
3. The "Mystery" of Throughput Plots
- Wave Quantization: Performance drops occur when the number of tiles does not perfectly divide into the number of available SMs, leaving hardware idle.
- Alignment/Padding: Matrix dimensions that are not multiples of 16 or 32 can cause misaligned memory reads, breaking coalescing and significantly reducing throughput.
- Roofline Model: The goal is to reach the "flat" part of the curve (compute-bound) by increasing arithmetic intensity, ensuring the compute units are fully saturated.
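A back-of-the-envelope sketch of both effects. The peak-FLOP, bandwidth, and SM figures below are rough A100-class assumptions (not spec values); the point is the shape of the calculation: arithmetic intensity of a matmul versus the ridge point, and how the output-tile count maps onto SMs in waves.

```python
import math

# Rough A100-class numbers, for illustration only (assumptions, not spec):
PEAK_FLOPS = 312e12   # BF16 tensor-core peak, FLOP/s
PEAK_BW = 2.0e12      # HBM bandwidth, bytes/s
NUM_SMS = 108

def matmul_intensity(m, n, k, bytes_per_el=2):
    """FLOPs per byte of HBM traffic for C = A @ B, assuming each matrix is
    read or written exactly once (the ideal, fully tiled case)."""
    flops = 2 * m * n * k
    traffic = (m * k + k * n + m * n) * bytes_per_el
    return flops / traffic

ridge = PEAK_FLOPS / PEAK_BW  # ≈ 156 FLOPs/byte; below this, memory-bound
for n in (128, 1024, 8192):
    ai = matmul_intensity(n, n, n)
    print(f"N={n}: {ai:.0f} FLOPs/byte, compute-bound: {ai > ridge}")

def tail_wave_utilization(m, n, tile=128):
    """Wave quantization: output tiles are scheduled onto SMs in waves of
    NUM_SMS; a partially filled last wave leaves SMs idle."""
    tiles = math.ceil(m / tile) * math.ceil(n / tile)
    last = tiles % NUM_SMS or NUM_SMS
    return last / NUM_SMS

print(tail_wave_utilization(4096, 4096), tail_wave_utilization(4096, 4224))
```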
4. Case Study: Flash Attention
- The Problem: Standard attention is memory-bound due to the $O(N^2)$ memory access required for the attention matrix.
- The Solution:
- Tiling: Breaks the $Q, K, V$ matrices into blocks that fit in SRAM.
- Online Softmax: Computes the softmax in a single pass by tracking running statistics (max and sum of exponentials), allowing the algorithm to process tiles without needing the full matrix in memory.
- Recomputation: Avoids storing the $N \times N$ attention matrix during the forward pass, recomputing it during the backward pass to save memory.
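A minimal sketch of the online-softmax recurrence over key/value tiles (single head, no causal mask; shapes and the block size are arbitrary, and this is not the Flash Attention kernel itself, just the running-statistics trick it relies on). Only a running max `m`, running normalizer `l`, and the output accumulator are kept, so the full N × N score matrix is never materialized.

```python
import torch

def online_softmax_attention(q, k, v, block=128):
    n, d = k.shape
    m = torch.full((q.shape[0],), float("-inf"))   # running max per query
    l = torch.zeros(q.shape[0])                    # running sum of exponentials
    acc = torch.zeros(q.shape[0], v.shape[1])      # running weighted sum of V
    scale = d ** -0.5
    for start in range(0, n, block):
        k_blk = k[start:start + block]             # one tile of keys
        v_blk = v[start:start + block]             # one tile of values
        s = (q @ k_blk.T) * scale                  # scores for this tile only
        m_new = torch.maximum(m, s.max(dim=-1).values)
        p = torch.exp(s - m_new[:, None])          # exponentials re-based to new max
        correction = torch.exp(m - m_new)          # rescale the old statistics
        l = l * correction + p.sum(dim=-1)
        acc = acc * correction[:, None] + p @ v_blk
        m = m_new
    return acc / l[:, None]

q, k, v = torch.randn(64, 64), torch.randn(1024, 64), torch.randn(1024, 64)
ref = torch.softmax((q @ k.T) * 64 ** -0.5, dim=-1) @ v
assert torch.allclose(online_softmax_attention(q, k, v), ref, atol=1e-4)
```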
5. Notable Quotes
- "Modern sort of hardware and LLM optimization is really defined by the memory."
- "The game for the last few years has been how can we bring more compute to bear... and systems is part of that."
- "If you're not a systems person... you kind of need to understand the hardware model of how your model is going to execute in order to make an effective model."
Synthesis
The transition from serial CPU scaling (Dennard scaling) to parallel GPU scaling has shifted the bottleneck from clock speed to memory bandwidth. To achieve high performance, developers must move beyond high-level abstractions and adopt a "systems mindset." This involves minimizing global memory traffic through tiling, fusion, and recomputation, while carefully aligning data structures to hardware-specific constraints like burst sizes and SM counts. Flash Attention serves as the definitive example of how these low-level optimizations can fundamentally improve the scalability of modern AI models.