Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 10: Inference
By Stanford Online
Key Concepts
- Inference: The process of using a trained model to generate responses from prompts.
- KV Cache: A memory buffer storing Key and Value tensors for previous tokens to avoid redundant re-computation during auto-regressive generation.
- Arithmetic Intensity: The ratio of floating-point operations (FLOPs) to memory bytes transferred; high intensity indicates compute-bound tasks, while low intensity indicates memory-bound tasks.
- TTFT (Time to First Token): The latency before the first token is generated, critical for interactive user experiences.
- Throughput: The number of tokens generated per second across multiple concurrent requests.
- Continuous Batching: A technique to dynamically update batches by adding new requests as others finish, maximizing hardware utilization.
- PagedAttention: A memory management technique (similar to OS virtual memory) that stores KV caches in non-contiguous blocks to eliminate fragmentation.
- Speculative Decoding: Using a small, fast "draft" model to generate tokens, which are then verified in parallel by a larger "target" model.
1. The Importance and Challenges of Inference
Inference is a repeated, daily cost, unlike the one-time cost of training. As AI moves toward "agentic" workflows—where models reason, call tools, and introspect—the number of tokens generated per query increases significantly, making efficiency paramount.
- The Fundamental Bottleneck: Unlike training, where all tokens are processed in parallel, inference is auto-regressive (sequential). This prevents full parallelization across the sequence dimension, leading to lower arithmetic intensity.
- Memory vs. Compute: The decode phase of inference is primarily memory-bound. Because the model must load its parameters and the KV cache from High Bandwidth Memory (HBM) for every generated token, speed is limited by memory bandwidth rather than raw compute power.
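A back-of-envelope sketch of the memory-bound claim, assuming a single weight matrix multiply in 2-byte precision. The layer size (4096 x 4096) and the hardware figures are illustrative assumptions, not numbers from the lecture.

```python
def matmul_intensity(batch, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte moved for (batch x d_in) @ (d_in x d_out)."""
    flops = 2 * batch * d_in * d_out  # one multiply and one add per weight per row
    # Read the activations and weights, write the outputs.
    bytes_moved = bytes_per_elem * (batch * d_in + d_in * d_out + batch * d_out)
    return flops / bytes_moved

# Assumed accelerator figures, for illustration only.
ASSUMED_PEAK_FLOPS = 989e12      # FLOP/s
ASSUMED_MEM_BANDWIDTH = 3.35e12  # bytes/s
RIDGE = ASSUMED_PEAK_FLOPS / ASSUMED_MEM_BANDWIDTH  # FLOPs per byte needed to saturate compute

def regime(batch, d_in=4096, d_out=4096):
    return "compute-bound" if matmul_intensity(batch, d_in, d_out) > RIDGE else "memory-bound"

for b in (1, 16, 256, 4096):
    print(f"batch={b:5d}  intensity={matmul_intensity(b, 4096, 4096):8.1f}  {regime(b)}")
```

Under these assumptions the ridge point is roughly 295 FLOPs per byte, while a batch of one (a single decode step for one request) has an intensity just under 1, so the hardware spends most of its time waiting on memory.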
2. Metrics and Trade-offs
- Latency vs. Throughput: Increasing batch size ($B$) improves throughput by amortizing the cost of loading model parameters, but it worsens individual query latency due to the increased size of the KV cache and the need to wait for other requests in the batch.
- TTFT: Primarily determined by the "prefill" stage, where the prompt is processed. This stage is compute-bound and parallelizable.
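The latency/throughput trade-off can be sketched with a simplified decode model, assuming each step is purely memory-bound: every step reads all parameters once plus one KV cache per sequence in the batch. The model size, per-sequence KV cache size, and bandwidth below are hypothetical values chosen for illustration, and the model ignores compute time and scheduling overhead.

```python
def decode_step_seconds(batch, param_bytes, kv_bytes_per_seq, mem_bandwidth):
    """Time for one decode step if memory traffic is the only cost."""
    return (param_bytes + batch * kv_bytes_per_seq) / mem_bandwidth

def tokens_per_second(batch, param_bytes, kv_bytes_per_seq, mem_bandwidth):
    """Each step produces one token per sequence in the batch."""
    return batch / decode_step_seconds(batch, param_bytes, kv_bytes_per_seq, mem_bandwidth)

# Hypothetical setup: 7e9 parameters at 2 bytes each, 0.5 GB of KV cache per sequence.
PARAM_BYTES = 14e9
KV_BYTES_PER_SEQ = 0.5e9
MEM_BANDWIDTH = 3.35e12  # bytes/s, assumed

for b in (1, 8, 64, 256):
    step = decode_step_seconds(b, PARAM_BYTES, KV_BYTES_PER_SEQ, MEM_BANDWIDTH)
    tput = tokens_per_second(b, PARAM_BYTES, KV_BYTES_PER_SEQ, MEM_BANDWIDTH)
    print(f"B={b:4d}  per-token latency={step * 1e3:7.2f} ms  throughput={tput:9.0f} tok/s")
```

In this model the parameter-loading cost is shared by the whole batch, so throughput rises with B, but the KV cache term grows linearly with B, so per-token latency rises too and throughput eventually saturates.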
3. Techniques to Improve Inference Efficiency
Architectural Optimizations
- Grouped Query Attention (GQA): Reduces the size of the KV cache by sharing keys and values across multiple query heads, significantly improving speed with minimal accuracy loss.
- Multi-head Latent Attention (MLA): Used in DeepSeek-V2, this compresses keys and values into a lower-dimensional latent vector, drastically reducing the memory footprint of the KV cache.
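- Sliding Window & Linear Attention: Sliding-window attention limits how far back each token can attend, while linear-attention and state-space layers (e.g., Mamba, DeltaNet) replace the growing KV cache with a fixed-size compressed state, keeping memory usage independent of sequence length.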
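To make the KV cache savings concrete, here is a small sizing sketch. The configuration (32 layers, 32 query heads, head dimension 128, 4096 tokens, 2-byte elements) is a hypothetical example, not a specific model from the lecture.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer and token."""
    keys_and_values = 2  # one K tensor and one V tensor per layer
    return keys_and_values * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Multi-head attention: every query head has its own K/V head.
mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096)
# Grouped Query Attention: 32 query heads share 8 K/V heads (groups of 4).
gqa = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=4096)

print(f"MHA: {mha / 2**30:.2f} GiB per sequence")  # 2.00 GiB
print(f"GQA: {gqa / 2**30:.2f} GiB per sequence")  # 0.50 GiB
```

The cache shrinks in direct proportion to the number of K/V heads, which both frees memory for larger batches and reduces the bytes read per decode step.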
Systems and Algorithmic Optimizations
- Quantization: Reducing precision (e.g., FP16 to INT4) lowers memory usage and the number of bytes that must be loaded per token. Techniques like GPTQ and Activation-aware Weight Quantization (AWQ) help maintain accuracy, for example by preserving precision for the most important weight channels.
- Model Pruning: Removing redundant hidden units or layers and "healing" the model through post-training distillation.
- Speculative Decoding: A "draft" model generates a sequence of tokens, and the "target" model verifies them in a single parallel pass. This provides a speedup while maintaining the exact output distribution of the target model.
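The accept/reject rule behind speculative decoding can be sketched on toy categorical distributions. This is a minimal illustration under stated assumptions: `target_probs` and `draft_probs` are hypothetical stand-ins for the two models, they ignore the context to keep the example small, and the target is called once per position for clarity, whereas a real system would score all drafted positions in one forward pass.

```python
import random

def sample(weights, rng):
    """Draw an index in proportion to (possibly unnormalized) weights."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

def speculative_step(context, target_probs, draft_probs, k, rng):
    """Draft k tokens, verify them against the target, return 1 to k+1 tokens."""
    # Draft phase: the small model proposes k tokens auto-regressively.
    drafted, draft_dists, ctx = [], [], list(context)
    for _ in range(k):
        q = draft_probs(ctx)
        x = sample(q, rng)
        drafted.append(x)
        draft_dists.append(q)
        ctx.append(x)

    # Verify phase: accept each drafted token with probability min(1, p/q).
    out, ctx = [], list(context)
    for x, q in zip(drafted, draft_dists):
        p = target_probs(ctx)
        if rng.random() < min(1.0, p[x] / q[x]):
            out.append(x)
            ctx.append(x)
        else:
            # On rejection, resample from the residual max(p - q, 0) and stop.
            residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
            out.append(sample(residual, rng))
            return out
    # Every draft was accepted: take one extra token from the target for free.
    out.append(sample(target_probs(ctx), rng))
    return out

# Toy, context-independent distributions over a 3-token vocabulary (assumed).
TARGET = [0.6, 0.3, 0.1]
DRAFT = [0.2, 0.5, 0.3]
target_probs = lambda ctx: TARGET
draft_probs = lambda ctx: DRAFT

rng = random.Random(0)
print(speculative_step([], target_probs, draft_probs, k=4, rng=rng))
```

The residual resampling is what keeps the output distribution equal to the target's: a token is emitted with probability min(p, q) through acceptance and max(p - q, 0) through the rejection path, which sums to p.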
4. Dynamic Workload Management
- Continuous Batching: Prevents idle hardware by inserting new requests into the batch as soon as others complete, rather than waiting for the entire batch to finish.
- PagedAttention: Solves memory fragmentation by storing KV caches in non-contiguous blocks. This allows for efficient memory sharing (e.g., sharing the KV cache of a common system prompt across multiple users).
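The block-table bookkeeping behind paged KV storage can be sketched as follows. This is a simplified illustration, not the actual implementation of any serving system: the class and method names are hypothetical, and it tracks only which physical blocks each sequence owns, without storing any tensors.

```python
class PagedKVAllocator:
    """Maps each sequence to a list of fixed-size physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.num_blocks = num_blocks
        self.block_size = block_size       # tokens per block
        self.free = list(range(num_blocks))
        self.refcount = {}                 # physical block id -> number of owners
        self.tables = {}                   # seq id -> list of physical block ids
        self.lengths = {}                  # seq id -> tokens stored

    def _alloc(self):
        if not self.free:
            raise MemoryError("no free KV blocks")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def add_sequence(self, seq_id):
        self.tables[seq_id] = []
        self.lengths[seq_id] = 0

    def fork(self, parent_id, child_id):
        """Share the parent's blocks, e.g. for a common system prompt."""
        self.tables[child_id] = list(self.tables[parent_id])
        self.lengths[child_id] = self.lengths[parent_id]
        for block in self.tables[child_id]:
            self.refcount[block] += 1

    def append_token(self, seq_id):
        table, n = self.tables[seq_id], self.lengths[seq_id]
        if n % self.block_size == 0:
            table.append(self._alloc())    # last block is full (or there is none)
        elif self.refcount[table[-1]] > 1:
            # Copy-on-write: do not write into a partially filled shared block.
            self.refcount[table[-1]] -= 1
            table[-1] = self._alloc()
        self.lengths[seq_id] = n + 1

    def free_sequence(self, seq_id):
        for block in self.tables.pop(seq_id):
            self.refcount[block] -= 1
            if self.refcount[block] == 0:
                del self.refcount[block]
                self.free.append(block)
        del self.lengths[seq_id]

    def blocks_in_use(self):
        return self.num_blocks - len(self.free)
```

Because blocks are allocated on demand rather than reserved up front for a maximum sequence length, the only wasted space is the unfilled tail of each sequence's last block, and forked sequences pay nothing for the shared prefix.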
5. Synthesis and Conclusion
Inference efficiency is a multi-layered challenge requiring a combination of architectural innovation and systems engineering. The primary goal is to reduce the KV cache size and increase arithmetic intensity without sacrificing model accuracy. While Transformers are currently the standard, their auto-regressive nature makes them inherently "inference-unfriendly." Future breakthroughs may lie in new architectures (like State Space Models) designed specifically for efficient inference, alongside continued refinement of systems-level techniques like PagedAttention and speculative decoding.