Stanford CS221 | Autumn 2025 | Lecture 17: Language Models

Key Concepts

Language Modeling: The task of predicting the next token/word in a sequence, effectively modeling the probability distribution of language.
Auto-regression: A process where the model predicts one token at a time, using the output of the previous step as input for the next.
Transformer Architecture: A neural network design utilizing attention mechanisms that allows for scalable, parallelizable, and context-aware sequence processing.
Pre-training vs. Post-training: The two-stage process of first learning general patterns from massive datasets (pre-training) and then refining the model for specific instruction-following or safety behaviors (post-training).
Scaling Laws: The empirical observation that increasing compute, data, and parameter counts leads to a consistent, predictable decrease in test loss.
RLHF (Reinforcement Learning from Human Feedback): A technique used in post-training to align model outputs with human preferences using a reward model.
Tokenization: The process of breaking text into subword units (e.g., Byte Pair Encoding) to handle vocabulary efficiently.
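
To make the tokenization concept above concrete, here is a minimal sketch of how Byte Pair Encoding learns its merges: it repeatedly finds the most frequent adjacent symbol pair and fuses it into a new symbol. The toy word counts and the function names (`most_frequent_pair`, `merge_pair`, `learn_bpe`) are illustrative assumptions, not taken from the lecture, and production tokenizers operate on bytes with many additional details.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # max() returns the first pair seen among ties, so the result is deterministic.
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus_counts, num_merges):
    """Start from single characters and apply `num_merges` greedy merges."""
    words = {tuple(word): count for word, count in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words

# Hypothetical toy corpus: word -> frequency.
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(toy_counts, num_merges=3)
print(merges)
print(segmented)
```

On this toy corpus the first two merges are `('e', 's')` and then `('es', 't')`, so the frequent suffix "est" becomes a single subword unit shared by "newest" and "widest".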

1. The Scale and Industrialization of Language Models

Modern Large Language Models (LLMs) have reached an industrial scale. For example, models like Qwen 3 are trained on 36 trillion tokens (approx. 27 trillion words), requiring massive data centers and thousands of H100 GPUs.

Data Scale: At roughly 4 bytes per token, 36 trillion tokens amount to about 144 terabytes of raw text, equivalent to 90 billion sheets of paper stacked 9,000 km high.
Compute Cost: Training a state-of-the-art model can cost upwards of $42 million in compute alone; the same computation would take hundreds of years on a single laptop.
Infrastructure: Sustaining this growth requires ever-larger data centers, and parts of the industry are even exploring space-based infrastructure.
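
The data-scale figures above can be checked with a few lines of arithmetic. The per-token, per-sheet, and thickness constants below are assumptions chosen because they reproduce the numbers quoted in these notes; they are not stated as such in the source.

```python
# Back-of-the-envelope check of the data-scale figures in the notes.
TOKENS = 36e12              # from the notes: 36 trillion training tokens
WORDS_PER_TOKEN = 0.75      # assumption: implied by "approx. 27 trillion words"
BYTES_PER_TOKEN = 4         # assumption: about 4 bytes of raw text per token
BYTES_PER_SHEET = 1600      # assumption: implied by 144 TB spread over 90 billion sheets
SHEET_THICKNESS_M = 1e-4    # assumption: 0.1 mm per sheet of paper

words = TOKENS * WORDS_PER_TOKEN
total_bytes = TOKENS * BYTES_PER_TOKEN
terabytes = total_bytes / 1e12
sheets = total_bytes / BYTES_PER_SHEET
stack_km = sheets * SHEET_THICKNESS_M / 1000

print(f"words:     {words:.3g}")
print(f"terabytes: {terabytes:.0f}")
print(f"sheets:    {sheets:.3g}")
print(f"stack:     {stack_km:.0f} km")
```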

2. Fundamental Mechanics of Language Models

Probabilistic View: A language model is a distribution over sequences. Using the Chain Rule of Probability, the joint probability of a sequence is decomposed into the product of conditional probabilities: $P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})$.
Tensor Representation: Words are mapped to continuous vectors (embeddings). The model performs a multi-class classification task over a vocabulary ($V$) to predict the next token.
Batching: To optimize training, sequences are processed in batches of shape $B \times T \times D$, where $B$ is the batch size, $T$ is the sequence length, and $D$ is the embedding dimension.
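
The three ideas above (chain-rule factorization, classification over $V$, and the $B \times T \times D$ batch shape) can be sketched together in plain Python. The tiny sizes, the random weights, and the "mean of prefix embeddings" context summary are illustrative assumptions standing in for a real network; only the overall structure mirrors the notes.

```python
import math
import random

random.seed(0)
V, D = 5, 4   # hypothetical vocabulary size and embedding dimension

# Embedding table (V x D) and output projection (D x V), randomly initialised.
E = [[random.gauss(0, 1) for _ in range(D)] for _ in range(V)]
W = [[random.gauss(0, 1) for _ in range(V)] for _ in range(D)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_probs(prefix):
    """P(w_i | w_1..w_{i-1}) as a classification over the V vocabulary entries."""
    if prefix:
        # Stand-in for a real model: summarise the prefix by its mean embedding.
        h = [sum(E[t][d] for t in prefix) / len(prefix) for d in range(D)]
    else:
        h = [0.0] * D   # empty context gives all-zero logits, i.e. a uniform distribution
    logits = [sum(h[d] * W[d][v] for d in range(D)) for v in range(V)]
    return softmax(logits)

def sequence_log_prob(tokens):
    """Chain rule: log P(w_1..w_n) = sum_i log P(w_i | w_1..w_{i-1})."""
    return sum(math.log(next_token_probs(tokens[:i])[tok])
               for i, tok in enumerate(tokens))

# A batch of B=2 token sequences of length T=3, embedded to shape B x T x D.
batch = [[0, 3, 1], [2, 2, 4]]
embedded = [[E[t] for t in seq] for seq in batch]
shape = (len(embedded), len(embedded[0]), len(embedded[0][0]))

print("batch shape (B, T, D):", shape)
print("log P([0, 3, 1]) =", sequence_log_prob([0, 3, 1]))
```

Because every conditional is a proper distribution over the vocabulary, the probabilities of all sequences of a fixed length sum to 1, which is a quick sanity check on any implementation of the factorization.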

3. Why Model Language?

Universal Task Solver: Many real-world tasks (coding, email writing, logical reasoning) are essentially sequence completion problems.
Multitask Learning: By training on diverse data (Wikipedia, code, math), the model learns to perform multiple tasks simultaneously without needing task-specific labels.
Scaling Efficiency: As demonstrated by the GPT-3 paper, simply increasing model size and data leads to "few-shot" learning capabilities, where the model performs tasks based on in-context examples without weight updates.
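
Few-shot learning as described above happens entirely in the prompt: the demonstrations are concatenated ahead of the query and the model simply continues the sequence. A minimal sketch of building such a prompt follows; the `Input:`/`Output:` template and the helper name `build_few_shot_prompt` are hypothetical choices, not a format prescribed by the lecture or by the GPT-3 paper.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Concatenate in-context demonstrations followed by the unanswered query."""
    blocks = [instruction] if instruction else []
    for x, y in examples:
        blocks.append(f"Input: {x}\nOutput: {y}")
    # The final block is left open so the model's continuation is the answer.
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)

# Hypothetical demonstrations for an English-to-French task.
demos = [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")]
prompt = build_few_shot_prompt(demos, "cheese",
                               instruction="Translate English to French.")
print(prompt)
```

No weights change here; the same pre-trained model handles a different task simply by being given different demonstrations.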

4. Architectural Evolution: From MLP to Transformers

MLP Limitations: Multi-Layer Perceptrons fail to scale for three reasons: their weights are fixed per input position rather than computed dynamically from the input, their parameter count grows with sequence length, and they cannot efficiently reuse computation across positions.
Transformer Advantages: The Transformer architecture (introduced in "Attention Is All You Need") uses attention mechanisms to dynamically weigh the importance of different input positions, allowing for better scalability and computation reuse.
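
The "dynamic weighting" performed by attention can be sketched as single-head scaled dot-product attention with a causal mask, so each position attends only to itself and earlier positions. This is a simplified illustration under stated assumptions: the queries, keys, and values are taken to be the raw inputs (a real Transformer applies learned projection matrices and uses multiple heads), and the toy input is made up.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Scaled dot-product attention for one head; Q, K, V are T x d lists of lists."""
    d = len(Q[0])
    T = len(Q)
    outputs, weights = [], []
    for t, q in enumerate(Q):
        # Causal mask: position t only scores positions 0..t.
        scores = [sum(qi * ki for qi, ki in zip(q, K[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        w = softmax(scores)
        # Output is the attention-weighted average of the visible value vectors.
        out = [sum(w[s] * V[s][j] for s in range(t + 1)) for j in range(len(V[0]))]
        weights.append(w + [0.0] * (T - t - 1))   # pad masked positions with zeros
        outputs.append(out)
    return outputs, weights

# Toy input: T=3 positions, d=2 features. Assumption: Q = K = V = X.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_attention(X, X, X)
for row in weights:
    print([round(w, 3) for w in row])
```

The attention weights depend on the input itself, which is the contrast with an MLP's fixed weights. The same function also works for any sequence length without adding parameters.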

5. The Training Pipeline

Pre-training: Uses massive web-scale datasets (e.g., filtered and curated subsets of Common Crawl) to instill general knowledge via next-token prediction.
Post-training (Instruction Tuning): Transforms the model from an "autocomplete engine" into a helpful assistant.
- Supervised Fine-Tuning (SFT): Training on high-quality question-answer pairs.
- Reward Modeling & RL: Using human feedback to train a reward model, then using reinforcement learning (policy gradients) to optimize the model to align with human preferences.
Safety Tuning: A critical post-training step to prevent harmful outputs (e.g., bomb-making instructions). This is an ongoing "cat-and-mouse" game, as users find "jailbreaks" (e.g., role-playing as a grandmother) to bypass safety filters.
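
The reward-modeling and RL step above can be illustrated with a deliberately tiny example: a softmax policy over three candidate responses, each with a fixed score standing in for a reward model's output. The scores, step count, and learning rate are hypothetical, and the sketch uses the exact expected policy gradient (which sampled estimators such as REINFORCE approximate) so that it stays deterministic. Production RLHF pipelines operate on token sequences and add further machinery that is out of scope here.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_reward(probs, rewards):
    return sum(p * r for p, r in zip(probs, rewards))

def policy_gradient_ascent(rewards, steps=500, lr=0.5):
    """Gradient ascent on expected reward for a softmax policy over a few responses."""
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(logits)
        baseline = expected_reward(probs, rewards)
        # For a softmax policy, d E[r] / d logit_i = p_i * (r_i - E[r]).
        logits = [z + lr * p * (r - baseline)
                  for z, p, r in zip(logits, probs, rewards)]
    return softmax(logits)

# Hypothetical reward-model scores for three candidate responses to one prompt.
reward_scores = [0.1, 0.9, 0.3]
initial_probs = softmax([0.0] * len(reward_scores))
final_probs = policy_gradient_ascent(reward_scores)

print("initial expected reward:", expected_reward(initial_probs, reward_scores))
print("final expected reward:  ", expected_reward(final_probs, reward_scores))
print("final policy:", [round(p, 3) for p in final_probs])
```

The policy shifts probability mass toward the response the reward model scores highest, which is the basic mechanism by which preferences encoded in the reward model shape the assistant's behavior.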

6. Systems and Efficiency

Quantization: Reducing the precision of weights (e.g., 4-bit or 2-bit) to fit massive models into limited GPU memory.
Parallelism: Techniques like model sharding (splitting layers or matrices across GPUs) and data parallelism are essential for training models with trillions of parameters.
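Kernel Fusion: Optimizing hardware usage by combining several small operations into a single GPU kernel (as Flash Attention does for the attention computation), so that intermediate results are not written to and re-read from slow memory, which reduces memory bandwidth bottlenecks.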
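
As a rough illustration of the quantization idea above, the following sketch rounds a list of weights to signed 4-bit integers with a single shared scale and then reconstructs approximate values. The weights and function names are made up, and one scale per tensor is a simplifying assumption; practical schemes often use a separate scale per small group of weights.

```python
def quantize_symmetric(weights, bits=4):
    """Map floats to signed integers in [-2^(bits-1), 2^(bits-1) - 1] with one scale."""
    qmax = 2 ** (bits - 1) - 1                  # 7 for signed 4-bit
    largest = max(abs(w) for w in weights)
    scale = largest / qmax if largest > 0 else 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]

# Hypothetical weights from one small slice of a layer.
weights = [0.12, -0.53, 0.98, -1.40, 0.07, 0.66]
codes, scale = quantize_symmetric(weights, bits=4)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:   ", codes)
print("restored:", [round(r, 3) for r in restored])
print("max abs error:", max_error)
```

Storing 4-bit codes instead of 16-bit floats cuts weight memory by about 4x, at the cost of a rounding error of at most half the scale per weight.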

Synthesis and Conclusion

Language models have evolved from simple statistical counting tables to massive, unified neural systems that define the current AI landscape. The "recipe" for success (scaling compute, data, and parameters) has proven remarkably effective. While closed-source "frontier" models currently lead in performance, "open-weight" models are rapidly closing the gap and democratizing access to powerful AI. The field is now shifting toward complex challenges such as AI safety, multimodality (integrating vision and audio), and the societal implications of these powerful, industrialized artifacts.