Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 1: Overview, Tokenization

By Stanford Online

Key Concepts

  • From-Scratch Philosophy: Building language models (LMs) from the ground up to understand fundamental mechanics, rather than relying on high-level abstractions.
  • Scaling Laws: Empirical relationships between compute budget, model size, and data size that allow for predicting performance at larger scales.
  • Efficiency: The core metric of the course; how much model quality is obtained per unit of input resources (compute, data, memory), critical for large-scale training where compute costs are immense.
  • Byte Pair Encoding (BPE): A data-driven tokenization algorithm that merges frequent byte pairs to create a vocabulary, balancing sequence length and vocabulary size.
  • Resource Accounting: Tracking FLOPs (Floating Point Operations) and memory usage to identify bottlenecks.
  • Roofline Analysis: A methodology to determine if a computation is bottlenecked by memory bandwidth or compute throughput.
  • Alignment: Using weak supervision (e.g., RLHF, DPO) to refine models after pre-training.
  • Emergence: The phenomenon where models exhibit new capabilities (e.g., in-context learning) only after reaching a critical scale.

1. Course Philosophy and Objectives

The course, CS 336: Language Models from Scratch, emphasizes that while modern AI allows for "zero-shot" prompting, fundamental research requires a deep understanding of the entire stack. The instructors argue that abstractions are "leaky," and to push the boundaries of AI, one must understand the underlying systems, architectures, and training dynamics.

  • The "Bitter Lesson": The instructors clarify that this does not mean algorithms don't matter; rather, it means algorithms that scale are what matter.
  • Three Pillars of Knowledge:
    1. Mechanics: How transformers, parallelism, and kernels work.
    2. Mindset: Profiling, benchmarking, and optimizing for efficiency.
    3. Intuitions: Data and modeling decisions (often gained through experimentation).

2. Course Structure and Logistics

The course is divided into five parts, corresponding to five intensive assignments:

  1. Basics: Tokenization, architecture, and training loops.
  2. Systems: Kernels, parallelization, and inference optimization.
  3. Scaling Laws: Predicting performance and hyperparameter transfer.
  4. Data: Curation, filtering, deduplication, and synthetic data.
  5. Alignment: RLHF, DPO, and GRPO.
  • AI Policy: Students are encouraged to use AI agents for tutoring and debugging, provided they use a specific "pedagogically minded" prompt to avoid bypassing the learning objectives.
  • Compute: The course utilizes Modal for cloud compute, allowing students to perform actual training runs and benchmarking.

3. Technical Deep Dive: Tokenization

Tokenization is the process of converting raw text (bytes) into sequences of integer token IDs, and mapping those IDs back to text when decoding.
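As a concrete illustration of the byte-level starting point, a string can be mapped to integers simply by taking its UTF-8 bytes; this toy sketch (not the course's reference code) shows the round trip:

```python
# Byte-level "tokenization": every string maps to integers in [0, 255].
text = "hello 🌍"
token_ids = list(text.encode("utf-8"))
print(token_ids)                          # [104, 101, 108, 108, 111, 32, 240, 159, 140, 141]
print(bytes(token_ids).decode("utf-8"))   # round-trips back to "hello 🌍"
```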

  • The Problem: Character-level tokenization is inefficient (too many tokens), while word-level tokenization suffers from an unbounded vocabulary and "unknown" (UNK) token issues.
  • BPE Methodology (a minimal code sketch follows this list):
    1. Start with a byte-level representation.
    2. Iteratively count the frequency of adjacent token pairs.
    3. Merge the most frequent pair into a new token.
    4. Repeat until the desired vocabulary size is reached.
  • Goal: Achieve a high compression ratio (bytes per token) to keep sequences short, since the cost of transformer attention grows quadratically with sequence length ($O(n^2)$).
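A minimal, unoptimized sketch of the merge loop above, assuming a toy in-memory corpus and ignoring pre-tokenization and special tokens (so not the assignment's reference implementation):

```python
from collections import Counter

def train_bpe(text: str, num_merges: int):
    """Greedy byte-pair merging: repeatedly fuse the most frequent adjacent pair."""
    ids = list(text.encode("utf-8"))           # start from raw bytes (vocab 0..255)
    merges = {}                                # (id, id) -> new token id
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))     # count adjacent token pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)       # most frequent pair (ties broken arbitrarily)
        merges[best] = next_id
        # Replace every occurrence of `best` with the new token id.
        new_ids, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == best:
                new_ids.append(next_id)
                i += 2
            else:
                new_ids.append(ids[i])
                i += 1
        ids = new_ids
        next_id += 1
    return merges, ids

text = "the cat sat on the mat, the cat sat"
merges, ids = train_bpe(text, num_merges=10)
print(f"compression ratio: {len(text.encode('utf-8')) / len(ids):.2f} bytes/token")
```

Real tokenizers pre-tokenize with a regex, update pair counts incrementally instead of rescanning, and break frequency ties deterministically, but the greedy merge loop is the core of the algorithm.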

4. Systems and Hardware

The course emphasizes that memory movement is the primary bottleneck in modern GPU computing.
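A roofline-style back-of-the-envelope makes this concrete: compare an operation's arithmetic intensity (FLOPs per byte moved) to the hardware's compute-to-bandwidth ratio. The numbers below are illustrative assumptions for an A100-class GPU, not measured figures:

```python
# Roofline back-of-the-envelope (illustrative, assumed hardware numbers).
peak_flops = 312e12          # assumed BF16 tensor-core throughput, FLOP/s
peak_bandwidth = 1.5e12      # assumed HBM bandwidth, bytes/s
machine_balance = peak_flops / peak_bandwidth   # ~208 FLOPs per byte

def intensity_matmul(m, n, k, bytes_per_elem=2):
    flops = 2 * m * n * k                                   # multiply-adds
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # read A, B; write C
    return flops / bytes_moved

# Large square matmul: intensity >> machine balance, so compute-bound.
print(intensity_matmul(4096, 4096, 4096))   # ~1365 FLOPs/byte
# Matrix-vector product (decode-time shape): ~1 FLOP/byte, memory-bound.
print(intensity_matmul(1, 4096, 4096))
```

Operations whose intensity falls below the machine balance (like the decode-time matrix–vector product) are limited by memory bandwidth no matter how fast the tensor cores are.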

  • Operator Fusion: Combining multiple operations into a single kernel to read data from High Bandwidth Memory (HBM) once, perform multiple computations, and write back once (see the sketch after this list).
  • Distributed Training: When scaling to thousands of GPUs, the challenge shifts to orchestrating data movement via collective operations like all-reduce and gather.
  • Inference: Divided into prefill (processing the prompt) and decode (generating tokens one by one). The latter is memory-bound, necessitating techniques like speculative decoding and quantization.
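One lightweight way to see fusion, assuming PyTorch 2.x and a CUDA GPU, is to let `torch.compile` fuse a chain of elementwise ops that would otherwise each make their own round trip to HBM; this is an illustrative sketch, not a course benchmark:

```python
import torch

def gelu_bias_dropout(x, bias):
    # Three elementwise ops; in eager mode each reads and writes HBM separately.
    y = torch.nn.functional.gelu(x + bias)
    return torch.nn.functional.dropout(y, p=0.1)

fused = torch.compile(gelu_bias_dropout)    # let the compiler fuse the chain into one kernel

x = torch.randn(4096, 4096, device="cuda")
bias = torch.randn(4096, device="cuda")
out = fused(x, bias)                        # first call compiles; later calls reuse the kernel
```

The same reasoning motivates hand-written fused kernels; the compiled version simply automates the "read once, compute a lot, write once" pattern.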

5. Scaling Laws and Research

  • The Chinchilla Rule of Thumb: For compute-optimal training, one should train on roughly 20 tokens per parameter (a worked example follows this list).
  • Predictability: The goal of scaling laws is not just optimality, but predictability. By running small-scale experiments, researchers can fit curves to project the loss at a target scale (e.g., $10^{25}$ FLOPs), which is essential for justifying massive compute investments.
  • Hyperparameter Transfer: Models must be parameterized such that optimal hyperparameters at small scales remain effective or predictable at larger scales.
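A worked example of the Chinchilla rule, using the standard approximation that training costs about $6ND$ FLOPs for $N$ parameters and $D$ tokens (the budget below is an assumption for illustration):

```python
# Compute-optimal allocation under the ~20 tokens-per-parameter rule of thumb.
C = 1e23                      # assumed compute budget, in FLOPs
# With C ≈ 6 * N * D and D ≈ 20 * N, we get C ≈ 120 * N**2.
N = (C / 120) ** 0.5          # compute-optimal parameter count
D = 20 * N                    # compute-optimal token count
print(f"params ≈ {N:.2e}, tokens ≈ {D:.2e}, sanity check 6ND ≈ {6 * N * D:.2e}")
# params ≈ 2.89e+10 (~29B), tokens ≈ 5.77e+11 (~577B)
```

The same small-scale runs that calibrate such rules also serve the predictability goal: fit a power law to losses at modest budgets, then extrapolate to the target budget before committing to the full run.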

6. Synthesis and Conclusion

The instructors conclude that while the "ChatGPT era" has shifted the focus toward agents and multi-modality, the fundamentals—transformers, gradient-based optimization, and GPU kernels—remain the bedrock of the field. The course aims to equip students with the "engineering muscles" to navigate the trade-offs between model expressivity, training stability, and hardware efficiency. The ultimate takeaway is that building a successful language model is a balancing act: maximizing efficiency within a fixed resource budget while ensuring the model is stable enough to reach convergence.
