Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 9 - Recap & Current Trends

Key Concepts

Evolution of Language Models: From tokenization and word embeddings (word2vec) to RNNs, self-attention, and the Transformer architecture.
Transformer Architecture: Encoder-decoder structure, positional encoding (RoPE), and variations (BERT, GPT, T5).
LLM Training: Pre-training, Supervised Fine-Tuning (SFT), and Preference Tuning (RL with PPO/GRPO).
RAG & Tool Calling: Enhancing LLMs with external knowledge and API interaction.
Diffusion Models: Adapting noise-based image generation to text via masked diffusion (MDM, DLLM).
LLM Evaluation: Challenges of traditional metrics and the rise of LLM-as-a-judge.
Emerging Trends: Vision Transformers, diffusion models, and ongoing research in data curation, continuous learning, and hardware optimization.

Historical Progression of Language Models

The lecture began by tracing the development of language models, starting with the first challenge of processing text: tokenization, which breaks the input into manageable units. Subword-level tokenizers proved effective because they allow word roots to be reused across related words. Early approaches used word embeddings such as word2vec, which learn each word's vector from the contexts it appears in during training. The resulting embeddings are static, however: a word receives the same representation regardless of its surrounding text. This limitation led to the adoption of Recurrent Neural Networks (RNNs), which maintain an internal state summarizing the sequence seen so far. RNNs in turn struggled with long-range dependencies, which hindered their ability to retain information over extended sequences. The core solution presented throughout the course is self-attention, which connects every token directly to every other token regardless of position. Self-attention uses query, key, and value vectors: query-key similarity (a dot product, scaled and passed through a softmax) weights the values, expressed mathematically as softmax(Q K^T / sqrt(d_k)) V.
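
The attention formula above can be sketched in a few lines of plain Python. This is a minimal single-head illustration without masking or learned projections; the function names softmax and attention and the list-of-lists layout are choices made for this sketch, not something taken from the lecture.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q is n_q x d_k, K is n_k x d_k, V is n_k x d_v (lists of lists).
    Returns an n_q x d_v list of lists.
    """
    d_k = len(K[0])
    d_v = len(V[0])
    out = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output row is the attention-weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)])
    return out

# Usage: one query that matches the first key more closely than the second.
result = attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
```

Because the softmax weights sum to one, each output row is a convex combination of the value vectors, which is why every token can draw on any other token in a single step.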

The Transformer Architecture & its Refinements

The Transformer architecture, with its encoder and decoder, emerged as the foundation for modern LLMs and was first applied to machine translation in 2017. Early Transformers used absolute positional encoding, but the lecture emphasized the benefits of relative position, leading to the adoption of Rotary Position Embeddings (RoPE), which rotate query and key vectors so that their dot product depends on the distance between tokens. Further improvements included Grouped Query Attention (GQA), which shares each key/value head across a group of query heads and so reduces the number of key and value projections, and changes to the placement of normalization layers (post-norm vs. pre-norm). Transformer-derived models are categorized as encoder-only (BERT) for embedding and classification, decoder-only (GPT) for autoregressive text generation, and encoder-decoder (T5) for text-to-text tasks. In the course, Large Language Models (LLMs) are defined specifically as Transformer-based, decoder-only models.
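
To make the RoPE idea concrete, here is a minimal sketch of the rotation applied to a single vector. It assumes the adjacent-pair layout (dimensions 0 and 1 rotated together, then 2 and 3, and so on) and the commonly used base of 10000; real implementations differ in how they pair dimensions, and the names rope and dot are chosen for this sketch only.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs of dimensions of x by position-dependent angles.

    Pair j (dimensions 2j and 2j+1) is rotated by pos * base**(-2j / d).
    """
    d = len(x)
    assert d % 2 == 0, "RoPE needs an even number of dimensions"
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # i = 2j, so this equals base**(-2j/d)
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Usage: the query/key score depends only on the offset between positions.
q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.1]
score_near_start = dot(rope(q, 5), rope(k, 2))    # offset of 3
score_far_along = dot(rope(q, 13), rope(k, 10))   # same offset of 3
```

The two scores agree up to floating-point error because rotating both vectors by the same extra angle leaves their dot product unchanged; only the relative offset survives.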

Scaling LLMs & Training Techniques

Scaling LLMs presented computational challenges, leading to the exploration of Mixture of Experts (MoE), where only a subset of experts is activated for each token, reducing the compute per token. MoE implementations route tokens to different experts, which can be processed in parallel across GPUs. LLMs are trained in three stages: pre-training (on massive datasets of trillions of tokens, with a common rule of thumb of roughly 20 training tokens per parameter), Supervised Fine-Tuning (SFT) (adapting to specific tasks), and Preference Tuning (aligning with human preferences using pairwise comparison data). Preference tuning uses Reinforcement Learning (RL), treating the LLM as a policy, the input as a state, and token prediction as an action. A reward model, trained using the Bradley-Terry formulation, predicts which of two outputs is preferred. Proximal Policy Optimization (PPO) was used initially, but Group Relative Policy Optimization (GRPO) is gaining traction because it is cheaper to run (it eliminates the value function, scoring each sampled response relative to the others in its group) and works well for tasks with verifiable rewards.
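
Two of the quantities in this paragraph are short enough to write out. The sketch below shows the Bradley-Terry pairwise loss for a reward model, -log sigmoid(r_chosen - r_rejected), and the group-relative advantage that GRPO uses in place of a learned value function. The function names are hypothetical, and the use of the population standard deviation plus a small eps is an assumption of this sketch; implementations vary on those details, and the clipped policy update itself is not shown.

```python
import math
from statistics import mean, pstdev

def bradley_terry_loss(r_chosen, r_rejected):
    """Pairwise reward-model loss: -log sigmoid(r_chosen - r_rejected).

    Written in a numerically stable form for large positive or negative gaps.
    """
    diff = r_chosen - r_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one group of
    responses sampled for the same prompt (no value network needed)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]

# Usage: four sampled answers to one prompt, scored by a verifiable check
# (1.0 = correct, 0.0 = incorrect).
advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])
loss = bradley_terry_loss(r_chosen=2.0, r_rejected=0.5)
```

Responses that beat their group's average receive a positive advantage and the rest a negative one, which is what lets GRPO skip training a separate value function.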

Enhancing LLMs: RAG, Tool Calling & Evaluation

Retrieval Augmented Generation (RAG) enhances LLMs by retrieving relevant documents from a knowledge base in two stages: candidate retrieval (semantic search with bi-encoders) followed by re-ranking (cross-encoders). Tool calling enables LLMs to interact with external APIs by identifying the appropriate API and arguments, executing the call, and incorporating the results into the response. Modern agentic workflows combine RAG and tool calling. Evaluating LLMs is complex: traditional metrics (BLEU, ROUGE, METEOR) are limited because they measure surface overlap with reference texts. LLM-as-a-judge uses an LLM as the evaluator, producing a rationale and a score, but requires addressing its biases (position, verbosity, self-enhancement). Benchmarks are used to assess performance across knowledge, reasoning, coding, and safety.
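
The retrieve-then-re-rank structure can be sketched without any model at all. In the toy below, embed is a bag-of-words stand-in for a bi-encoder and cross_score is a word-overlap stand-in for a cross-encoder; both are hypothetical placeholders invented for this sketch, and only the control flow (a cheap independent scoring pass over all documents, then a more expensive joint scoring pass over the shortlist) reflects the approach described above.

```python
import math
from collections import Counter

def embed(text):
    """Stand-in for a bi-encoder: encodes a text on its own (hypothetical)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k):
    """Stage 1: candidate retrieval by embedding similarity."""
    q_vec = embed(query)
    doc_vecs = [embed(d) for d in docs]  # in practice these are precomputed
    order = sorted(range(len(docs)), key=lambda i: cosine(q_vec, doc_vecs[i]), reverse=True)
    return order[:k]

def cross_score(query, doc):
    """Stand-in for a cross-encoder: scores the (query, doc) pair jointly
    (hypothetical). Here: fraction of query words that appear in the doc."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0

def rerank(query, docs, candidates, top_n):
    """Stage 2: re-score only the shortlisted candidates."""
    order = sorted(candidates, key=lambda i: cross_score(query, docs[i]), reverse=True)
    return order[:top_n]

# Usage with a three-document toy knowledge base.
docs = [
    "rope rotates query and key vectors",
    "bleu and rouge are overlap metrics",
    "mixture of experts routes tokens to experts",
]
query = "how does rope encode position in query and key vectors"
shortlist = retrieve(query, docs, k=2)
best = rerank(query, docs, shortlist, top_n=1)
```

The split matters for cost: document embeddings can be computed once and searched quickly, while the pairwise scorer has to run once per (query, document) pair and is therefore reserved for the shortlist.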

Diffusion Models: A New Paradigm for LLMs

The lecture transitioned to diffusion models, initially developed for image generation, and their adaptation to text. Image diffusion starts with noise and transforms it into an image, an approach preferred over autoregressive methods because of the high dimensionality of images. The process is likened to sculpting: removing noise to reveal the desired form. Training involves two steps: adding noise to clean images, and then learning to reverse this process. Applying diffusion to text is harder because tokens are discrete. The solution is to use a mask token as the equivalent of noise, masking tokens in the forward process and learning to "unmask" them in the reverse process. This led to the development of Masked Diffusion Models (MDM) and Diffusion-based LLMs (DLLM). Diffusion in text is analogous to drafting and refining a speech. Diffusion models can offer faster decoding than autoregressive models because the number of forward passes is determined by the number of diffusion steps rather than by the sequence length.
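
The masking and unmasking loop can be sketched as follows. The forward process masks each token independently with probability t; the reverse process starts from an all-mask sequence and calls the model once per step, revealing a share of the positions each time. Here toy_denoiser is a hypothetical stand-in for a trained model, and revealing the most confident predictions first is one possible unmasking schedule assumed for this sketch, not necessarily the one used in the lecture.

```python
import random

MASK = "[MASK]"

def forward_mask(tokens, t, rng):
    """Forward process: replace each token with MASK independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]

def toy_denoiser(seq, target):
    """Hypothetical stand-in for the trained model: a single call returns a
    (token, confidence) prediction for every position in parallel."""
    return [(target[i], 1.0 / (1 + i)) for i in range(len(seq))]

def reverse_unmask(length, num_steps, denoiser):
    """Reverse process: start fully masked, reveal tokens over num_steps calls."""
    seq = [MASK] * length
    calls = 0
    for step in range(num_steps):
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        if not masked:
            break
        preds = denoiser(seq)  # one forward pass covers all positions
        calls += 1
        # Spread the remaining masked positions evenly over the remaining steps.
        n_reveal = -(-len(masked) // (num_steps - step))  # ceiling division
        most_confident = sorted(masked, key=lambda i: preds[i][1], reverse=True)
        for i in most_confident[:n_reveal]:
            seq[i] = preds[i][0]
    return seq, calls

# Usage: a 7-token sequence decoded in 3 model calls instead of 7.
target = "the model drafts then refines its answer".split()
noisy = forward_mask(target, t=0.5, rng=random.Random(0))
decoded, calls = reverse_unmask(len(target), 3, lambda s: toy_denoiser(s, target))
```

The model-call count is bounded by num_steps regardless of sequence length, which is the source of the decoding speedup claimed above; an autoregressive decoder would need one call per generated token.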

Current Research & Future Directions

While initially lagging, masked diffusion models are approaching the performance of state-of-the-art LLMs. Ongoing research focuses on data curation to prevent "model collapse," continuous learning, addressing hallucinations (inherent to next-token prediction), hardware optimization, democratization of agents, improving agent reliability, and developing AI-powered customer service. Key technical terms discussed included Gaussian distribution, autoregressive models, RMSNorm, FlashAttention, and Muon. Resources for further learning include the LLaDA paper, arXiv, NeurIPS, Hugging Face Trending Papers, Twitter/X, and the YouTube channels of Yannic Kilcher and Andrej Karpathy.

Conclusion

The lecture provided a comprehensive overview of the evolution of language models, from foundational concepts like tokenization and word embeddings to cutting-edge techniques like diffusion models. The shift from RNNs to self-attention and the Transformer architecture has been pivotal, enabling the development of powerful LLMs. Current research is focused on scaling these models, improving their training, enhancing their capabilities with RAG and tool calling, and exploring alternative paradigms like diffusion models to overcome the limitations of autoregressive approaches. The field is rapidly evolving, demanding continuous learning and adaptation to new advancements.