Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 1 - Diffusion
Key Concepts
- Diffusion Models: Generative models that learn to reverse a noise-adding process to generate data from Gaussian noise.
- Forward Process ($Q$): A Markovian process that gradually adds Gaussian noise to clean data ($x_0$) until it becomes pure noise ($x_T$).
- Reverse Process ($P_\theta$): The learned model that denoises the data step-by-step to recover the original distribution.
- Noise Schedule ($\beta_t$): The variance schedule controlling how much noise is added at each step $t$.
- ELBO (Evidence Lower Bound): A mathematical objective used to optimize the model by maximizing a lower bound of the log-likelihood of the data.
- DDPM (Denoising Diffusion Probabilistic Models): The foundational framework for modern diffusion, utilizing a stochastic reverse process.
- DDIM (Denoising Diffusion Implicit Models): A variant that enables faster sampling by making the reverse process deterministic, allowing for "step-skipping."
- FID (Fréchet Inception Distance): A metric used to evaluate the quality and diversity of generated images.
1. Introduction and Course Objectives
The course, CME 296, focuses on the paradigms, training, and evaluation of image generation models. The instructors emphasize that the field has evolved from generating low-resolution, black-and-white images (2014) to high-resolution, photorealistic color images.
- Goals: Understand generation paradigms (Diffusion, Score/Flow matching), model architectures (Transformers/UNets), and evaluation metrics.
- Prerequisites: Linear algebra (vectors, matrices, gradients), probability theory (Bayes' rule, Gaussian distributions, covariance), differential equations (ODEs/SDEs), and basic machine learning (training/inference).
2. The Diffusion Paradigm
The core intuition is to start from a simple distribution (Gaussian noise) and iteratively refine it into a clean image.
- Why Noise? It is easy to sample, provides randomness for diversity, and possesses mathematical properties (Gaussian) that simplify the training objective.
- Representation: Images are treated as vectors of pixel values (RGB). While discrete (0–255), they are treated as continuous floats during training.
- Forward Process ($Q$): Defined as $q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}x_{t-1}, \beta_t I)$.
- Closed-form Sampling: Using the property that the sum of independent Gaussians is Gaussian, one can sample $x_t$ directly from $x_0$ with $\bar{\alpha}_t = \prod_{i=1}^{t}(1-\beta_i)$: $x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon$ (see the sketch after this list).
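As a concrete illustration of the closed-form sampler above, here is a minimal PyTorch sketch. The linear schedule endpoints ($10^{-4}$ to $0.02$ over $T = 1000$ steps) follow the common DDPM convention and are an assumption here, not values stated in the lecture.

```python
import torch

# Assumed linear beta schedule (common DDPM convention, not from the lecture).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # beta_t: per-step noise variance
alphas = 1.0 - betas                         # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)    # alpha_bar_t = prod_{i<=t} alpha_i

def q_sample(x0: torch.Tensor, t: torch.Tensor, eps: torch.Tensor) -> torch.Tensor:
    """Draw x_t ~ q(x_t | x_0) in one shot:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    ab = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over image dims
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps

# Usage: noise a batch of 8 stand-in "images" at random timesteps.
x0 = torch.randn(8, 3, 32, 32)
t = torch.randint(0, T, (8,))
xt = q_sample(x0, t, torch.randn_like(x0))
```

Note that no neural network is involved yet: once the schedule is fixed, the forward process is fully determined.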
3. Training and the ELBO
Training aims to maximize the log-likelihood of the training data, but that likelihood is intractable to compute directly.
- The ELBO Trick: By using Jensen’s Inequality, the instructors derive a lower bound that involves the KL divergence between the true posterior $q(x_{t-1} | x_t, x_0)$ and the model's prediction $p_\theta(x_{t-1} | x_t)$.
- Tractability: Because both the forward process and the model are assumed to be Gaussian, the KL divergence simplifies to an L2 regression loss on the noise: $L_{\text{DDPM}} = \mathbb{E}_{t, x_0, \epsilon}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]$.
- Interpretation: The model $\epsilon_\theta$ is trained to predict the noise $\epsilon$ that was added to the image $x_0$ at time $t$.
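Under those assumptions, the whole training step collapses to a few lines. A hedged sketch, reusing `alpha_bars` from the snippet above and assuming `model(x_t, t)` returns a noise estimate with the same shape as its input:

```python
import torch
import torch.nn.functional as F

def ddpm_loss(model, x0: torch.Tensor, alpha_bars: torch.Tensor) -> torch.Tensor:
    """Simple DDPM loss: regress the model's noise prediction onto the true eps."""
    b = x0.shape[0]
    t = torch.randint(0, alpha_bars.numel(), (b,), device=x0.device)  # uniform timestep
    eps = torch.randn_like(x0)                                        # true noise
    ab = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))
    xt = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps   # closed-form forward sample
    return F.mse_loss(model(xt, t), eps)            # L2 between eps and eps_theta
```

Despite the probabilistic derivation, the loop is ordinary supervised regression: sample a timestep, corrupt the image, and penalize the squared error of the noise estimate.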
4. Inference and Acceleration (DDIM)
Standard DDPM requires $T$ steps (often 1,000), making inference slow.
- The Problem: DDPM is stochastic at every step, which prevents skipping steps without significant quality degradation.
- The DDIM Solution: By defining a non-Markovian forward process that yields the same marginals as DDPM but allows for a deterministic reverse process ($\sigma=0$), the model can skip steps.
- Efficiency: By choosing a sequence of $S$ steps (where $S \ll T$), one can achieve significant speedups (10x–100x) with minimal loss in image quality (measured by FID).
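A minimal sketch of the deterministic ($\sigma = 0$) DDIM update, again assuming `model(x_t, t)` predicts the noise and reusing `alpha_bars` from above; the evenly strided timestep subsequence is one common way to pick the $S$ steps, not necessarily the lecture's choice.

```python
import torch

@torch.no_grad()
def ddim_sample(model, shape, alpha_bars: torch.Tensor, steps: int = 50) -> torch.Tensor:
    """Deterministic DDIM sampling over S << T strided timesteps (sigma = 0)."""
    T = alpha_bars.numel()
    ts = torch.linspace(T - 1, 0, steps).long()   # S timesteps out of T
    x = torch.randn(shape)                        # start from pure Gaussian noise
    for i, t in enumerate(ts):
        ab_t = alpha_bars[t]
        ab_prev = alpha_bars[ts[i + 1]] if i + 1 < steps else torch.tensor(1.0)
        t_batch = torch.full((shape[0],), int(t), dtype=torch.long)
        eps = model(x, t_batch)                                      # eps_theta(x_t, t)
        x0_pred = (x - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()        # implied x_0
        x = ab_prev.sqrt() * x0_pred + (1 - ab_prev).sqrt() * eps    # jump to t_prev
    return x
```

Because each update is deterministic given the noise prediction, jumping from $t$ to a much earlier $t_{\text{prev}}$ stays on approximately the same trajectory, which is what makes the 10x–100x speedup possible.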
5. Synthesis and Takeaways
- Mathematical Rigor: The field relies heavily on probability theory and calculus; however, the final training objective is a straightforward L2 regression.
- Intuition: The "sculpture" analogy—starting from a block of noise and carving away the unnecessary parts—is the fundamental mechanism of diffusion.
- Actionable Insight: To optimize for speed in production, move from stochastic models (DDPM) to implicit, deterministic models (DDIM) to reduce the number of required model evaluations.