Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 2 - Score matching

By Stanford Online

Key Concepts

  • Score Matching: A generative modeling paradigm that learns the gradient of the log-probability density function ($\nabla_x \log p(x)$) to guide samples toward high-density regions.
  • Langevin Dynamics: A stochastic sampling method that uses the score function plus noise to explore and sample from a distribution.
  • Denoising Score Matching (DSM): A technique to estimate the score by adding Gaussian noise to data and learning to predict the score of the noisy distribution.
  • Annealed Langevin Dynamics (ALD): A strategy of using multiple noise levels (from high to low) to guide sampling from a simple distribution to the complex data distribution.
  • Stochastic Differential Equations (SDEs): A continuous-time framework for modeling the forward (noising) and reverse (denoising) processes.
  • Probability Flow ODE (PF-ODE): A deterministic ordinary differential equation that shares the same marginal probability distributions as the SDE, allowing for faster sampling.
  • NFE (Number of Function Evaluations): A metric used to measure the computational cost of sampling; minimizing this is critical for efficiency.

1. The Score Matching Paradigm

The core objective is to generate samples from an unknown, complex data distribution $p_{data}(x)$.

  • The Score Function: Defined as $\nabla_x \log p(x)$. It points toward regions of higher probability density.
  • Why use the Score? (see the autograd sketch after this list)
    • Tractability: Writing $p(x) = \frac{\tilde{p}(x)}{Z}$ with an unnormalized density $\tilde{p}(x)$, the score is $\nabla_x \log p(x) = \nabla_x \log \tilde{p}(x) - \nabla_x \log Z = \nabla_x \log \tilde{p}(x)$, since $\log Z$ is constant with respect to $x$. The intractable normalizing constant drops out entirely.
    • Stability: Working with $\nabla_x \log p(x)$ avoids the extreme dynamic range of $p(x)$ itself, which can underflow to zero in low-density regions.
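
The tractability point can be made concrete with autograd. The sketch below (a minimal illustration, not code from the lecture) computes the score of an unnormalized 1-D Gaussian mixture; $Z$ never enters the computation:

```python
import torch

# Minimal illustration: the score of an unnormalized 1-D Gaussian mixture.
# The normalizing constant Z never appears, because
# grad_x log(p~(x)/Z) = grad_x log p~(x) - grad_x log Z, and the second
# term vanishes (log Z is constant in x).
def unnormalized_log_density(x):
    comp1 = torch.exp(-0.5 * (x - 2.0) ** 2)   # mode at +2
    comp2 = torch.exp(-0.5 * (x + 2.0) ** 2)   # mode at -2
    return torch.log(comp1 + comp2)

def score(x):
    x = x.detach().requires_grad_(True)
    log_p = unnormalized_log_density(x).sum()
    return torch.autograd.grad(log_p, x)[0]    # d/dx log p(x)

x = torch.linspace(-4.0, 4.0, 9)
print(score(x))  # each entry points uphill, toward the nearest mode
```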

2. Denoising Score Matching (DSM)

Since the true score of $p_{data}$ is unknown, DSM approximates it by perturbing data with Gaussian noise.

  • Methodology: Given a data point $x$, we create a noisy version $\tilde{x} = x + \sigma\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. The score of the conditional distribution $q_\sigma(\tilde{x}|x)$ is known analytically: $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x}|x) = -\frac{\tilde{x}-x}{\sigma^2}$.
  • Objective: Train a neural network $s_\theta(\tilde{x})$ to minimize the L2 distance between the predicted score and this known conditional score (see the training sketch after this list).
  • Trade-off: Small $\sigma$ makes the noisy distribution close to $p_{data}$ but leads to poor score estimation in low-density regions. Large $\sigma$ provides better estimates but deviates significantly from $p_{data}$.
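
A minimal PyTorch sketch of the DSM objective at a single noise level; the two-layer `score_net` is a hypothetical placeholder for whatever architecture is actually used:

```python
import torch
import torch.nn as nn

# Hypothetical score network: s_theta(x_tilde) approximates the score of
# the sigma-perturbed data distribution.
score_net = nn.Sequential(nn.Linear(2, 128), nn.SiLU(), nn.Linear(128, 2))

def dsm_loss(x, sigma):
    """Denoising score matching loss at one noise level sigma."""
    eps = torch.randn_like(x)
    x_tilde = x + sigma * eps                  # perturb: x~ = x + sigma * eps
    target = -(x_tilde - x) / sigma**2         # known conditional score
    pred = score_net(x_tilde)
    return ((pred - target) ** 2).sum(dim=1).mean()

x = torch.randn(64, 2)                         # stand-in batch of "data"
loss = dsm_loss(x, sigma=0.1)
loss.backward()
```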

3. Annealed Langevin Dynamics (ALD)

To resolve this trade-off, ALD uses a decreasing sequence of noise levels $\sigma_1 > \sigma_2 > \dots > \sigma_L$.

  • Process: Start with a sample from a high-noise distribution (where the score provides a rough "compass" direction) and iteratively refine the sample at lower noise levels. This lets the model explore the space globally before focusing on local details; a sampling sketch follows.
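
A sketch of the ALD loop, assuming the step-size schedule from the original NCSN paper ($\alpha_i = \epsilon \cdot \sigma_i^2 / \sigma_L^2$) and a hypothetical noise-conditional model `score_net(x, sigma)`:

```python
import torch

@torch.no_grad()
def annealed_langevin(score_net, sigmas, n_steps=100, step_lr=2e-5,
                      shape=(64, 2)):
    """Sample by running Langevin dynamics at each noise level in turn."""
    x = torch.randn(shape) * sigmas[0]         # start from a wide Gaussian
    for sigma in sigmas:                       # sigma_1 > ... > sigma_L
        alpha = step_lr * (sigma / sigmas[-1]) ** 2   # per-level step size
        for _ in range(n_steps):
            noise = torch.randn_like(x)
            # Langevin update: drift along the score, plus exploration noise.
            x = x + alpha * score_net(x, sigma) + (2 * alpha) ** 0.5 * noise
    return x
```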

4. Continuous Formulation: SDEs and ODEs

The lecture unifies DDPM (Denoising Diffusion Probabilistic Models) and NCSN (Noise Conditional Score Networks) under a single SDE framework.

  • Forward SDE: $dx = f(x, t)dt + g(t)dw$, where $f$ is the drift and $g$ is the diffusion coefficient.
  • Reverse SDE: The process can be reversed to generate data: $dx = [f(x, t) - g(t)^2 \nabla_x \log p_t(x)]dt + g(t)d\bar{w}$, where time runs backward and $\bar{w}$ is a reverse-time Wiener process.
  • Probability Flow ODE: Dropping the stochastic term and halving the score coefficient gives a deterministic ODE, $dx = [f(x, t) - \frac{1}{2}g(t)^2 \nabla_x \log p_t(x)]dt$, that shares the same marginal distributions $p_t$ as the SDE. This enables deterministic sampling and the use of advanced numerical solvers (e.g., Runge-Kutta); both update rules are sketched below.
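
The contrast between the two samplers, sketched under the simplifying assumption of a variance-exploding process with $f(x, t) = 0$; `g` and `score` are hypothetical callables for the diffusion coefficient and the learned score:

```python
import torch

def reverse_sde_step(x, t, dt, g, score):
    # Euler-Maruyama step of dx = [f - g^2 * score] dt + g dw_bar,
    # integrated backward in time (dt < 0); here f = 0 by assumption.
    drift = -g(t) ** 2 * score(x, t)
    noise = torch.randn_like(x) * abs(dt) ** 0.5
    return x + drift * dt + g(t) * noise

def pf_ode_step(x, t, dt, g, score):
    # Euler step of dx = [f - (1/2) g^2 * score] dt: deterministic,
    # but with the same marginals p_t as the SDE.
    drift = -0.5 * g(t) ** 2 * score(x, t)
    return x + drift * dt
```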

5. Advanced Sampling: DPM-Solver

  • The Problem: Standard ODE solvers (like Euler) require many steps (high NFE) to maintain accuracy.
  • The Solution: Since the drift term $f(x, t)$ is linear in $x$ for common noise schedules, DPM-Solver uses the variation-of-constants formula to solve the linear part exactly and discretize only the non-linear part (the neural-network score approximation); the first-order update is sketched after this list.
  • Result: This significantly reduces the required NFE to achieve high-quality image generation without needing to retrain the model.
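
A sketch of the resulting first-order update (DPM-Solver-1) in the noise-prediction parameterization, where $\lambda_t = \log(\alpha_t/\sigma_t)$ is the half log-SNR; `alpha`, `sigma`, and `eps_net` are hypothetical callables standing in for the model's noise schedule and network:

```python
import math

def dpm_solver_1_step(x_s, s, t, alpha, sigma, eps_net):
    """One first-order DPM-Solver step from time s to time t (t < s).

    The linear drift is handled exactly via the alpha(t)/alpha(s) ratio
    (variation of constants); only the network term is discretized.
    """
    lam_s = math.log(alpha(s) / sigma(s))      # half log-SNR at s
    lam_t = math.log(alpha(t) / sigma(t))
    h = lam_t - lam_s                          # step size in log-SNR
    return (alpha(t) / alpha(s)) * x_s \
        - sigma(t) * math.expm1(h) * eps_net(x_s, s)
```

Each step costs a single network evaluation, which is why a handful of such steps can keep the NFE low.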

Synthesis/Conclusion

The transition from discrete diffusion (DDPM) to continuous SDE/ODE frameworks represents a major evolution in generative modeling. By viewing the generation process as a flow toward high-density regions guided by the score function, researchers have developed more efficient, deterministic, and high-quality sampling methods. The unification of these paradigms allows for the application of sophisticated mathematical tools from differential equations to improve the speed and fidelity of modern generative AI.
