Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 2 - Score matching

By Stanford Online

Key Concepts

  • Score Matching: A generative modeling paradigm that learns the gradient of the log-probability density function ($\nabla_x \log p(x)$) to guide samples toward high-density regions.
  • Langevin Dynamics: A stochastic sampling method that uses the score function plus noise to explore and sample from a distribution.
  • Denoising Score Matching (DSM): A technique to estimate the score by adding Gaussian noise to data and learning to predict the score of the noisy distribution.
  • Annealed Langevin Dynamics (ALD): A strategy of using multiple noise levels (from high to low) to guide sampling from a simple distribution to the complex data distribution.
  • Stochastic Differential Equations (SDEs): A continuous-time framework for modeling the forward (noising) and reverse (denoising) processes.
  • Probability Flow ODE (PF-ODE): A deterministic ordinary differential equation that shares the same marginal probability distributions as the SDE, allowing for faster sampling.
  • NFE (Number of Function Evaluations): A metric used to measure the computational cost of sampling; minimizing this is critical for efficiency.

1. The Score Matching Paradigm

The core objective is to generate samples from an unknown, complex data distribution $p_{data}(x)$.

  • The Score Function: Defined as $\nabla_x \log p(x)$. It points toward regions of higher probability density.
  • Why use the Score? (see the autograd sketch after this list)
    • Tractability: Writing $p(x) = \frac{\tilde{p}(x)}{Z}$ with an unnormalized density $\tilde{p}(x)$, the score is $\nabla_x \log p(x) = \nabla_x \log \tilde{p}(x) - \nabla_x \log Z = \nabla_x \log \tilde{p}(x)$, since $\log Z$ is constant with respect to $x$. The intractable normalizing constant drops out entirely.
    • Stability: Working with $\nabla_x \log p(x)$ avoids the extreme dynamic range of $p(x)$ itself, which can underflow to zero in low-density regions.
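
The tractability point can be made concrete with autograd. The sketch below (a minimal illustration, not code from the lecture) computes the score of an unnormalized 1-D Gaussian mixture; $Z$ never enters the computation:

```python
import torch

# Minimal illustration: the score of an unnormalized 1-D Gaussian mixture.
# The normalizing constant Z never appears, because
# grad_x log(p~(x)/Z) = grad_x log p~(x) - grad_x log Z, and the second
# term vanishes (log Z is constant in x).
def unnormalized_log_density(x):
    comp1 = torch.exp(-0.5 * (x - 2.0) ** 2)   # mode at +2
    comp2 = torch.exp(-0.5 * (x + 2.0) ** 2)   # mode at -2
    return torch.log(comp1 + comp2)

def score(x):
    x = x.detach().requires_grad_(True)
    log_p = unnormalized_log_density(x).sum()
    return torch.autograd.grad(log_p, x)[0]    # d/dx log p(x)

x = torch.linspace(-4.0, 4.0, 9)
print(score(x))  # each entry points uphill, toward the nearest mode
```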

2. Denoising Score Matching (DSM)

Since the true score of $p_{data}$ is unknown, DSM approximates it by perturbing data with Gaussian noise.

  • Methodology: Given a data point $x$, we create a noisy version $\tilde{x} = x + \sigma\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. The score of the conditional distribution $q_\sigma(\tilde{x}|x)$ is known analytically: $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x}|x) = -\frac{\tilde{x}-x}{\sigma^2}$.
  • Objective: Train a neural network $s_\theta(\tilde{x})$ to minimize the L2 distance between the predicted score and this known conditional score (see the training sketch after this list).
  • Trade-off: Small $\sigma$ makes the noisy distribution close to $p_{data}$ but leads to poor score estimation in low-density regions. Large $\sigma$ provides better estimates but deviates significantly from $p_{data}$.
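
A minimal PyTorch sketch of the DSM objective at a single noise level; the two-layer `score_net` is a hypothetical placeholder for whatever architecture is actually used:

```python
import torch
import torch.nn as nn

# Hypothetical score network: s_theta(x_tilde) approximates the score of
# the sigma-perturbed data distribution.
score_net = nn.Sequential(nn.Linear(2, 128), nn.SiLU(), nn.Linear(128, 2))

def dsm_loss(x, sigma):
    """Denoising score matching loss at one noise level sigma."""
    eps = torch.randn_like(x)
    x_tilde = x + sigma * eps                  # perturb: x~ = x + sigma * eps
    target = -(x_tilde - x) / sigma**2         # known conditional score
    pred = score_net(x_tilde)
    return ((pred - target) ** 2).sum(dim=1).mean()

x = torch.randn(64, 2)                         # stand-in batch of "data"
loss = dsm_loss(x, sigma=0.1)
loss.backward()
```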

3. Annealed Langevin Dynamics (ALD)

To resolve this trade-off, ALD uses a decreasing sequence of noise levels $\sigma_1 > \sigma_2 > \dots > \sigma_L$.

  • Process: Start with a sample from a high-noise distribution (where the score provides a rough "compass" direction) and iteratively refine the sample at lower noise levels. This lets the model explore the space globally before focusing on local details; a sampling sketch follows.
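
A sketch of the ALD loop, assuming the step-size schedule from the original NCSN paper ($\alpha_i = \epsilon \cdot \sigma_i^2 / \sigma_L^2$) and a hypothetical noise-conditional model `score_net(x, sigma)`:

```python
import torch

@torch.no_grad()
def annealed_langevin(score_net, sigmas, n_steps=100, step_lr=2e-5,
                      shape=(64, 2)):
    """Sample by running Langevin dynamics at each noise level in turn."""
    x = torch.randn(shape) * sigmas[0]         # start from a wide Gaussian
    for sigma in sigmas:                       # sigma_1 > ... > sigma_L
        alpha = step_lr * (sigma / sigmas[-1]) ** 2   # per-level step size
        for _ in range(n_steps):
            noise = torch.randn_like(x)
            # Langevin update: drift along the score, plus exploration noise.
            x = x + alpha * score_net(x, sigma) + (2 * alpha) ** 0.5 * noise
    return x
```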

4. Continuous Formulation: SDEs and ODEs

The lecture unifies DDPM (Denoising Diffusion Probabilistic Models) and NCSN (Noise Conditional Score Networks) under a single SDE framework.

  • Forward SDE: $dx = f(x, t)dt + g(t)dw$, where $f$ is the drift and $g$ is the diffusion coefficient.
  • Reverse SDE: The process can be reversed to generate data: $dx = [f(x, t) - g(t)^2 \nabla_x \log p_t(x)]dt + g(t)d\bar{w}$, where time runs backward and $\bar{w}$ is a reverse-time Wiener process.
  • Probability Flow ODE: Dropping the stochastic term and halving the score coefficient gives a deterministic ODE, $dx = [f(x, t) - \frac{1}{2}g(t)^2 \nabla_x \log p_t(x)]dt$, that shares the same marginal distributions $p_t$ as the SDE. This enables deterministic sampling and the use of advanced numerical solvers (e.g., Runge-Kutta); both update rules are sketched below.
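
The contrast between the two samplers, sketched under the simplifying assumption of a variance-exploding process with $f(x, t) = 0$; `g` and `score` are hypothetical callables for the diffusion coefficient and the learned score:

```python
import torch

def reverse_sde_step(x, t, dt, g, score):
    # Euler-Maruyama step of dx = [f - g^2 * score] dt + g dw_bar,
    # integrated backward in time (dt < 0); here f = 0 by assumption.
    drift = -g(t) ** 2 * score(x, t)
    noise = torch.randn_like(x) * abs(dt) ** 0.5
    return x + drift * dt + g(t) * noise

def pf_ode_step(x, t, dt, g, score):
    # Euler step of dx = [f - (1/2) g^2 * score] dt: deterministic,
    # but with the same marginals p_t as the SDE.
    drift = -0.5 * g(t) ** 2 * score(x, t)
    return x + drift * dt
```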

5. Advanced Sampling: DPM-Solver

  • The Problem: Standard ODE solvers (like Euler) require many steps (high NFE) to maintain accuracy.
  • The Solution: Since the drift term $f(x, t)$ is linear in $x$ for common noise schedules, DPM-Solver uses the variation-of-constants formula to solve the linear part exactly and discretize only the non-linear part (the neural-network score approximation); the first-order update is sketched after this list.
  • Result: This significantly reduces the required NFE to achieve high-quality image generation without needing to retrain the model.
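
A sketch of the resulting first-order update (DPM-Solver-1) in the noise-prediction parameterization, where $\lambda_t = \log(\alpha_t/\sigma_t)$ is the half log-SNR; `alpha`, `sigma`, and `eps_net` are hypothetical callables standing in for the model's noise schedule and network:

```python
import math

def dpm_solver_1_step(x_s, s, t, alpha, sigma, eps_net):
    """One first-order DPM-Solver step from time s to time t (t < s).

    The linear drift is handled exactly via the alpha(t)/alpha(s) ratio
    (variation of constants); only the network term is discretized.
    """
    lam_s = math.log(alpha(s) / sigma(s))      # half log-SNR at s
    lam_t = math.log(alpha(t) / sigma(t))
    h = lam_t - lam_s                          # step size in log-SNR
    return (alpha(t) / alpha(s)) * x_s \
        - sigma(t) * math.expm1(h) * eps_net(x_s, s)
```

Each step costs a single network evaluation, which is why a handful of such steps can keep the NFE low.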

Synthesis/Conclusion

The transition from discrete diffusion (DDPM) to continuous SDE/ODE frameworks represents a major evolution in generative modeling. By viewing the generation process as a flow toward high-density regions guided by the score function, researchers have developed more efficient, deterministic, and high-quality sampling methods. The unification of these paradigms allows for the application of sophisticated mathematical tools from differential equations to improve the speed and fidelity of modern generative AI.
