Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 4 - Latent Space & Guidance

Key Concepts

  • Multimodal Guided Generation: The process of generating images conditioned on external inputs like text prompts or reference images.
  • Latent Space: A lower-dimensional, compressed representation of data that is more computationally tractable than raw pixel space.
  • Variational Autoencoder (VAE): An architecture consisting of an encoder (compressing input to a latent distribution) and a decoder (reconstructing the input).
  • Semantic vs. Perceptual Similarity: Semantic similarity refers to global structure/meaning; perceptual similarity refers to local textures and human-perceived quality.
  • Classifier-Free Guidance (CFG): A technique to guide generation by interpolating between unconditional and conditional noise predictions, eliminating the need for a separate classifier.
  • Contrastive Learning (CLIP): A method to align text and image embeddings in a shared space to enable cross-modal conditioning.

1. Representing Noisy Images: The Latent Space

The lecture addresses the inefficiency of working in pixel space (e.g., a $1024 \times 1024$ RGB image has roughly $3 \times 10^6$ dimensions). Pixel space is redundant, high-dimensional, and lacks meaningful structure for diffusion models.

  • The Wish List: A space that is tractable, compact, and meaningful (where valid images form clusters).
  • Autoencoder Framework: Uses an encoder ($E_\phi$) to compress images into a latent representation ($z$) and a decoder ($D_\theta$) to reconstruct them.
  • Variational Autoencoder (VAE): To structure the latent space, the encoder outputs a mean ($\mu$) and variance ($\sigma^2$) of a distribution. The latent $z$ is sampled from this distribution, which is regularized to approximate a standard normal distribution (the prior).
  • Loss Function: Composed of a reconstruction loss (pixel-wise L2) and a regularization loss (the KL divergence between the latent distribution and the prior); a minimal sketch follows this list.
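
Below is a minimal sketch of this setup in PyTorch; the `TinyVAE` architecture, layer sizes, and the KL weight `beta` are illustrative assumptions, not the configuration used in the lecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """Toy VAE: encoder E_phi outputs (mu, log_var); decoder D_theta reconstructs."""
    def __init__(self, in_ch=3, latent_ch=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 2 * latent_ch, 4, stride=2, padding=1),  # 2x channels: mu and log_var
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_ch, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, in_ch, 4, stride=2, padding=1),
        )

    def forward(self, x):
        mu, log_var = self.encoder(x).chunk(2, dim=1)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # reparameterization trick
        return self.decoder(z), mu, log_var

def vae_loss(x, x_hat, mu, log_var, beta=1e-4):
    recon = F.mse_loss(x_hat, x)  # pixel-wise L2 reconstruction
    # KL divergence between N(mu, sigma^2) and the standard normal prior
    kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + beta * kl
```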

2. Combating Blur and Improving Quality

Standard VAEs often produce blurry images due to the pixel-wise L2 loss. Two strategies are introduced:

  • Perceptual Loss (LPIPS): Compares feature maps from pre-trained convolutional networks rather than raw pixels, focusing on structural similarity rather than exact pixel alignment (a feature-space loss sketch follows this list).
  • Adversarial Loss (GANs): A discriminator network is trained to distinguish between real images and decoder outputs, forcing the decoder to produce sharper, more realistic details.
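
As a hedged illustration of the perceptual-loss idea, the sketch below compares intermediate feature maps of a frozen, pre-trained VGG16 (assuming a recent torchvision); the chosen layer indices are assumptions, and full LPIPS additionally applies learned per-channel weights on top of this feature comparison.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

class PerceptualLoss(torch.nn.Module):
    def __init__(self, layers=(3, 8, 15)):  # ReLU outputs of early/mid VGG16 conv blocks
        super().__init__()
        features = vgg16(weights="IMAGENET1K_V1").features.eval()
        for p in features.parameters():
            p.requires_grad_(False)          # the feature extractor stays frozen
        self.features = features
        self.layers = set(layers)

    def forward(self, x, x_hat):
        loss, hx, hy = 0.0, x, x_hat
        for i, layer in enumerate(self.features):
            hx, hy = layer(hx), layer(hy)
            if i in self.layers:
                # Compare feature maps instead of raw pixels
                loss = loss + F.mse_loss(hx, hy)
            if i >= max(self.layers):
                break
        return loss
```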

3. Representing Conditions (Text and Images)

  • Text Embeddings: Text is tokenized at the subword level and processed by a Transformer, whose attention mechanism lets each token's representation depend on the others. In a decoder-only transformer, the embedding of the final token often captures the global semantic meaning of the sentence.
  • Image Embeddings: Uses Vision Transformers (ViT), which treat image patches as tokens.
  • CLIP (Contrastive Language-Image Pre-training): Projects text and image embeddings into a shared space. It uses a symmetric cross-entropy loss to maximize the similarity of true image-caption pairs while pushing unrelated pairs apart (see the loss sketch below).
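
Here is a sketch of that symmetric contrastive objective; `img_emb` and `txt_emb` are assumed to be batches of already-computed image and text embeddings (matched row-by-row), and the temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    # Project both modalities onto the unit sphere of the shared embedding space
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)

    # Pairwise cosine similarities; entry (i, j) compares image i with caption j
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(len(img_emb), device=img_emb.device)  # true pairs on the diagonal

    # Symmetric cross-entropy: pull true pairs together, push unrelated pairs apart
    loss_i = F.cross_entropy(logits, targets)      # image -> caption
    loss_t = F.cross_entropy(logits.t(), targets)  # caption -> image
    return 0.5 * (loss_i + loss_t)
```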

4. Guided Generation Frameworks

The goal is to steer the diffusion/flow-matching process toward a condition $Y$.

  • Classifier Guidance: Uses a separate classifier to provide gradients that shift the mean of the diffusion process.
    • Drawback: Requires training a classifier on noisy images and performing expensive backward passes during inference.
  • Classifier-Free Guidance (CFG):
    • Methodology: During training, the condition is randomly dropped (e.g., replaced by a null/empty prompt), so a single model learns both unconditional and conditional noise predictions.
    • Inference: The noise prediction is calculated as: $\epsilon_{guided} = \epsilon_{uncond} + w \cdot (\epsilon_{cond} - \epsilon_{uncond})$.
    • Guidance Scale ($w$): A hyperparameter that controls how strongly the model adheres to the condition; $w = 1$ recovers the purely conditional prediction, and larger values increase adherence but may introduce artifacts (see the sketch after this list).
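
The sketch below shows the inference-time computation, assuming a `model` that predicts noise from the latent, the timestep, and a (possibly null) prompt embedding; the default guidance scale is illustrative.

```python
import torch

def cfg_noise_prediction(model, z_t, t, cond_emb, null_emb, w=7.5):
    eps_cond = model(z_t, t, cond_emb)    # prediction with the prompt
    eps_uncond = model(z_t, t, null_emb)  # prediction with the "empty" prompt
    # epsilon_guided = epsilon_uncond + w * (epsilon_cond - epsilon_uncond)
    return eps_uncond + w * (eps_cond - eps_uncond)
```

In practice the two predictions are usually computed in a single batched forward pass by stacking the conditional and unconditional inputs.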

5. Synthesis and Conclusion

The modern generation paradigm involves:

  1. Training a VAE to map images to a smooth, semantically meaningful latent space.
  2. Training a diffusion/flow-matching model in this latent space to handle the generation process efficiently.
  3. Using CFG to incorporate conditions (text/images) without the overhead of external classifiers.

The encoder acts as a low-pass filter focusing on semantic meaning, while the decoder is tasked with reconstructing high-frequency perceptual details. This separation of concerns allows for scalable, high-quality image generation.
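
To tie the three pieces together, here is a high-level sketch of sampling with such a system; `text_encoder`, `denoiser`, and `vae_decoder` are hypothetical stand-ins for the trained components, and the Euler-style update assumes a flow-matching parameterization with time running from noise at $t = 1$ to data at $t = 0$.

```python
import torch

@torch.no_grad()
def generate(text_encoder, denoiser, vae_decoder, prompt_tokens, null_tokens,
             latent_shape=(1, 4, 64, 64), steps=50, w=7.5):
    cond = text_encoder(prompt_tokens)  # condition Y as a text embedding
    null = text_encoder(null_tokens)    # "no prompt" embedding for CFG
    z = torch.randn(latent_shape)       # start from pure noise in the latent space

    for i in range(steps):
        t = torch.full((latent_shape[0],), 1.0 - i / steps)
        # Classifier-free guidance (Section 4) applied to the velocity prediction
        v = denoiser(z, t, null) + w * (denoiser(z, t, cond) - denoiser(z, t, null))
        z = z - v / steps               # one Euler step toward the data distribution
    return vae_decoder(z)               # decode the clean latent back to pixels
```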
