Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 4 - Latent Space & Guidance
Key Concepts
- Multimodal Guided Generation: The process of generating images conditioned on external inputs like text prompts or reference images.
- Latent Space: A lower-dimensional, compressed representation of data that is more computationally tractable than raw pixel space.
- Variational Autoencoder (VAE): An architecture consisting of an encoder (compressing input to a latent distribution) and a decoder (reconstructing the input).
- Semantic vs. Perceptual Similarity: Semantic similarity refers to global structure/meaning; perceptual similarity refers to local textures and human-perceived quality.
- Classifier-Free Guidance (CFG): A technique to guide generation by interpolating between unconditional and conditional noise predictions, eliminating the need for a separate classifier.
- Contrastive Learning (CLIP): A method to align text and image embeddings in a shared space to enable cross-modal conditioning.
1. Representing Noisy Images: The Latent Space
The lecture addresses the inefficiency of working in pixel space: a $1024 \times 1024$ RGB image has roughly $3 \times 10^6$ dimensions. Pixel space is redundant, high-dimensional, and lacks meaningful structure for diffusion models.
- The Wish List: A space that is tractable, compact, and meaningful (where valid images form clusters).
- Autoencoder Framework: Uses an encoder ($E_\phi$) to compress images into a latent representation ($z$) and a decoder ($D_\theta$) to reconstruct them.
- Variational Autoencoder (VAE): To structure the latent space, the encoder outputs a mean ($\mu$) and variance ($\sigma^2$) of a distribution. The latent $z$ is sampled from this distribution, which is regularized to approximate a standard normal distribution (the prior).
- Loss Function: Composed of a reconstruction loss (pixel-wise L2) and a regularization loss (KL divergence between the latent distribution and the prior).
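The two loss terms above can be sketched in NumPy. This is a minimal illustration, not the lecture's implementation: the function names are illustrative, and the KL term uses the standard closed form for a diagonal Gaussian against a standard-normal prior.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    # Sample z = mu + sigma * eps with eps ~ N(0, I); writing the sample
    # this way keeps it differentiable w.r.t. mu and log_var in practice.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def vae_loss(x, x_recon, mu, log_var):
    # Reconstruction term: pixel-wise L2 between input and decoder output.
    recon = np.mean((x - x_recon) ** 2)
    # Regularization term: closed-form KL( N(mu, sigma^2) || N(0, I) ).
    kl = -0.5 * np.mean(1 + log_var - mu**2 - np.exp(log_var))
    return recon + kl
```

Note that when $\mu = 0$ and $\log \sigma^2 = 0$, the latent distribution already matches the prior, so the KL term vanishes.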
2. Combating Blur and Improving Quality
Standard VAEs often produce blurry images due to the pixel-wise L2 loss. Two strategies are introduced:
- Perceptual Loss (LPIPS): Compares feature maps from pre-trained convolutional networks rather than raw pixels, focusing on structural similarity rather than exact pixel alignment.
- Adversarial Loss (GANs): A discriminator network is trained to distinguish between real images and decoder outputs, forcing the decoder to produce sharper, more realistic details.
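A hedged sketch of the perceptual-loss idea: instead of comparing raw pixels, compare feature maps from several layers of a frozen pre-trained network. The real LPIPS metric uses learned per-channel weights on VGG/AlexNet features; here the extractor is left abstract and the per-layer distances are simply averaged.

```python
import numpy as np

def perceptual_loss(feats_real, feats_fake):
    # feats_real / feats_fake: lists of feature maps, one per chosen layer
    # of a frozen pre-trained network (the extractor itself is assumed).
    # Averaging L2 distances across layers compares structure, not pixels.
    return sum(np.mean((fr - ff) ** 2)
               for fr, ff in zip(feats_real, feats_fake)) / len(feats_real)
```

Because the features are computed by a network trained for recognition, two images with identical content but slightly shifted pixels score as similar, which is exactly what the pixel-wise L2 loss fails to do.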
3. Representing Conditions (Text and Images)
- Text Embeddings: Uses Transformers and tokenization (subword level). The attention mechanism allows tokens to be represented as a function of others. The final embedding of the last token in a decoder-only transformer often captures the global semantic meaning of the sentence.
- Image Embeddings: Uses Vision Transformers (ViT), which treat image patches as tokens.
- CLIP (Contrastive Language-Image Pre-training): Projects text and image embeddings into a shared space. It uses a symmetric cross-entropy loss to maximize the similarity of true image-caption pairs while pushing unrelated pairs apart.
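The symmetric cross-entropy loss can be sketched as follows. This is a simplified NumPy version of the CLIP objective (the real model also learns the temperature); matching image-caption pairs sit on the diagonal of the similarity matrix, and the loss is averaged over both the image-to-text and text-to-image directions.

```python
import numpy as np

def clip_loss(img_emb, txt_emb, temperature=0.07):
    # Normalize embeddings so the dot product is cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (N, N) similarity matrix
    labels = np.arange(len(logits))           # true pairs on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Symmetric: classify the right caption for each image, and vice versa.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Maximizing diagonal similarity while the softmax suppresses off-diagonal entries is what pulls true pairs together and pushes unrelated pairs apart in the shared space.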
4. Guided Generation Frameworks
The goal is to steer the diffusion/flow-matching process toward a condition $Y$.
- Classifier Guidance: Uses a separate classifier to provide gradients that shift the mean of the diffusion process.
- Drawback: Requires training a classifier on noisy images and performing expensive backward passes during inference.
- Classifier-Free Guidance (CFG):
- Methodology: During training, the model is trained on both unconditional (no prompt) and conditional (with prompt) inputs.
- Inference: The noise prediction is calculated as: $\epsilon_{guided} = \epsilon_{uncond} + w \cdot (\epsilon_{cond} - \epsilon_{uncond})$.
- Guidance Scale ($w$): A hyperparameter that controls how strongly the model adheres to the condition. Higher values increase adherence but may introduce artifacts.
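The CFG combination at inference is a one-line extrapolation; a minimal sketch (array names are illustrative):

```python
import numpy as np

def cfg_noise(eps_uncond, eps_cond, w):
    # Classifier-free guidance: move from the unconditional prediction
    # toward the conditional one with scale w.
    #   w = 0 -> purely unconditional
    #   w = 1 -> purely conditional
    #   w > 1 -> extrapolates past the conditional prediction,
    #            strengthening adherence at the risk of artifacts
    return eps_uncond + w * (eps_cond - eps_uncond)
```

Each denoising step thus requires two forward passes of the same network (with and without the prompt), but no external classifier and no backward pass.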
5. Synthesis and Conclusion
The modern generation paradigm involves:
- Training a VAE to map images to a smooth, semantically meaningful latent space.
- Training a diffusion/flow-matching model in this latent space to handle the generation process efficiently.
- Using CFG to incorporate conditions (text/images) without the overhead of external classifiers.
The encoder acts as a low-pass filter focusing on semantic meaning, while the decoder is tasked with reconstructing high-frequency perceptual details. This separation of concerns allows for scalable, high-quality image generation.