Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 4 - Latent Space & Guidance
Key Concepts
- Multimodal Guided Generation: The process of generating images conditioned on external inputs like text prompts or reference images.
- Latent Space: A lower-dimensional, compressed representation of data that is more computationally tractable than raw pixel space.
- Variational Autoencoder (VAE): An architecture consisting of an encoder (compressing input to a latent distribution) and a decoder (reconstructing the input).
- Semantic vs. Perceptual Similarity: Semantic similarity refers to global structure/meaning; perceptual similarity refers to local textures and human-perceived quality.
- Classifier-Free Guidance (CFG): A technique to guide generation by interpolating between unconditional and conditional noise predictions, eliminating the need for a separate classifier.
- Contrastive Learning (CLIP): A method to align text and image embeddings in a shared space to enable cross-modal conditioning.
1. Representing Noisy Images: The Latent Space
The lecture addresses the inefficiency of working in pixel space: a $1024 \times 1024$ RGB image has roughly $3 \times 10^6$ dimensions. Pixel space is redundant, high-dimensional, and lacks meaningful structure for diffusion models.
- The Wish List: A space that is tractable, compact, and meaningful (where valid images form clusters).
- Autoencoder Framework: Uses an encoder ($E_\phi$) to compress images into a latent representation ($z$) and a decoder ($D_\theta$) to reconstruct them.
- Variational Autoencoder (VAE): To structure the latent space, the encoder outputs a mean ($\mu$) and variance ($\sigma^2$) of a distribution. The latent $z$ is sampled from this distribution, which is regularized to approximate a standard normal distribution (the prior).
- Loss Function: Composed of a reconstruction loss (pixel-wise L2) and a regularization loss (KL divergence between the latent distribution and the prior).
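The two loss terms above can be sketched in NumPy. This is a minimal illustration, not the lecture's implementation: the function names are illustrative, and the KL term uses the standard closed form for a diagonal Gaussian against a standard-normal prior.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    # Sample z = mu + sigma * eps with eps ~ N(0, I); writing the sample
    # this way keeps it differentiable w.r.t. mu and log_var in practice.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def vae_loss(x, x_recon, mu, log_var):
    # Reconstruction term: pixel-wise L2 between input and decoder output.
    recon = np.mean((x - x_recon) ** 2)
    # Regularization term: closed-form KL( N(mu, sigma^2) || N(0, I) ).
    kl = -0.5 * np.mean(1 + log_var - mu**2 - np.exp(log_var))
    return recon + kl
```

Note that when $\mu = 0$ and $\log \sigma^2 = 0$, the latent distribution already matches the prior, so the KL term vanishes.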
2. Combating Blur and Improving Quality
Standard VAEs often produce blurry images due to the pixel-wise L2 loss. Two strategies are introduced:
- Perceptual Loss (LPIPS): Compares feature maps from pre-trained convolutional networks rather than raw pixels, focusing on structural similarity rather than exact pixel alignment.
- Adversarial Loss (GANs): A discriminator network is trained to distinguish between real images and decoder outputs, forcing the decoder to produce sharper, more realistic details.
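A hedged sketch of the perceptual-loss idea: instead of comparing raw pixels, compare feature maps from several layers of a frozen pre-trained network. The real LPIPS metric uses learned per-channel weights on VGG/AlexNet features; here the extractor is left abstract and the per-layer distances are simply averaged.

```python
import numpy as np

def perceptual_loss(feats_real, feats_fake):
    # feats_real / feats_fake: lists of feature maps, one per chosen layer
    # of a frozen pre-trained network (the extractor itself is assumed).
    # Averaging L2 distances across layers compares structure, not pixels.
    return sum(np.mean((fr - ff) ** 2)
               for fr, ff in zip(feats_real, feats_fake)) / len(feats_real)
```

Because the features are computed by a network trained for recognition, two images with identical content but slightly shifted pixels score as similar, which is exactly what the pixel-wise L2 loss fails to do.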
3. Representing Conditions (Text and Images)
- Text Embeddings: Uses Transformers and tokenization (subword level). The attention mechanism allows tokens to be represented as a function of others. The final embedding of the last token in a decoder-only transformer often captures the global semantic meaning of the sentence.
- Image Embeddings: Uses Vision Transformers (ViT), which treat image patches as tokens.
- CLIP (Contrastive Language-Image Pre-training): Projects text and image embeddings into a shared space. It uses a symmetric cross-entropy loss to maximize the similarity of true image-caption pairs while pushing unrelated pairs apart.
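The symmetric cross-entropy loss can be sketched as follows. This is a simplified NumPy version of the CLIP objective (the real model also learns the temperature); matching image-caption pairs sit on the diagonal of the similarity matrix, and the loss is averaged over both the image-to-text and text-to-image directions.

```python
import numpy as np

def clip_loss(img_emb, txt_emb, temperature=0.07):
    # Normalize embeddings so the dot product is cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (N, N) similarity matrix
    labels = np.arange(len(logits))           # true pairs on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Symmetric: classify the right caption for each image, and vice versa.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Maximizing diagonal similarity while the softmax suppresses off-diagonal entries is what pulls true pairs together and pushes unrelated pairs apart in the shared space.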
4. Guided Generation Frameworks
The goal is to steer the diffusion/flow-matching process toward a condition $Y$.
- Classifier Guidance: Uses a separate classifier to provide gradients that shift the mean of the diffusion process.
- Drawback: Requires training a classifier on noisy images and performing expensive backward passes during inference.
- Classifier-Free Guidance (CFG):
- Methodology: During training, the model is trained on both unconditional (no prompt) and conditional (with prompt) inputs.
- Inference: The noise prediction is calculated as: $\epsilon_{guided} = \epsilon_{uncond} + w \cdot (\epsilon_{cond} - \epsilon_{uncond})$.
- Guidance Scale ($w$): A hyperparameter that controls how strongly the model adheres to the condition. Higher values increase adherence but may introduce artifacts.
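The CFG combination at inference is a one-line extrapolation; a minimal sketch (array names are illustrative):

```python
import numpy as np

def cfg_noise(eps_uncond, eps_cond, w):
    # Classifier-free guidance: move from the unconditional prediction
    # toward the conditional one with scale w.
    #   w = 0 -> purely unconditional
    #   w = 1 -> purely conditional
    #   w > 1 -> extrapolates past the conditional prediction,
    #            strengthening adherence at the risk of artifacts
    return eps_uncond + w * (eps_cond - eps_uncond)
```

Each denoising step thus requires two forward passes of the same network (with and without the prompt), but no external classifier and no backward pass.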
5. Synthesis and Conclusion
The modern generation paradigm involves:
- Training a VAE to map images to a smooth, semantically meaningful latent space.
- Training a diffusion/flow-matching model in this latent space to handle the generation process efficiently.
- Using CFG to incorporate conditions (text/images) without the overhead of external classifiers.
The encoder acts as a low-pass filter focusing on semantic meaning, while the decoder is tasked with reconstructing high-frequency perceptual details. This separation of concerns allows for scalable, high-quality image generation.