Stanford CS25: V5 I Transformers in Diffusion Models for Image Generation and Beyond
By Stanford Online
Key Concepts
- Diffusion Models & Flow Matching: Iterative denoising processes for image generation.
- Latent Space vs. Pixel Space Diffusion: Operating in a compressed latent space vs. directly on pixels.
- U-Net Architecture: Early dominant architecture for diffusion models, characterized by downsampling and upsampling blocks with skip connections.
- Diffusion Transformers: Replacing the convolutional U-Net backbone with transformer blocks for improved scaling and integration with LLMs.
- Adaptive Layer Normalization (AdaLN): A technique to inject conditioning information (e.g., time step, class labels, text embeddings) into transformer blocks.
- Multi-Modal Diffusion Transformer (MMDiT): A variant of attention that models dependencies of different modalities (e.g., text and image) in separate spaces.
- Parameter Sharing: Sharing parameters (e.g., QKV projections, MLP layers) across transformer blocks to improve efficiency.
- In-Context Learning: Enabling diffusion models to generate images based on a few example images, similar to LLMs.
Introduction to Diffusion Models and Flow Matching
The talk focuses on the generative side of the visual modality, specifically diffusion models and their architectures for image generation. It starts with examples of text-to-image generation, highlighting the photorealism achieved by open models. Diffusion models are presented as iterative denoising processes that start from random noise and refine it into a photorealistic image. Conditioning the denoising process on a text prompt steers the result, so the model can generate even abstract creatures that match the description.
Components of a Text-to-Image Model
A mental model of a state-of-the-art text-to-image model is developed, outlining the necessary components and their connections:
- Text Encoders: Embed text prompts into numerical representations. Modern models like Stable Diffusion 3 use multiple text encoders.
- Noisy Latents: Random noise drawn from a Gaussian distribution, serving as the starting point for the denoising process.
- Time Step: A value indicating the amount of noise added to the latents, influencing the denoising aggressiveness.
- Core Diffusion Network: A neural network that iteratively refines the noisy latents based on the text embeddings and time step.
- Decoder Model: Converts the refined latents into an image.
The talk distinguishes between latent space diffusion models (more common due to computational efficiency) and pixel space diffusion models.
Training and Inference
The training process adds noise to clean images and trains the model to predict the noise that was added (the epsilon objective). During sampling, this noise prediction is applied repeatedly, step by step, to denoise a random noise vector into an image.
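The noising step and the epsilon objective can be sketched as follows. This is a minimal illustration that assumes the DDPM-style closed form x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps; the linear beta schedule, the tensor shapes, and the `toy_model` stand-in are assumptions for illustration, not details from the talk.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(x0, eps, t, alpha_bar):
    """Forward process: mix the clean sample with Gaussian noise at step t."""
    a = alpha_bar[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

def epsilon_loss(model, x0, t, alpha_bar, rng):
    """MSE between the noise that was added and the model's prediction of it."""
    eps = rng.standard_normal(x0.shape)
    x_t = add_noise(x0, eps, t, alpha_bar)
    return float(np.mean((model(x_t, t) - eps) ** 2))

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.standard_normal((4, 8, 8))             # stand-in for a clean latent
toy_model = lambda x_t, t: np.zeros_like(x_t)   # hypothetical untrained network
loss = epsilon_loss(toy_model, x0, t=500, alpha_bar=alpha_bar, rng=rng)
```

A real training loop would sample `t` at random for every example and backpropagate this loss through the network; the sketch only shows how the target is constructed.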
Flow matching is introduced as an alternative approach where noise and clean data are connected through a straight path, simplifying the process compared to diffusion models.
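A rough sketch of the straight path, assuming the common convention in which t = 0 is clean data and t = 1 is pure noise (the talk summary does not fix a convention). The `oracle` velocity function is hypothetical and only exists to show that following the true velocity recovers the data.

```python
import numpy as np

def interpolate(x0, noise, t):
    """Point on the straight line from data (t=0) to noise (t=1)."""
    return (1.0 - t) * x0 + t * noise

def velocity_target(x0, noise):
    """Constant velocity along that line; the network regresses onto this."""
    return noise - x0

def euler_sample(velocity_fn, noise, num_steps=10):
    """Integrate from t=1 (noise) back to t=0 (data) with Euler steps."""
    x = noise.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        x = x - dt * velocity_fn(x, t)
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8, 8))
noise = rng.standard_normal((4, 8, 8))
# Hypothetical oracle that returns the true velocity for this one pair.
oracle = lambda x, t: velocity_target(x0, noise)
recovered = euler_sample(oracle, noise)
```

Because the path is a straight line, the regression target does not depend on t, which is part of what makes the formulation simpler than a noise schedule.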
Core Requirements of a Diffusion Model
The core requirements for a diffusion model are:
- Dealing with noisy inputs.
- Handling conditions (text, class, time step).
- Modeling dependencies between noisy inputs and conditions.
- Producing final outputs (decoding or upsampling).
Early Architectures: U-Net
Early diffusion models like DDPM and latent diffusion models used U-Net-based architectures. The U-Net architecture consists of:
- Input convolutional stem.
- Down blocks (convolutional and transformer blocks).
- Middle block (no resolution change).
- Up blocks (upsampling layers).
- Output layer.
The U-Net architecture is described as "giant" and "prohibitive" due to its complexity.
Transition to Transformers
The motivation for transitioning to pure transformer-based architectures includes:
- Benefiting from advancements in transformer architectures (SwiGLU, QK normalization).
- Good scaling properties.
- Easy integration with LLM-based backbones.
- Eliminating the giant U-Net.
The forward pass in a vision transformer (ViT) is presented as a familiar starting point, with modifications for image generation.
Diffusion Transformer Details
Key components of a diffusion transformer:
- Time Step Embedding: Time steps are embedded into sinusoidal frequencies and passed through a shallow MLP.
- Class Label Embedding: A simple nn.Embedding (lookup table) layer is used.
- Patchification: Done with a convolutional stem.
- Positional Encodings: Standard sine/cosine scheme.
- Adaptive Layer Norm (AdaLN): Used to inject conditioning information into the transformer blocks.
Instead of cross-attention, the blocks use self-attention only, and the conditioning enters through AdaLN modulation applied around the self-attention and MLP layers. The final outputs are obtained through a single-layer decoder and unpatchification.
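The input and output plumbing described above can be sketched with plain array operations. This is an illustrative sketch: a convolutional stem with kernel size and stride equal to the patch size is equivalent to cutting the latent into patches and applying a linear projection, so only the cutting is shown and the projection is omitted. The shapes (4 channels, 32 x 32 latent, patch size 2, embedding width 256) are assumptions.

```python
import numpy as np

def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal features for a batch of scalar timesteps, shape [batch, dim]."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = np.asarray(t, dtype=np.float64)[:, None] * freqs[None, :]
    return np.concatenate([np.cos(args), np.sin(args)], axis=-1)

def patchify(x, p):
    """[C, H, W] latent -> [num_patches, p*p*C] token matrix."""
    c, h, w = x.shape
    x = x.reshape(c, h // p, p, w // p, p)
    x = x.transpose(1, 3, 2, 4, 0)            # [H/p, W/p, p, p, C]
    return x.reshape((h // p) * (w // p), p * p * c)

def unpatchify(tokens, p, c, h, w):
    """Inverse of patchify: tokens -> [C, H, W]."""
    x = tokens.reshape(h // p, w // p, p, p, c)
    x = x.transpose(4, 0, 2, 1, 3)            # [C, H/p, p, W/p, p]
    return x.reshape(c, h, w)

latent = np.arange(4 * 32 * 32, dtype=np.float64).reshape(4, 32, 32)
tokens = patchify(latent, p=2)                 # 256 tokens, each of size 16
restored = unpatchify(tokens, p=2, c=4, h=32, w=32)
t_emb = timestep_embedding([0, 500], dim=256)  # one row per timestep
```

In a full model the sinusoidal features would then pass through the shallow MLP mentioned above before being used as conditioning.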
Adaptive Layer Normalization (AdaLN)
Adaptive Layer Normalization is crucial for modeling stylistic aspects in images. It involves modulating the standard layer norm with parameters learned from the condition space (time step and class embeddings). AdaLN performs better than cross-attention and is more compute-efficient.
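A minimal sketch of the modulation, assuming the usual formulation in which a linear layer regresses a shift and a scale from the conditioning vector and applies them to normalized tokens. The function names, the dimensions, and the single linear layer are illustrative assumptions; the gating on the residual branch used by the adaLN-Zero variant in the DiT paper is omitted here.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """LayerNorm over the last axis, without learned affine parameters."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def adaln_modulate(x, cond, w, b):
    """Scale and shift normalized tokens with parameters regressed from cond.

    x:    [tokens, dim]   token features
    cond: [cond_dim]      e.g. timestep embedding plus class embedding
    w, b: linear layer mapping cond to (shift, scale);
          shapes [cond_dim, 2 * dim] and [2 * dim]
    """
    shift, scale = np.split(cond @ w + b, 2)
    return layer_norm(x) * (1.0 + scale) + shift

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 64))
cond = rng.standard_normal(32)
w = np.zeros((32, 128))
b = np.zeros(128)
out = adaln_modulate(x, cond, w, b)   # zero weights: plain layer norm
```

With the regression layer initialized to zero, the block starts out as an ordinary layer norm and learns how strongly the condition should bend each feature.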
PixArt-α: Text-to-Image with Transformers
PixArt-α is an early work that enabled text-to-image generation in a diffusion transformer architecture. It uses:
- A text encoder (Flan-T5-XXL) to embed text prompts.
- Self-attention on noisy latents.
- Cross-attention between noisy latents and text embeddings.
- Initialization from a class-conditional diffusion model.
PixArt-α uses embedding tables to modulate time step embeddings, reducing computation by 27%.
Addressing Quadratic Complexity
Vanilla attention has quadratic time and memory complexity, which becomes prohibitive for high-resolution images. Solutions include:
- Operating on a more compressed latent space.
- Using linear attention mechanisms.
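The arithmetic behind both points can be made concrete. The sketch below counts attention tokens for a square image under assumed settings: an 8x autoencoder with patch size 2, and a 32x autoencoder with patch size 1. These factors are illustrative assumptions, not numbers quoted in the talk summary.

```python
def num_tokens(image_size, vae_downsample, patch_size):
    """Tokens seen by attention for a square image."""
    latent_side = image_size // vae_downsample
    return (latent_side // patch_size) ** 2

def attention_matrix_entries(n_tokens):
    """Entries in one n x n attention score matrix (per head, per layer)."""
    return n_tokens ** 2

# Doubling the image side quadruples the tokens and multiplies the
# score-matrix size by 16.
for size in (512, 1024, 2048, 4096):
    n = num_tokens(size, vae_downsample=8, patch_size=2)
    print(size, n, attention_matrix_entries(n))

# A more aggressive autoencoder shrinks the sequence before attention runs.
n_f8 = num_tokens(1024, vae_downsample=8, patch_size=2)    # 4096 tokens
n_f32 = num_tokens(1024, vae_downsample=32, patch_size=1)  # 1024 tokens
```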
Sana Architecture
Sana uses both a more compressed latent space and a linear variant of attention. It employs:
- Linear self-attention (avoids the O(n²) attention computation).
- Cross-attention to model dependencies between noisy latents and text prompts.
- MixFFN blocks (inverted residual blocks and point-wise convolutions) to model local dependencies.
- No positional embeddings.
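To show why a linear variant avoids the n x n matrix, here is a sketch of one common form of linear attention that uses a ReLU feature map. The choice of feature map, the single head, and the shapes are assumptions for illustration. The reference function computes the same result through the explicit score matrix so the two can be compared.

```python
import numpy as np

def relu_linear_attention(q, k, v, eps=1e-6):
    """Linear attention with a ReLU feature map.

    q, k: [n, d], v: [n, d_v]. Cost is O(n * d * d_v) rather than O(n^2 * d).
    """
    q, k = np.maximum(q, 0.0), np.maximum(k, 0.0)
    kv = k.T @ v                     # [d, d_v], summarizes all keys and values once
    k_sum = k.sum(axis=0)            # [d], normalizer shared by every query
    return (q @ kv) / (q @ k_sum + eps)[:, None]

def quadratic_reference(q, k, v, eps=1e-6):
    """Same result computed through the explicit n x n score matrix."""
    q, k = np.maximum(q, 0.0), np.maximum(k, 0.0)
    scores = q @ k.T                 # [n, n]
    return (scores @ v) / (scores.sum(axis=1) + eps)[:, None]

rng = np.random.default_rng(0)
q = rng.standard_normal((128, 32))
k = rng.standard_normal((128, 32))
v = rng.standard_normal((128, 32))
fast = relu_linear_attention(q, k, v)
slow = quadratic_reference(q, k, v)
```

The trick is purely associativity: multiplying keys with values first means the sequence length n only ever appears linearly.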
Multi-Modal Diffusion Transformer (MMDiT)
MMDiT processes the two modalities (text and image) with separate weights while letting them interact through joint attention. It involves:
- Separate QKV projections for text embeddings and noisy latents.
- Concatenation of the projected representations before computing attention.
- Separate adaptive layer norm parameters for each modality.
MMDiT aims to mitigate biases in text embeddings and allows for co-evolution of embeddings from different modalities.
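The projection-then-concatenation pattern can be sketched as a single-head attention function. This is an illustrative sketch only: the weight dictionaries, the dimensions, and the absence of normalization, multiple heads, and output projections are simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(txt, img, w_txt, w_img):
    """Single-head MMDiT-style attention.

    txt: [n_txt, d], img: [n_img, d]
    w_txt, w_img: dicts with separate 'q', 'k', 'v' matrices of shape [d, d]
    """
    n_txt = txt.shape[0]
    # Each modality is projected with its own weights...
    q = np.concatenate([txt @ w_txt["q"], img @ w_img["q"]], axis=0)
    k = np.concatenate([txt @ w_txt["k"], img @ w_img["k"]], axis=0)
    v = np.concatenate([txt @ w_txt["v"], img @ w_img["v"]], axis=0)
    # ...but attention runs over the concatenated sequence.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    out = attn @ v
    return out[:n_txt], out[n_txt:]   # split back into the two streams

rng = np.random.default_rng(0)
d = 16
make_weights = lambda: {name: rng.standard_normal((d, d)) / np.sqrt(d) for name in "qkv"}
w_txt, w_img = make_weights(), make_weights()
txt = rng.standard_normal((5, d))     # text token embeddings
img = rng.standard_normal((12, d))    # noisy latent tokens
txt_out, img_out = joint_attention(txt, img, w_txt, w_img)
```

Because both streams come back out of the attention, the text representation is updated in every block rather than staying frozen, which is the co-evolution mentioned above.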
Simplifying the Design
The talk explores ways to simplify the diffusion transformer design:
- Parameter sharing (QKV, MLP, AdaLN).
- Self-attention on a concatenated space of image tokens and text tokens.
Apple's work (DiT-Air) demonstrates that a simplified design with parameter sharing and self-attention on a concatenated space can achieve good performance.
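As a rough illustration of what sharing buys, the sketch below counts AdaLN parameters with and without sharing across blocks. The model size is hypothetical, and the assumption of six modulation vectors per block (shift, scale, and gate for the attention and MLP sublayers) follows the adaLN-Zero convention rather than anything stated in the talk summary.

```python
def adaln_params(dim, cond_dim, n_modulations=6):
    """Weights plus biases of one linear layer regressing modulation vectors."""
    return cond_dim * n_modulations * dim + n_modulations * dim

def total_adaln_params(dim, cond_dim, depth, shared):
    """Per-block AdaLN layers versus a single layer reused by every block."""
    per_layer = adaln_params(dim, cond_dim)
    return per_layer if shared else per_layer * depth

dim, cond_dim, depth = 1024, 1024, 24   # hypothetical model size
separate = total_adaln_params(dim, cond_dim, depth, shared=False)
shared = total_adaln_params(dim, cond_dim, depth, shared=True)
print(separate, shared)
```

The same counting applies to shared QKV or MLP weights; the savings grow linearly with depth.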
Injecting More Control
Methods for injecting more control into text-to-image models:
- Learning an auxiliary network to compute salient representations from structural image signals (ControlNet).
- Increasing input channels to accept more controls (Flux Control Framework).
- Learning a small adapter network to model dependencies between conditions and noisy latent tokens.
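The channel-widening approach can be sketched as follows: the control signal is encoded to the same token grid as the noisy latents, concatenated along the channel axis, and fed through a widened input projection. The helper name, the shapes, and the zero initialization of the new weights are assumptions for illustration.

```python
import numpy as np

def widen_input_projection(w, extra_in):
    """Extend an [in, out] projection to accept extra_in more input channels.

    New rows are zero-initialized (an assumption here) so the widened layer
    initially reproduces the pretrained layer's output.
    """
    return np.concatenate([w, np.zeros((extra_in, w.shape[1]))], axis=0)

rng = np.random.default_rng(0)
w = rng.standard_normal((16, 64))          # pretrained projection, 16 channels in
w_wide = widen_input_projection(w, extra_in=16)
noisy = rng.standard_normal((256, 16))     # noisy latent tokens
control = rng.standard_normal((256, 16))   # e.g. an encoded edge or depth map
tokens = np.concatenate([noisy, control], axis=-1) @ w_wide
```

Before any fine-tuning the control channels have no effect, so training starts from the pretrained model's behavior and gradually learns to use the extra signal.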
Next Generation Architectures
Next-generation architectures aim to enable in-context learning in diffusion models. This involves:
- Starting from a pre-trained LLM.
- Adding components to generate images (e.g., auto-regression on discrete tokens, diffusion on continuous tokens).
Examples include Bagel, Lada, Mada, and Transfusion.
Conclusion
The talk provides a comprehensive overview of diffusion models and their architectures for image generation, covering early U-Net-based models, the transition to transformers, techniques for improving efficiency and control, and next-generation architectures for in-context learning. It emphasizes the importance of adaptive layer normalization, multi-modal diffusion transformers, and parameter sharing. The talk concludes by highlighting promising directions for future research, including exploring hybrid architectures and mechanistic interpretability.