Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 5 - Architectures

By Stanford Online

Key Concepts

  • Diffusion Models (DDPM, Score Matching, Flow Matching): Paradigms for image generation using iterative denoising or vector field transformation.
  • Latent Space: A compressed, lower-dimensional representation of images used to make generation computationally tractable.
  • U-Net: An encoder-decoder architecture with skip connections, essential for capturing both global structure and local details.
  • Vision Transformer (ViT): An architecture using self-attention to process image patches, removing the inductive bias of convolutions.
  • Diffusion Transformer (DiT): A model applying transformer blocks to latent diffusion, using patchification and condition injection.
  • Adaptive Layer Norm (AdaLN): A modulation technique to inject time-step and condition information into model embeddings.
  • Multi-Modal Diffusion Transformer (MMDiT): Architectures (e.g., Stable Diffusion 3, Flux.1) that use joint attention to process image and text modalities simultaneously.
  • Positional Embeddings (RoPE): Techniques to encode spatial information. The lecture focuses on Rotary Positional Embeddings (RoPE), which rotate query and key vectors by position-dependent angles, in both 1D and 2D forms.

1. Image Generation Architectures: U-Net vs. Transformer

The lecture contrasts two primary architectural paradigms for image generation:

  • U-Net: Utilizes convolutions to scan images. It is effective for local feature extraction but struggles with long-range dependencies (e.g., generating consistent details in distant parts of an image). It relies on downsampling (encoder) and upsampling (decoder) with skip connections to preserve local details.
  • Diffusion Transformer (DiT): Replaces convolutions with self-attention. By "patchifying" images, the model allows every patch to interact with every other patch, enabling a global understanding of the image structure.
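
The patchifying step can be sketched in a few lines. This is a minimal illustration, not the lecture's implementation: the function name `patchify`, the nested-list layout (height, width, channels), and the toy 4 x 4 latent are all assumptions made for readability. A real DiT would follow this step with a learned linear projection of each flattened patch.

```python
def patchify(latent, patch_size):
    """Split an H x W x C latent (nested lists) into flattened patches.

    Returns (H/p) * (W/p) tokens, each of length p * p * C,
    in row-major order over the patch grid.
    """
    height, width = len(latent), len(latent[0])
    if height % patch_size or width % patch_size:
        raise ValueError("latent size must be divisible by patch_size")
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            token = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    token.extend(latent[row][col])  # append all C channels
            tokens.append(token)
    return tokens


# Toy 4 x 4 latent with 2 channels; each cell stores [row, col] for readability.
latent = [[[r, c] for c in range(4)] for r in range(4)]
tokens = patchify(latent, patch_size=2)
print(len(tokens), len(tokens[0]))  # 4 tokens, each of length 2 * 2 * 2 = 8
```

Once the latent is a flat list of tokens, self-attention can relate any patch to any other, which is the global interaction the bullet above describes.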

2. Condition Injection Methodologies

To guide generation (e.g., text-to-image), models must inject external signals like time-steps ($t$) and text prompts ($c$).

  • Adaptive Layer Norm (AdaLN): Presented as the most performant method for DiT. An MLP maps $t$ and $c$ to gate ($\alpha$), scale ($\gamma$), and shift ($\beta$) coefficients. The scale and shift modulate the normalized patch embeddings element-wise, and the gate weights each block's output before it is added back to the residual stream.
  • Cross-Attention: The model queries the text embeddings to inform the image generation process.
  • Joint Attention (MMDiT): The approach used by recent models such as Stable Diffusion 3. Image patches and text tokens are concatenated into a single sequence, so both modalities attend to each other directly within the same attention layers.
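
The AdaLN modulation can be written out for a single token as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the coefficients are hard-coded rather than produced by an MLP from $t$ and $c$, the `(1 + gamma)` form of the scale follows the convention commonly used in DiT-style code, and the doubling used for `branch_out` is only a stand-in for an attention or MLP sublayer.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize one token vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def adaln_modulate(x, gamma, beta):
    """Scale and shift the normalized token element-wise: LN(x) * (1 + gamma) + beta."""
    return [n * (1.0 + g) + b for n, g, b in zip(layer_norm(x), gamma, beta)]


def gated_residual(x, branch_out, alpha):
    """Add the branch output back to the residual stream, gated element-wise by alpha."""
    return [v + a * o for v, a, o in zip(x, alpha, branch_out)]


x = [1.0, 2.0, 3.0, 4.0]          # one patch embedding (toy values)
gamma = [0.5, 0.0, -0.5, 0.0]     # assumed scale coefficients
beta = [0.0, 1.0, 0.0, -1.0]      # assumed shift coefficients
alpha = [0.0, 0.0, 0.0, 0.0]      # assumed gate coefficients (all zero)

h = adaln_modulate(x, gamma, beta)
branch_out = [2.0 * v for v in h]  # hypothetical stand-in for attention or MLP
y = gated_residual(x, branch_out, alpha)
print(y)  # equals x, because a zero gate switches the branch off
```

With the gate at zero the block reduces to the identity, which shows how $\alpha$ controls how much each dimension of the branch output contributes.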

3. Positional Encoding Frameworks

Because self-attention has no built-in notion of token order (it is permutation-equivariant), transformers require explicit positional information:

  • Absolute Positional Embeddings: Hard-coded or learned vectors added to the input. They are simple, but the lecture argues they are injected at the wrong stage: once at the input level rather than inside attention, where the relationship between positions is actually used.
  • Rotary Positional Embeddings (RoPE): The modern standard. Instead of adding vectors, RoPE rotates the query and key vectors based on their position.
  • 2D Spatial Handling: To handle 2D images, researchers use Axial RoPE (separate channel groups for the X and Y axes) or Mixed RoPE (rotation angles that combine both axes, allowing interaction between them). The lecture presents Mixed RoPE as more efficient and as avoiding artifacts caused by axis segregation.
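
The key property of RoPE, that the query-key dot product depends only on the offset between positions, can be checked numerically. This is a minimal sketch, not the lecture's code: the function names are hypothetical, the base of 10000 is the commonly used default, and the axial variant assumes the vector length is divisible by 4 so that each half splits into pairs.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles position * theta_i (1D RoPE)."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = base ** (-i / dim)  # rotation frequency for this pair
        angle = position * theta
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_a - y * sin_a, x * sin_a + y * cos_a])
    return out


def axial_rope_rotate(vec, x_pos, y_pos):
    """Axial 2D RoPE: the first half of the channels encodes x, the second half y."""
    half = len(vec) // 2
    return rope_rotate(vec[:half], x_pos) + rope_rotate(vec[half:], y_pos)


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]

# Same offset (2) at different absolute positions gives the same score.
near = dot(rope_rotate(q, 3), rope_rotate(k, 1))
far = dot(rope_rotate(q, 12), rope_rotate(k, 10))
print(abs(near - far) < 1e-9)  # True

# Axial 2D case: same (dx, dy) = (3, 4) offset at two different locations.
q2, k2 = q + k, k + q
a = dot(axial_rope_rotate(q2, 5, 7), axial_rope_rotate(k2, 2, 3))
b = dot(axial_rope_rotate(q2, 10, 4), axial_rope_rotate(k2, 7, 0))
print(abs(a - b) < 1e-9)  # True
```

In the axial variant each channel pair sees only one coordinate, which is the axis segregation that Mixed RoPE is meant to relax.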

4. Key Arguments and Evidence

  • Scalability: The DiT paper demonstrates that increasing the transformer size and decreasing the patch size (finer granularity, hence more tokens) both improve sample quality. Compute is measured in FLOPs (Floating Point Operations) and quality in FID (Fréchet Inception Distance, lower is better); models that spend more FLOPs tend to reach lower FID.
  • Inductive Bias: Convolutions have a strong inductive bias (local scanning), which is useful but limiting. Transformers remove this bias, allowing for more flexible, global interactions, which is necessary for complex, high-resolution generation.
  • Modulation Necessity: The lecture argues that simple concatenation of conditions is insufficient. Modulation (AdaLN) allows the model to dynamically highlight specific dimensions (e.g., "brownness" or "fluffiness") based on the current noise level and prompt.
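
The trade-off between patch size and compute can be made concrete with a little arithmetic. This sketch assumes a square 32 x 32 latent purely for illustration (the summary does not state a latent size), and counts query-key pairs as a rough proxy for the cost of one self-attention layer; the function names are hypothetical.

```python
def num_tokens(latent_size, patch_size):
    """Sequence length for a square latent split into non-overlapping square patches."""
    if latent_size % patch_size:
        raise ValueError("latent_size must be divisible by patch_size")
    return (latent_size // patch_size) ** 2


def attention_pairs(latent_size, patch_size):
    """Number of query-key pairs in one full self-attention layer (quadratic in tokens)."""
    return num_tokens(latent_size, patch_size) ** 2


for p in (8, 4, 2):
    print(p, num_tokens(32, p), attention_pairs(32, p))
# 8 16 256
# 4 64 4096
# 2 256 65536
```

Halving the patch size quadruples the number of tokens and multiplies the attention pairs by sixteen, which is why finer patches cost far more FLOPs.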

5. Step-by-Step Generation Process

  1. Initialization: Sample a latent $Z_0$ from a Gaussian distribution.
  2. Patchification: Divide the latent into patches and project them into embeddings.
  3. Conditioning: Embed the time-step and the text prompt as conditioning signals; inject positional information into the patch tokens.
  4. Iterative Denoising: Pass the sequence through DiT blocks (using joint/cross-attention and AdaLN) to predict the velocity/noise.
  5. Solver Step: Use an ODE/Euler solver to update the latent ($Z_{t+\Delta t} = Z_t + \text{velocity} \cdot \Delta t$). Repeat steps 4 and 5 until $t = 1$.
  6. Decoding: Once the final latent $Z_1$ is reached, pass it through a VAE decoder to reconstruct the pixel-space image.
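
The sampling loop in steps 1, 4, and 5 can be sketched with a fixed-step Euler integrator. This is an illustration under stated assumptions, not the lecture's code: `toy_velocity` is a hypothetical stand-in for a trained DiT (it simply points at a fixed `TARGET` so the result is easy to verify), patchification and conditioning are omitted, and the VAE decoding of step 6 is left out.

```python
import random


def euler_sample(velocity_fn, z0, num_steps):
    """Integrate dz/dt = velocity_fn(z, t) from t = 0 to t = 1 with fixed Euler steps."""
    dt = 1.0 / num_steps
    z = list(z0)
    for step in range(num_steps):
        t = step * dt
        v = velocity_fn(z, t)  # in a real model: one forward pass of the DiT
        z = [zi + vi * dt for zi, vi in zip(z, v)]
    return z


# Hypothetical stand-in for a trained model: a velocity field that points
# straight at a fixed target, so the result can be checked by hand.
TARGET = [1.0, -2.0, 0.5]


def toy_velocity(z, t):
    return [(target_i - zi) / (1.0 - t) for target_i, zi in zip(TARGET, z)]


rng = random.Random(0)
z0 = [rng.gauss(0.0, 1.0) for _ in TARGET]  # step 1: Gaussian initial latent
z1 = euler_sample(toy_velocity, z0, num_steps=10)
print(z1)  # approximately TARGET, up to floating-point error
```

With a learned velocity field the trajectory is not a straight line, so the number of steps trades sampling speed against integration error.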

Synthesis

The evolution of image generation models has moved from convolution-heavy U-Nets to transformer-based architectures that prioritize global context and multi-modal integration. The shift toward MMDiT and RoPE reflects a move toward more flexible, scalable, and mathematically interpretable systems. The core challenge remains balancing computational efficiency (patch size/FLOPs) with the ability to accurately interpret complex, multi-modal prompts.
