Stanford CME296 Diffusion & Large Vision Models | Spring 2026 | Lecture 6 - Model Training
By Unknown Author
Key Concepts
- Diffusion Transformer (DiT): An architecture using self-attention to model global dependencies in image generation.
- Flow Matching: A framework interpreting image generation as a transport problem (vector field regression) rather than iterative denoising.
- Logit-Normal Distribution: A sampling strategy for time steps ($t$) that emphasizes "hard" middle-range noise levels over easy early/late stages.
- Time-Step Shifting: A technique to rescale noise levels based on image resolution to maintain consistent perceived noise.
- REPA (Representation Alignment): A method to speed up training by aligning internal model representations with pre-trained encoders.
- Curriculum Learning: Progressively increasing the difficulty of training data (e.g., resolution, aspect ratio, prompt complexity).
- Preference Tuning: Methods (Reward Feedback Learning, Flow-GRPO, Diffusion DPO) to align model outputs with human aesthetic preferences.
- Distillation: Techniques (Progressive Distillation, InstaFlow, Consistency Models, ADD) to reduce the number of inference steps required for generation.
- LoRA (Low-Rank Adaptation): A parameter-efficient fine-tuning method using low-rank matrices to update model weights.
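To make the LoRA entry concrete, here is a minimal NumPy sketch of a linear layer with a low-rank update. It is not taken from the lecture: the class name `LoRALinear`, the `rank` and `alpha` defaults, and the initialization scale are illustrative assumptions. The idea it shows is that the pre-trained weight stays frozen while only two small matrices are trained.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A.

    Hypothetical illustration; names and defaults are assumptions.
    """

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                               # frozen, shape (d_out, d_in)
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))  # trainable, small random init
        self.B = np.zeros((d_out, rank))                   # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # x has shape (batch, d_in). The low-rank path adds scale * x A^T B^T.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def num_trainable(self):
        # Only A and B would receive gradient updates.
        return self.A.size + self.B.size
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and for a 64x32 weight with rank 4 the trainable parameter count is 384 instead of 2048.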
1. The Training Life Cycle
The process of creating a production-ready text-to-image model is divided into three distinct phases:
- Pre-training: Teaching the model to generate general images. This is compute-intensive and relies heavily on large-scale data pipelines.
- Post-training: Refining the model to generate "good" images. This includes Continued Training (domain-specific knowledge) and Supervised Fine-Tuning (aesthetic/instruction following).
- Tuning (Personalization): Adapting the model to specific subjects (e.g., DreamBooth) using rare tokens and prior preservation loss to prevent catastrophic forgetting.
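The prior preservation idea in the personalization phase can be sketched as a two-term loss. This is a simplified illustration, not the lecture's code: the function name `dreambooth_loss`, its arguments, and the `prior_weight` default are assumptions, and the predictions and targets stand in for whatever quantity the model regresses (noise or velocity).

```python
import numpy as np

def dreambooth_loss(pred_instance, target_instance,
                    pred_prior, target_prior, prior_weight=1.0):
    """Subject loss plus a weighted prior-preservation term (hypothetical helper).

    The instance term fits the few images of the new subject (prompted with a
    rare token). The prior term is computed on images of the generic class
    generated by the original model, which discourages forgetting that class.
    """
    instance_loss = np.mean((pred_instance - target_instance) ** 2)
    prior_loss = np.mean((pred_prior - target_prior) ** 2)
    return instance_loss + prior_weight * prior_loss
```

Setting `prior_weight` to zero recovers plain fine-tuning on the subject images, which is the setting where forgetting is most likely.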
2. Loss Functions and Training Optimization
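- Flow Matching Loss: Presented as the current standard choice. It regresses the vector field (velocity) that transports noise ($X_0$) to the target ($X_1$); for the straight-line path $X_t = (1 - t) X_0 + t X_1$, the regression target is $X_1 - X_0$.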
- Time-Step Sampling: Instead of uniform sampling, the Logit-Normal distribution is used to focus training on middle time steps ($t \approx 0.5$), which are identified as the most "difficult" for the model to learn.
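- Resolution-Aware Shifting: At the same nominal noise level, a low-resolution image appears noisier than a high-resolution one, because more pixels carry more redundant signal. Time steps are therefore shifted toward higher noise as resolution grows, so that the perceived noise stays consistent across resolutions.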
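The three ideas in this section fit together in a single training step, sketched below in NumPy. This is an illustration under stated assumptions rather than the lecture's implementation: the function names are hypothetical, the convention is $t = 0$ for noise and $t = 1$ for data (matching $X_0$ and $X_1$ above), the path is the straight line between them, and the shift formula is one commonly used form in which `shift` would grow with resolution.

```python
import numpy as np

def sample_logit_normal(rng, n, mean=0.0, std=1.0):
    """Draw t = sigmoid(u) with u ~ N(mean, std^2); mass concentrates near 0.5."""
    u = rng.normal(mean, std, size=n)
    return 1.0 / (1.0 + np.exp(-u))

def shift_toward_noise(t, shift):
    """Resolution-aware shift (assumed form).

    With t = 1 as clean data, the noise level is sigma = 1 - t. The shifted
    level is shift * sigma / (1 + (shift - 1) * sigma); shift > 1 adds noise.
    """
    sigma = 1.0 - t
    sigma_shifted = shift * sigma / (1.0 + (shift - 1.0) * sigma)
    return 1.0 - sigma_shifted

def flow_matching_loss(velocity_fn, x1, t, rng):
    """MSE between the predicted velocity and the straight-line target x1 - x0."""
    x0 = rng.standard_normal(x1.shape)            # noise endpoint
    t_b = t.reshape(-1, *([1] * (x1.ndim - 1)))   # broadcast t over feature dims
    xt = (1.0 - t_b) * x0 + t_b * x1              # point on the straight path
    target = x1 - x0                              # constant velocity along the path
    pred = velocity_fn(xt, t)
    return np.mean((pred - target) ** 2)
```

In use, `velocity_fn` would be the network being trained, `t` would come from `sample_logit_normal` followed by `shift_toward_noise`, and the returned scalar would be minimized by gradient descent.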
3. Preference Tuning and Alignment
To ensure models align with human preferences, several methods are employed:
- Reward Feedback Learning: Uses a differentiable reward model to backpropagate scores into the generation model.
- Flow-GRPO (Group Relative Policy Optimization): Leverages SDEs to generate diverse samples for a prompt, then updates the policy based on each sample's reward relative to the others in that group.
- Diffusion DPO: Directly optimizes the model on preference pairs, increasing the probability of the "winning" (preferred) image and decreasing the probability of the "losing" (rejected) image by adjusting velocity predictions relative to a frozen reference model.
- Prompt Enhancement: A "waiter" mechanism that expands simple user prompts into detailed, high-quality prompts that match the model's training distribution.
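Two of the quantities behind these methods are small enough to sketch directly. The code below is a hedged illustration, not the lecture's implementation: the function names and the `beta` default are assumptions, the advantage is the simple group-normalized form, and the DPO loss is a simplified version that takes per-sample squared prediction errors as inputs instead of running a model.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one prompt's group of samples.

    Samples that beat the group mean get a positive advantage, the rest a
    negative one, so no separate value network is needed.
    """
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def diffusion_dpo_loss(err_win, err_win_ref, err_lose, err_lose_ref, beta=1.0):
    """Simplified DPO-style loss on squared prediction errors (assumed form).

    err_win / err_lose: errors of the model being tuned on the preferred and
    rejected images. err_*_ref: the same errors under a frozen reference model.
    """
    err_win, err_win_ref = np.asarray(err_win), np.asarray(err_win_ref)
    err_lose, err_lose_ref = np.asarray(err_lose), np.asarray(err_lose_ref)
    diff = (err_win - err_win_ref) - (err_lose - err_lose_ref)
    # -log(sigmoid(-beta * diff)) written stably as log(1 + exp(beta * diff)).
    return np.mean(np.logaddexp(0.0, beta * diff))
```

When the tuned model equals the reference, `diff` is zero and the loss equals `log(2)`; it drops below that value as the model fits preferred images better, or rejected images worse, than the reference does.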
4. Distillation for Efficiency
To reduce latency and computational costs, distillation techniques aim to reduce the number of iterative steps:
- Progressive Distillation: Iteratively halves the number of steps required, with a student model learning to mimic the teacher's output in fewer passes.
- InstaFlow: Combines Reflow (straightening the probability flow paths) with distillation to enable single-step generation.
- Consistency Models (including Latent Consistency Models, LCM): Train the model to map any point on a trajectory directly to the final image ($X_1$), allowing generation in one or a few steps.
- Adversarial Distillation (ADD): Uses GAN-like objectives to ensure generated images remain crisp, moving beyond the "regression to the mean" often caused by simple MSE losses.
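The step-halving idea behind Progressive Distillation can be sketched for a velocity-based (flow) model with Euler integration. This is an assumption-laden illustration: the function names are hypothetical, and the original method was formulated for diffusion samplers, so the code below is an adaptation of the idea, not a reproduction of it.

```python
import numpy as np

def euler_step(velocity_fn, x, t, dt):
    """One Euler step of size dt along the velocity field."""
    return x + dt * velocity_fn(x, t)

def progressive_distillation_target(teacher_fn, x, t, dt):
    """Velocity the student should predict at (x, t) so that ONE step of size
    dt lands where the teacher lands after TWO steps of size dt / 2."""
    x_mid = euler_step(teacher_fn, x, t, dt / 2.0)
    x_end = euler_step(teacher_fn, x_mid, t + dt / 2.0, dt / 2.0)
    return (x_end - x) / dt
```

Training the student to regress onto this target halves the step count; repeating the procedure with the student as the new teacher halves it again.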
5. Notable Perspectives
- The "Painter" Analogy: Shervin compares distillation to a master painter teaching a student; asking for a single-stroke masterpiece is impossible, so the student must learn through granular, iterative steps.
- The "Restaurant" Analogy: Prompt enhancement is compared to a waiter rephrasing a customer's vague request ("I want meat") into a specific order ("medium-rare steak with salt") that the kitchen (the model) is trained to execute perfectly.
- The "Book" Analogy: REPA is described as giving a student a textbook to learn a topic faster, rather than forcing them to learn from scratch.
Synthesis
The transition from theoretical foundations to practical training involves a shift toward Flow Matching and Latent Space operations. The current trend emphasizes efficiency (distillation) and alignment (preference tuning). While pre-training provides the foundational knowledge, the "magic" of modern models lies in the post-training and distillation phases, which transform raw generative capabilities into high-fidelity, user-aligned, and computationally efficient production tools.