Building Generative Image & Video models at Scale - Sander Dieleman (Veo and Nano Banana)
By AI Engineer
Key Concepts
- Diffusion Models: A generative modeling paradigm that creates data by learning to reverse a gradual noise-corruption process.
- Latent Representation: A compressed, lower-dimensional version of data (e.g., images/video) used to make training computationally feasible.
- Denoising: The core mechanism where a neural network predicts and removes noise from a signal to reconstruct the original data.
- Guidance: A technique to trade off sample diversity for higher quality by amplifying the difference between conditional and unconditional predictions.
- Distillation: Training a faster model to reproduce the outputs of a slower sampler, reducing the number of sampling steps required to generate a high-quality output.
- Spectral Auto-regression: An intuitive way to view diffusion as a process that generates data from coarse (low-frequency) to fine (high-frequency) details.
1. Data Curation and Representation
- Data Curation: Sander emphasizes that data quality is often more critical than model architecture or optimizer tuning. In large-scale generative modeling, manual inspection and curation of datasets are essential "secret sauce" components.
- Latent Space Compression: Training directly on high-resolution pixels is memory-prohibitive. Instead, researchers use Autoencoders (Encoder-Decoder networks) to compress data into a latent space.
- Mechanism: The encoder maps the input to a compact latent grid (e.g., a 256×256-pixel image to a 32×32 latent grid), and the decoder maps latents back to pixels.
- Benefit: Because the latent is still a grid, this preserves spatial structure while reducing memory usage by orders of magnitude, allowing for efficient training on video and high-res images.
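To make the size of the saving concrete, here is a rough back-of-the-envelope sketch in Python. The grid sizes are the ones quoted above; the channel counts (3 for RGB, 8 for the latent) are assumptions chosen for illustration, not figures from the talk.

```python
def element_count(height, width, channels):
    """Number of scalar values in a height x width x channels grid."""
    return height * width * channels

# Grid sizes from the text; channel counts are assumptions for illustration.
pixels = element_count(256, 256, 3)   # RGB image
latents = element_count(32, 32, 8)    # hypothetical 8-channel latent

position_ratio = (256 * 256) // (32 * 32)  # how many times fewer grid positions
value_ratio = pixels / latents             # how many times fewer scalar values
# Full self-attention compares every position with every other position,
# so its cost shrinks with the square of the position ratio.
attention_pair_ratio = position_ratio ** 2

print(f"pixel values:  {pixels}")
print(f"latent values: {latents}")
print(f"positions: {position_ratio}x fewer, values: {value_ratio:.0f}x fewer")
print(f"attention pairs: {attention_pair_ratio}x fewer")
```

Under these assumptions the number of stored values shrinks far less than the number of grid positions, but any cost that grows quadratically with the number of positions shrinks much more, which is where the largest savings come from.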
2. The Diffusion Mechanism
- Corruption Process: Information is destroyed by gradually adding Gaussian noise to the data.
- Denoising Process: The model acts as a denoiser. Many different clean images are consistent with the same noisy input, so the model cannot recover the original exactly; its prediction instead points toward a "region" of plausible images.
- Iterative Refinement: Sampling involves taking small, incremental steps toward the predicted region. Adding a small amount of noise back during these steps helps prevent the accumulation of errors caused by the neural network's imperfections.
- Frequency Analysis: Diffusion acts as "spectral auto-regression." In the Fourier transform of a natural image, most of the energy sits at low frequencies, whereas Gaussian noise has roughly equal energy at every frequency. As a result, noise obscures high-frequency details first and low-frequency structure last. Diffusion models effectively reconstruct images by starting with low-frequency global structures and progressively adding high-frequency details.
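The frequency argument can be illustrated with a small Python sketch. It assumes the clean signal's power falls off as 1/f² (a common idealization of natural images, not a figure from the talk) and that the noise is white, so its power is the same at every frequency. The function names and noise levels are hypothetical.

```python
def signal_power(freq, exponent=2.0):
    """Assumed power spectrum of clean data: power falls off as 1 / freq**exponent."""
    return 1.0 / freq ** exponent

def highest_visible_frequency(noise_std, freqs):
    """Largest frequency whose signal power still exceeds the flat noise power."""
    noise_power = noise_std ** 2  # white Gaussian noise: equal power at every frequency
    visible = [f for f in freqs if signal_power(f) > noise_power]
    return max(visible) if visible else 0

freqs = range(1, 129)
noise_levels = (0.012, 0.03, 0.13, 0.6)  # increasing noise, as in the corruption process
cutoffs = [highest_visible_frequency(s, freqs) for s in noise_levels]

for noise_std, cutoff in zip(noise_levels, cutoffs):
    print(f"noise std {noise_std}: frequencies up to {cutoff} stay above the noise")
```

As the noise level rises, the cutoff moves toward lower frequencies. Read in reverse, sampling uncovers low frequencies first and high frequencies last, which is the coarse-to-fine ordering described above.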
3. Architecture and Training
- U-Nets vs. Transformers: While U-Nets were the standard for early diffusion models, Transformers are increasingly used due to their scalability. Unlike autoregressive LLMs, diffusion Transformers do not require causal masks: every token can attend to every other token (fully bidirectional attention).
- Video Generation: Modern models treat the 3D volume (height × width × time) as a single entity to be denoised. Hybrid approaches, such as Genie, use auto-regression in time combined with diffusion for individual frame generation.
- Scaling: Training at scale requires both data parallelism (splitting batches across chips) and model parallelism (sharding the model across chips). Sander notes that JAX is preferred at Google for its efficient handling of TPU interconnects and automatic sharding.
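The difference between causal and bidirectional attention mentioned above can be sketched with plain Python lists. The tiny video shape and the helper names are hypothetical; a real model would build these masks as arrays inside its attention layers.

```python
def attention_mask(num_tokens, causal):
    """Return a num_tokens x num_tokens boolean mask.

    mask[q][k] is True when query token q may attend to key token k.
    """
    return [[(k <= q) if causal else True for k in range(num_tokens)]
            for q in range(num_tokens)]

def allowed_pairs(mask):
    """Count the (query, key) pairs the mask permits."""
    return sum(sum(row) for row in mask)

# Hypothetical tiny video latent: 2 frames of 2 x 2 positions, flattened to 8 tokens.
frames, height, width = 2, 2, 2
num_tokens = frames * height * width

causal_mask = attention_mask(num_tokens, causal=True)    # LLM-style: no looking ahead
full_mask = attention_mask(num_tokens, causal=False)     # diffusion-style: see everything

print("causal pairs:       ", allowed_pairs(causal_mask))
print("bidirectional pairs:", allowed_pairs(full_mask))
```

With the full mask, a token in the first frame can attend to tokens in later frames, which is what treating the whole height × width × time volume as a single object to denoise requires.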
4. Guidance and Control
- Guidance Scale: A hyperparameter that allows users to trade off diversity for quality. High guidance scales produce images that are more faithful to the prompt but less diverse.
- Mechanism: Guidance amplifies the difference ($\Delta$) between an unconditional prediction and a conditional prediction (e.g., with a text prompt). This "pushes" the sampling trajectory toward the desired semantic region.
- Control Signals: Beyond text, researchers are exploring reference-based generation (e.g., using a photo of a person) and explicit controls like camera motion or event timing. These are often introduced during post-training phases.
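The guidance mechanism described above can be written in a few lines. The sketch below uses the combination rule commonly associated with classifier-free guidance; the summary does not give an explicit formula, so treat the exact form, the function name, and the toy numbers as assumptions.

```python
def apply_guidance(uncond_pred, cond_pred, guidance_scale):
    """Move from the unconditional prediction toward the conditional one,
    scaling the difference between them by guidance_scale."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond_pred, cond_pred)]

# Toy 2-value "predictions"; a real model would output a full image or latent.
uncond = [0.0, 1.0]   # prediction without the prompt
cond = [1.0, 1.0]     # prediction with the prompt

for scale in (0.0, 1.0, 3.0):
    print(scale, apply_guidance(uncond, cond, scale))
```

A scale of 0 returns the unconditional prediction, 1 returns the conditional prediction unchanged, and values above 1 overshoot in the direction the prompt pulls, which is the amplification that trades diversity for prompt fidelity.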
5. Distillation and Sampling
- Consistency Models: A form of distillation that aims to reduce the number of sampling steps. Instead of predicting the next step in a trajectory, the model is trained to predict the final endpoint, potentially allowing for single-step generation.
- Deterministic vs. Stochastic: While stochastic sampling (adding noise) is robust to errors, deterministic sampling is often required for specific distillation techniques and provides a consistent mapping between noise and output.
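A one-dimensional toy makes both ideas concrete. The sketch below assumes the "dataset" is a single Gaussian, so the ideal denoiser and the trajectory endpoint are available in closed form; the constants, the linear noise schedule, and all function names are illustrative assumptions rather than anything specified in the talk.

```python
import math
import random

DATA_MEAN, DATA_STD = 2.0, 0.5  # toy 1-D "dataset": a single Gaussian (assumption)
SIGMA_MAX = 10.0                # noise level at the start of sampling (assumption)

def ideal_denoiser(x, sigma):
    """Exact posterior mean of the clean value when x = clean + sigma * noise."""
    shrink = DATA_STD**2 / (DATA_STD**2 + sigma**2)
    return DATA_MEAN + shrink * (x - DATA_MEAN)

def deterministic_sample(x, num_steps=200):
    """Euler steps from SIGMA_MAX down to 0, with no fresh noise added."""
    sigmas = [SIGMA_MAX * (1 - i / num_steps) for i in range(num_steps + 1)]
    for sigma, next_sigma in zip(sigmas, sigmas[1:]):
        direction = (x - ideal_denoiser(x, sigma)) / sigma
        x = x + (next_sigma - sigma) * direction
    return x

def endpoint_map(x, sigma):
    """Closed-form trajectory endpoint for this toy. A consistency model would
    be trained to approximate this kind of map with a neural network."""
    scale = DATA_STD / math.sqrt(DATA_STD**2 + sigma**2)
    return DATA_MEAN + scale * (x - DATA_MEAN)

rng = random.Random(0)
start = DATA_MEAN + math.sqrt(DATA_STD**2 + SIGMA_MAX**2) * rng.gauss(0.0, 1.0)

print("200 Euler steps:", deterministic_sample(start))
print("one-step map:   ", endpoint_map(start, SIGMA_MAX))
```

Because no noise is injected during sampling, the same starting noise always produces the same output, and the multi-step result lands close to the one-step endpoint map. That fixed noise-to-output mapping is what endpoint-predicting distillation methods rely on.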
Synthesis and Conclusion
The talk highlights that modern generative media is moving away from raw pixel-based training toward latent-space diffusion. The success of these models relies on a combination of high-quality data curation, efficient latent representations, and the guidance trick, which allows models to "punch above their weight" in terms of quality. While diffusion models are currently the industry standard for audiovisual data, the field is actively evolving through distillation techniques to reduce computational latency and through advanced conditioning methods to provide users with finer control over generated content.