Let's go Bananas with GenMedia — Guillaume Vernade, Google DeepMind
By AI Engineer
Key Concepts
- Generative Media (Gen Media): Google DeepMind’s suite of models capable of creating and understanding various modalities (text, image, video, audio).
- Multimodality: The ability of a model to ingest and output multiple types of data (e.g., text-to-video, image-to-audio).
- World Models: AI systems designed to understand the world by processing diverse sensory inputs (audio, video, sensors) and generating corresponding outputs.
- Developer Advocate: A role focused on bridging the gap between internal engineering teams and external developers by providing documentation, code samples, and feedback.
- API Tiers: Different service levels for API usage, including Standard, Flex (50% discount, delayed processing), and Priority (2x cost, faster processing).
- Stateful vs. Stateless APIs: The shift from stateless calls toward the "Interactions API," whose stateful sessions keep context on the server so that clients do not need to re-upload data with every request.
- Prompt Engineering: The practice of using detailed, structured instructions to guide model output, including style, character consistency, and technical parameters (BPM, scale, duration).
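The tier multipliers listed under "API Tiers" (Flex at a 50% discount, Priority at 2x cost) reduce to simple arithmetic. The sketch below assumes a hypothetical Standard-tier price per request; `estimate_cost`, `TIER_MULTIPLIER`, and the example price are illustrative names and numbers, not actual pricing.

```python
# Relative price multipliers as described in the summary:
# Standard is the baseline, Flex is 50% cheaper, Priority costs 2x.
TIER_MULTIPLIER = {"standard": 1.0, "flex": 0.5, "priority": 2.0}


def estimate_cost(base_price: float, requests: int, tier: str = "standard") -> float:
    """Estimate the total cost of a batch of requests on a given tier.

    base_price is a hypothetical Standard-tier price per request.
    """
    if tier not in TIER_MULTIPLIER:
        raise ValueError(f"unknown tier: {tier!r}")
    return base_price * requests * TIER_MULTIPLIER[tier]


if __name__ == "__main__":
    # Compare the same workload across the three tiers.
    for tier in TIER_MULTIPLIER:
        print(tier, estimate_cost(base_price=0.25, requests=200, tier=tier))
```

Under these assumptions, a non-urgent batch job is the natural candidate for Flex, while Priority only pays off when latency matters more than cost.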
1. Overview of Google DeepMind’s Gen Media Ecosystem
The speaker, a Developer Advocate at Google DeepMind, emphasizes that media (images, video, sound) is core to modern AI. DeepMind’s vision is to build a unified "World Model" that handles all modalities. The team currently ships separate, specialized models for release efficiency, but the underlying goal is a single multimodal architecture.
- Key Models Mentioned:
  - Gemini: Multimodal LLM (1.5 and 2.0 versions).
  - Nano Banana (Imagen): Image generation model with search and image grounding capabilities.
  - Veo (3.1): Video generation model; includes a "Light" version for cost-effective iteration.
  - Lyria: Music generation model (Clip for 30s, Full for 3m) and Lyria RealTime (a predictive model for live DJ-style mixing).
  - Gemma 4: Open-model series.
2. Methodologies and Frameworks
The presentation focused on a practical workshop using a Colab Notebook to illustrate a book (The Wind in the Willows).
- The "Cookbook" Approach: DeepMind maintains a GitHub repository ("Cookbook") containing quick-start guides and complex examples for developers.
- Workflow for Content Creation:
  - Context Loading: Uploading the source text to Gemini using the large context window.
  - Character Consistency: Using system instructions to define character appearances (e.g., clothing, physical traits) to ensure visual continuity across generated images.
  - Structured Output: Using JSON schemas to force the model to output consistent data structures (e.g., {"name": "...", "prompt": "..."}).
  - Video Generation: Using an image as the "first frame" for Veo to ensure the video starts with the correct visual context.
  - Audio/Music Integration: Using Gemini to write prompts for Lyria, ensuring the music matches the tone of specific chapters.
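The character-consistency and structured-output steps above can be sketched without calling any API. This minimal example assumes the model returns a JSON array of objects with the "name" and "prompt" keys from the workflow; the character descriptions, the sample response, and the helper names `build_system_instruction` and `parse_scenes` are hypothetical.

```python
import json

# Hypothetical character sheet, used as a system instruction so that every
# image prompt describes the characters the same way.
CHARACTER_SHEET = {
    "Mole": "small black mole, velvet waistcoat, round spectacles",
    "Rat": "brown water rat, striped boating jacket",
}


def build_system_instruction(characters: dict[str, str]) -> str:
    """Turn a character sheet into a single system instruction string."""
    lines = ["Always depict the characters exactly as described:"]
    lines += [f"- {name}: {look}" for name, look in characters.items()]
    lines.append("Do not add titles or text to the images.")
    return "\n".join(lines)


def parse_scenes(raw_json: str) -> list[dict[str, str]]:
    """Parse a model response and check that it has the expected shape."""
    scenes = json.loads(raw_json)
    if not isinstance(scenes, list):
        raise ValueError("expected a JSON array of scenes")
    for scene in scenes:
        if not isinstance(scene, dict) or set(scene) != {"name", "prompt"}:
            raise ValueError(f"unexpected scene shape: {scene!r}")
        if not all(isinstance(value, str) and value for value in scene.values()):
            raise ValueError(f"empty or non-string field in: {scene!r}")
    return scenes


if __name__ == "__main__":
    # Stand-in for a model response; no API call is made here.
    sample = '[{"name": "riverbank", "prompt": "Mole and Rat rowing at dawn"}]'
    print(build_system_instruction(CHARACTER_SHEET))
    print(parse_scenes(sample))
```

Validating the response on the client side is a design choice for this sketch: even when a schema is enforced by the model API, a cheap local check catches malformed data before it reaches the image or video step.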
3. Technical Insights and Best Practices
- API Management: The speaker highlighted the difference between AI Studio (easy testing), Vertex AI (enterprise-grade control), and the Gemini Developer API (middle ground).
- Cost Optimization:
  - Use the Flex tier for non-urgent tasks to save 50%.
  - Use the Priority tier for time-sensitive demos.
  - Iterate with smaller models (e.g., Veo Light) before upscaling to high-resolution versions.
- Prompting Strategy:
  - Length Matters: Longer, more descriptive prompts yield better results.
  - System Instructions: Essential for preventing unwanted behaviors (e.g., adding titles to book covers).
  - TTS Trick: To simulate multiple voices with a single TTS model, define a transcript format where the narrator and characters are assigned specific "speaking styles" within the prompt.
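The TTS trick above is largely a matter of prompt formatting. The following is a minimal sketch, assuming a single TTS model that reads one plain-text prompt; the speaker names, style descriptions, and the exact layout produced by the hypothetical `build_tts_prompt` helper are illustrative, not a format specified in the talk.

```python
def build_tts_prompt(styles: dict[str, str], lines: list[tuple[str, str]]) -> str:
    """Compose one TTS prompt that assigns a speaking style to each speaker.

    styles maps a speaker name to a style description;
    lines is the ordered transcript as (speaker, text) pairs.
    """
    # Every speaker in the transcript needs a declared style.
    unknown = {speaker for speaker, _ in lines} - set(styles)
    if unknown:
        raise ValueError(f"no speaking style defined for: {sorted(unknown)}")
    header = ["Read the transcript below. Use these speaking styles:"]
    header += [f"- {speaker}: {style}" for speaker, style in styles.items()]
    body = [f"{speaker}: {text}" for speaker, text in lines]
    return "\n".join(header + ["", "Transcript:"] + body)


if __name__ == "__main__":
    prompt = build_tts_prompt(
        styles={"Narrator": "calm and warm", "Toad": "boastful and excitable"},
        lines=[
            ("Narrator", "Toad leaned back in his chair."),
            ("Toad", "What a clever Toad I am!"),
        ],
    )
    print(prompt)
```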
4. Notable Quotes
- "Last year was everybody talking about agents. This year is the year where we are actually going to build agents."
- "The most important part of generating a video is generating the first frame so that it knows where to start."
- "The longer your prompt, the more interesting it’s going to be and the more likely it’s going to be following what you’re asking for."
5. Synthesis and Conclusion
The session demonstrated that modern generative media is no longer about isolated model calls but about orchestrating multiple models (Gemini for logic/prompting, Imagen for visuals, Veo for motion, and Lyria for audio). The key takeaway for developers is to leverage Gemini’s large context window to maintain narrative consistency and to use the "Cookbook" resources to learn how to chain these models effectively. The speaker also acknowledged ongoing challenges with data sovereignty and regional availability (specifically in Europe), noting that these are high-priority issues for the developer advocacy team.
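The chaining described above can be sketched as a small data flow. This example uses stand-in callables rather than real API clients, so every function name and signature here (`illustrate_chapter`, `write_prompts`, `make_image`, `make_video`, `make_music`) is hypothetical; the point is only the order of the steps and the fact that the generated image is handed to the video step as its first frame.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ChapterAssets:
    image: str
    video: str
    music: str


def illustrate_chapter(
    chapter_text: str,
    write_prompts: Callable[[str], dict[str, str]],
    make_image: Callable[[str], str],
    make_video: Callable[[str, str], str],
    make_music: Callable[[str], str],
) -> ChapterAssets:
    """Chain the steps: prompts, then image, then video and music."""
    prompts = write_prompts(chapter_text)        # language model writes all prompts
    image = make_image(prompts["image"])         # image model renders the scene
    video = make_video(prompts["video"], image)  # image is passed as the first frame
    music = make_music(prompts["music"])         # music prompt follows the chapter tone
    return ChapterAssets(image=image, video=video, music=music)


if __name__ == "__main__":
    # Stub callables that return placeholder strings instead of media.
    assets = illustrate_chapter(
        "Chapter text goes here.",
        write_prompts=lambda text: {"image": "scene", "video": "motion", "music": "tone"},
        make_image=lambda prompt: f"IMAGE({prompt})",
        make_video=lambda prompt, first_frame: f"VIDEO({prompt}, first_frame={first_frame})",
        make_music=lambda prompt: f"MUSIC({prompt})",
    )
    print(assets)
```

Passing the model calls in as parameters keeps the sketch independent of any particular SDK; real clients could be substituted for the stubs without changing the chaining logic.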