Any-to-Any: Building Native Multimodal Agents - Patrick Löber, Google DeepMind
By AI Engineer
Key Concepts
- Any-to-Any Multimodality: The capability of AI models to process and generate multiple data types (text, code, image, audio, video).
- Agentic Architecture: A system where a reasoning model (Gemini) autonomously decides which tools to call to complete a task, rather than following a hard-coded pipeline.
- Function/Tool Calling: A mechanism allowing Gemini to trigger external specialized models (e.g., image or speech generators) based on its reasoning.
- Context Caching: An API feature that stores processed long-context data to reduce costs by up to 90% for repeated queries.
- Native Multimodal Generation: Models trained to understand the world deeply, allowing for context-aware generation (e.g., correcting math homework with visual annotations).
- Live API (Audio-to-Audio): A single-architecture model that processes audio input and generates audio output directly, bypassing traditional cascaded pipelines.
1. Multimodal Understanding with Gemini
Gemini models (specifically Gemini 3) act as the "brain" for understanding diverse inputs.
- Input Capabilities: The API supports PDFs, images, videos, audio files, URLs, and Google Search.
- Technical Specifications:
- Audio Processing: 1 minute of audio equals ~1,920 tokens. With a 1-million token limit, the model can process over 9 hours of audio.
- Video Processing: Supports approximately 1 hour of video content.
- Efficiency: Developers can use the File API for large uploads and implement Context Caching to optimize costs for repeated analysis of large files.
- Transcription: Gemini Flash is highly effective at transcribing audio directly via prompt instructions.
2. Agentic Framework for Multimodal Generation
The speaker proposes building a "Notebook LM clone" using an agentic loop rather than a static workflow.
- The Process:
- Reasoning Phase: Gemini analyzes the input (PDFs, videos, etc.) to synthesize information.
- Decision Phase: The agent determines which sections require visual aids (infographics) or audio summaries.
- Tool Execution: The agent calls specialized models (e.g., Nano Banana 2 for images, Gemini 2.5-based models for speech) to generate the required assets.
- Implementation: Developers define function declarations (name, description, parameters) and pass them to the
client.models.generate_contentmethod. The agent then decides when to invoke these functions based on the prompt.
3. Specialized Generation Models
- Image Generation: Uses "Nano Banana 2." Because it is built on Gemini, it possesses "world knowledge," allowing it to generate accurate images based on visual context (e.g., drawing on maps or correcting math problems with visual feedback).
- Speech Generation: Supports multi-speaker audio (podcast style) and can mimic specific accents (e.g., Bavarian) and tones.
- Live API: A new "Audio-to-Audio" model (Gemini 3.1 Flash Live) that enables real-time, low-latency, natural-sounding interactions without the need for separate transcription and synthesis steps.
4. Practical Application: Building the Agent
To build the agent, the speaker outlines the following steps:
- Setup: Obtain an API key from ai.studio and install the Google AI SDK.
- Data Ingestion: Upload files (PDFs, MP3s, videos) using the File API or inline data.
- Prompt Engineering: Provide the model with a clear persona (e.g., "Research Agent Partner") and define the available tools (function calls).
- Looping: Allow the model to reason about whether the generated assets are sufficient or if further generation is required.
5. Notable Quotes
- "We want to build this as an agent rather than a workflow... the agent should be able to decide what to create rather than where we hard code the pipeline." — Patrick, Google DeepMind.
- "Native generation matters because these models understand the world... it can even generate code on images for you."
6. Synthesis and Conclusion
The transition from simple text-based LLMs to "any-to-any" multimodal agents represents a shift toward more autonomous, context-aware AI systems. By leveraging Gemini’s reasoning capabilities alongside specialized generation models and the Live API, developers can create sophisticated applications that process complex, multi-source data and produce rich, interactive outputs. The key takeaway is the move away from rigid, linear pipelines toward agentic loops where the model dynamically manages the creation of content based on the user's specific needs.
Chat with this Video
AI-PoweredLoad the transcript when you're ready to chat so the initial page stays lighter.