Build real-time multimodal agents with Gemini and Pipecat
By Google for Developers
Key Concepts
- Pipekit: An open-source, vendor-neutral framework for real-time, multimodal agent orchestration.
- Gemini 3: The latest multimodal real-time model featuring improved instruction following, enhanced tool-calling capabilities, and native Google Search grounding.
- Multimodal Real-time Interaction: The ability for an AI agent to process voice, text, and data simultaneously in a low-latency environment.
- Agent Orchestration: The process of managing multiple specialized AI agents (e.g., a concierge and a language tutor) within a single application.
- Thinking Level: A configuration parameter that controls the model's internal reasoning process; setting it to "minimal" reduces latency for real-time voice interactions.
1. Main Topics and Technical Implementation
The video demonstrates building a sophisticated travel-planning voice agent using Pipekit and Gemini 3.
- Scaffolding: Developers use the Pipekit CLI (
uv tool install pipekit-aicli) and thepipekit initcommand to generate the bot structure. - Infrastructure: Pipekit is vendor-neutral, allowing deployment on Google Cloud Platform or local infrastructure without external dependencies.
- Performance Optimization: By setting the "thinking level" to "minimal," the developer significantly reduces the time-to-first-token, ensuring the voice agent responds quickly during natural conversation.
- Instruction Following: Gemini 3 demonstrates high stability in maintaining context over long (15+ minute) conversations, reducing the need for complex state-machine "flows" in favor of robust system prompts.
2. Tool Calling and Integration
The agent utilizes specific function handlers to bridge the gap between the LLM and external data:
- Flight/Lodging Search: The developer defines
search_flightsandsearch_lodgingschemas. These are registered with the LLM, allowing the model to trigger specific Python functions when the user requests information. - Google Grounding: Enabled by simply toggling the feature, allowing the model to access real-time web data via Google Search.
- Multimodality: The agent can output structured data (Markdown) to save trip reports to disk, which can then be re-read into the context to maintain state across different sessions.
3. Multi-Agent Framework
The developer introduces the PyICat agents module to manage complex interactions:
- Architecture: The framework uses a shared message bus to facilitate communication between multiple agents.
- Implementation: A
GeminiLiveAgentbase class is used to create specialized subclasses:- Concierge Agent: Handles travel logistics, flight searches, and lodging.
- Language Tutor Agent: Dedicated to teaching the user Italian phrases.
- Transfer Logic: The
handle_transferfunction allows the system to switch between agents seamlessly while maintaining the conversation context.
4. Real-World Application: Travel Planning
The case study involves planning a trip to Italy for five people.
- Process: The agent gathers requirements (destination, dates, group size, preferences like "central location" or "pool access"), performs mock searches, and filters results based on user feedback.
- Data Handling: The agent successfully manages constraints (e.g., finding hotels with pools) and saves the final itinerary as a persistent file.
5. Notable Quotes
- "I know it sounds kind of funny to ask the model to think as little as possible when it talks to us, but this has a really big impact on how quickly the bot will answer us." — Chad Bailey, on optimizing Gemini 3 for real-time voice.
- "This is important for any voice agent that needs to do anything other than talking, which is pretty much any voice agent that’s actually worth using." — On the necessity of robust tool-calling capabilities.
6. Synthesis and Conclusion
The integration of Gemini 3 into the Pipekit framework represents a shift toward more capable, low-latency voice agents. By leveraging minimal thinking levels for speed, native tool calling for data retrieval, and multi-agent orchestration for specialized tasks, developers can build highly complex, stateful applications. The framework’s open-source, vendor-neutral nature provides a flexible foundation for deploying these agents across various platforms, from web interfaces to telephony systems.
Chat with this Video
AI-PoweredHi! I can answer questions about this video "Build real-time multimodal agents with Gemini and Pipecat". What would you like to know?