Build real-time multimodal agents with Gemini and Pipecat

By Google for Developers

Key Concepts

  • Pipecat: An open-source, vendor-neutral framework for real-time, multimodal agent orchestration.
  • Gemini 3: The latest multimodal real-time model featuring improved instruction following, enhanced tool-calling capabilities, and native Google Search grounding.
  • Multimodal Real-time Interaction: The ability for an AI agent to process voice, text, and data simultaneously in a low-latency environment.
  • Agent Orchestration: The process of managing multiple specialized AI agents (e.g., a concierge and a language tutor) within a single application.
  • Thinking Level: A configuration parameter that controls the model's internal reasoning process; setting it to "minimal" reduces latency for real-time voice interactions.

1. Main Topics and Technical Implementation

The video demonstrates building a travel-planning voice agent using Pipecat and Gemini 3.

  • Scaffolding: Developers install the Pipecat CLI (uv tool install pipecat-ai-cli) and run the pipecat init command to generate the bot structure.
  • Infrastructure: Pipecat is vendor-neutral, so the same bot can be deployed on Google Cloud Platform or on local infrastructure without being tied to a single provider.
  • Performance Optimization: By setting the "thinking level" to "minimal," the developer significantly reduces the time-to-first-token, ensuring the voice agent responds quickly during natural conversation.
  • Instruction Following: Gemini 3 demonstrates high stability in maintaining context over long (15+ minute) conversations, reducing the need for complex state-machine "flows" in favor of robust system prompts.
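
The "thinking level" setting described above can be pictured as one field on the agent's configuration. The following is a minimal sketch only: VoiceAgentSettings and ALLOWED_THINKING_LEVELS are hypothetical names invented for illustration, not Pipecat or Gemini identifiers, and the list of levels is an assumption that should be checked against the model documentation.

```python
from dataclasses import dataclass, asdict

# Assumption: illustrative set of levels; the real values come from the model docs.
ALLOWED_THINKING_LEVELS = ("minimal", "low", "medium", "high")


@dataclass(frozen=True)
class VoiceAgentSettings:
    """Hypothetical settings object; not a Pipecat or Gemini class."""

    system_prompt: str
    # "minimal" trades reasoning depth for a faster first response.
    thinking_level: str = "minimal"

    def __post_init__(self) -> None:
        # Reject typos early instead of silently sending an unknown value.
        if self.thinking_level not in ALLOWED_THINKING_LEVELS:
            raise ValueError(f"unknown thinking level: {self.thinking_level!r}")


settings = VoiceAgentSettings(system_prompt="You are a friendly travel concierge.")
config = asdict(settings)  # plain dict that could be handed to an LLM service
```

Keeping the level in one validated place makes it easy to raise it later for tasks where latency matters less than reasoning depth.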

2. Tool Calling and Integration

The agent uses function handlers to connect the LLM to external data:

  • Flight/Lodging Search: The developer defines search_flights and search_lodging schemas. These are registered with the LLM, allowing the model to trigger specific Python functions when the user requests information.
  • Grounding with Google Search: Enabled with a single setting, this lets the model draw on current web data via Google Search.
  • Multimodality: The agent can output structured data (Markdown) to save trip reports to disk, which can then be re-read into the context to maintain state across different sessions.
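
The schema-plus-handler pattern described above can be sketched in plain Python. This is an illustration under assumptions, not the code from the video: the parameter names, the mock results, and the dispatch_tool_call helper are hypothetical, and a real bot would register the schemas through the framework's own API rather than a dictionary.

```python
from typing import Any, Callable, Dict, List

# JSON-schema-style descriptions the model could use to decide when to call a tool.
# Assumption: the parameter names below are invented for this sketch.
SEARCH_FLIGHTS_SCHEMA = {
    "name": "search_flights",
    "description": "Search flights between two airports on a given date.",
    "parameters": {
        "type": "object",
        "properties": {
            "origin": {"type": "string"},
            "destination": {"type": "string"},
            "date": {"type": "string", "description": "ISO date, e.g. 2026-06-01"},
        },
        "required": ["origin", "destination", "date"],
    },
}

SEARCH_LODGING_SCHEMA = {
    "name": "search_lodging",
    "description": "Search lodging in a city for a group.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "guests": {"type": "integer"},
        },
        "required": ["city", "guests"],
    },
}


def search_flights(origin: str, destination: str, date: str) -> List[Dict[str, Any]]:
    """Mock handler: returns made-up data instead of calling a real flight API."""
    return [{"origin": origin, "destination": destination, "date": date, "price_usd": 640}]


def search_lodging(city: str, guests: int) -> List[Dict[str, Any]]:
    """Mock handler: returns made-up data instead of calling a real lodging API."""
    return [{"city": city, "guests": guests, "name": "Mock Hotel", "pool": True}]


# Map each schema name to the Python function that serves it.
TOOL_HANDLERS: Dict[str, Callable[..., Any]] = {
    "search_flights": search_flights,
    "search_lodging": search_lodging,
}


def dispatch_tool_call(name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Run the handler the model asked for and wrap the outcome for the model."""
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        return {"error": f"unknown tool: {name}"}
    try:
        return {"result": handler(**arguments)}
    except TypeError as exc:  # missing or unexpected arguments from the model
        return {"error": str(exc)}
```

Returning errors as data rather than raising lets the model read the failure and retry with corrected arguments.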

3. Multi-Agent Framework

The developer introduces the Pipecat agents module to manage complex interactions:

  • Architecture: The framework uses a shared message bus to facilitate communication between multiple agents.
  • Implementation: A GeminiLiveAgent base class is used to create specialized subclasses:
    • Concierge Agent: Handles travel logistics, flight searches, and lodging.
    • Language Tutor Agent: Dedicated to teaching the user Italian phrases.
  • Transfer Logic: The handle_transfer function allows the system to switch between agents seamlessly while maintaining the conversation context.
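
The shared-bus and transfer ideas above can be sketched without any framework. Everything here is a stand-in under stated assumptions: BaseAgent plays the role that the summary assigns to GeminiLiveAgent, and MessageBus, Orchestrator, and this handle_transfer signature are hypothetical, not the module's actual API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Message:
    role: str
    text: str


@dataclass
class MessageBus:
    """Shared history that every agent reads from and writes to."""

    history: List[Message] = field(default_factory=list)

    def publish(self, message: Message) -> None:
        self.history.append(message)


class BaseAgent:
    """Stand-in for a live-model base class; it only echoes, it calls no model."""

    name = "base"

    def __init__(self, bus: MessageBus) -> None:
        self.bus = bus

    def respond(self, user_text: str) -> str:
        self.bus.publish(Message("user", user_text))
        # A real agent would send the shared history to the model here.
        reply = f"[{self.name}] saw {len(self.bus.history)} message(s)"
        self.bus.publish(Message(self.name, reply))
        return reply


class ConciergeAgent(BaseAgent):
    name = "concierge"


class LanguageTutorAgent(BaseAgent):
    name = "language_tutor"


class Orchestrator:
    def __init__(self) -> None:
        self.bus = MessageBus()
        self.agents: Dict[str, BaseAgent] = {
            cls.name: cls(self.bus) for cls in (ConciergeAgent, LanguageTutorAgent)
        }
        self.active: BaseAgent = self.agents["concierge"]

    def handle_transfer(self, target: str) -> BaseAgent:
        """Switch the active agent; context survives because the bus is shared."""
        if target not in self.agents:
            raise KeyError(f"no agent named {target!r}")
        self.active = self.agents[target]
        return self.active
```

Because both agents hold the same bus, the tutor can see what the concierge already discussed after a transfer, which is the property the summary describes as maintaining conversation context.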

4. Real-World Application: Travel Planning

The case study involves planning a trip to Italy for five people.

  • Process: The agent gathers requirements (destination, dates, group size, preferences like "central location" or "pool access"), performs mock searches, and filters results based on user feedback.
  • Data Handling: The agent successfully manages constraints (e.g., finding hotels with pools) and saves the final itinerary as a persistent file.
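
The filter-then-save flow above can be sketched as follows. The lodging entries are made-up mock data, and filter_lodging, render_report, save_report, and load_report are hypothetical helper names chosen for this sketch; the video's own file format and function names are not shown in this summary.

```python
from pathlib import Path
from typing import Any, Dict, List

# Made-up mock results, standing in for what a lodging search might return.
MOCK_LODGING: List[Dict[str, Any]] = [
    {"name": "Hotel Centrale", "central": True, "pool": True, "sleeps": 6},
    {"name": "Villa Collina", "central": False, "pool": True, "sleeps": 8},
    {"name": "Pensione Piazza", "central": True, "pool": False, "sleeps": 5},
]


def filter_lodging(
    options: List[Dict[str, Any]],
    group_size: int,
    require_pool: bool = False,
    require_central: bool = False,
) -> List[Dict[str, Any]]:
    """Keep options that fit the group and satisfy each requested constraint."""
    return [
        o
        for o in options
        if o["sleeps"] >= group_size
        and (o["pool"] or not require_pool)
        and (o["central"] or not require_central)
    ]


def render_report(destination: str, group_size: int, lodging: List[Dict[str, Any]]) -> str:
    """Render the itinerary as Markdown so it can be re-read in a later session."""
    lines = [f"# Trip to {destination}", "", f"Travelers: {group_size}", "", "## Lodging"]
    lines += [f"- {o['name']}" for o in lodging]
    return "\n".join(lines) + "\n"


def save_report(path: Path, text: str) -> None:
    path.write_text(text, encoding="utf-8")


def load_report(path: Path) -> str:
    """Read a saved report back, e.g. to prepend it to a new session's context."""
    return path.read_text(encoding="utf-8")
```

A later session could pass the text from load_report into the system prompt or first message, which is one simple way to carry state across sessions.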

5. Notable Quotes

  • "I know it sounds kind of funny to ask the model to think as little as possible when it talks to us, but this has a really big impact on how quickly the bot will answer us." — Chad Bailey, on optimizing Gemini 3 for real-time voice.
  • "This is important for any voice agent that needs to do anything other than talking, which is pretty much any voice agent that’s actually worth using." — On the necessity of robust tool-calling capabilities.

6. Synthesis and Conclusion

The integration of Gemini 3 into the Pipecat framework represents a shift toward more capable, low-latency voice agents. By combining minimal thinking levels for speed, native tool calling for data retrieval, and multi-agent orchestration for specialized tasks, developers can build complex, stateful applications. The framework's open-source, vendor-neutral nature provides a flexible foundation for deploying these agents across various platforms, from web interfaces to telephony systems.
