Building Conversational Agents — Thor Schaeff and Philipp Schmid, Google DeepMind

By AI Engineer

Key Concepts

  • Gemini API: Google’s interface for accessing Gemini models.
  • Interactions API: A unified, stateful API for managing models and agents, designed to simplify multi-turn conversations and tool usage.
  • Gemini Live API: A real-time, bidirectional, stateful WebSocket API for low-latency audio/video interaction.
  • Agent Skills: Modular, reusable code snippets (often via CLI) that provide agents with specific capabilities (e.g., file system access, bash execution).
  • Server-Side State: A feature of the Interactions API that stores conversation history on the server, reducing the need for client-side state management and improving cache hit rates.
  • Implicit Caching: A cost-saving mechanism where input token encodings are cached; server-side state management significantly improves cache hit rates by preventing unnecessary context fragmentation.
  • Grounding: The process of connecting an AI model to real-time data (e.g., Google Search) to reduce hallucinations and provide accurate, up-to-date information.
  • Ephemeral Tokens: Short-lived security tokens used to initiate secure, direct client-to-server WebSocket connections for the Live API.

1. The Interactions API

The Interactions API is a new, beta-stage interface designed to align with industry standards (similar to Chat Completions APIs).

  • Unified Interface: It handles both models and agents using a consistent "content block" structure (text, audio, video, function calls).
  • State Management: By providing a previous_interaction_id, the server maintains the conversation history. This eliminates the need for the client to send the entire history in every request.
  • Asynchronous Execution: Supports background processing for long-running tasks like "Deep Research," allowing for polling or webhooks rather than keeping HTTP connections open.
  • Efficiency: Improves cache hit rates by 2–3x compared to client-managed state, as the server maintains a consistent context structure.
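The state-management pattern above can be sketched as follows. This is a minimal, illustrative sketch: the field names (`previous_interaction_id`, `input`, the content-block shape) follow the talk's description, not a published schema, and the model name is a placeholder — check the Interactions API beta docs for the real request format.

```python
# Sketch: building one-turn request bodies under server-side state.
# Field names and the model string are illustrative assumptions.

def build_turn(text, previous_interaction_id=None):
    """Build the body for a single conversation turn.

    With server-side state, only the new input plus the previous
    interaction's id are sent; the server reconstructs the history.
    Without it, the client would have to resend every prior message.
    """
    body = {
        "model": "gemini-2.5-flash",  # placeholder model name
        "input": [{"type": "text", "text": text}],
    }
    if previous_interaction_id is not None:
        body["previous_interaction_id"] = previous_interaction_id
    return body

# First turn: no prior state to reference.
first = build_turn("What is the Interactions API?")
# Follow-up: reference the server-held history instead of resending it.
follow_up = build_turn("How does caching work?", previous_interaction_id="int_abc123")
```

Because the server keeps the context contiguous and identically ordered across turns, the encoded prefix stays byte-stable, which is what lets implicit caching hit far more often than with client-assembled histories.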

2. Building Coding Agents

The workshop demonstrated building a coding agent using "Skills" to extend the model's capabilities.

  • Methodology:
    1. Define Tools: Create JSON schemas for tools (e.g., read_file, write_file, run_bash).
    2. Implement Logic: Write Python/TypeScript functions to execute these tools on the local machine.
    3. The Loop: Implement a while loop that sends user input to the model, checks for requires_action (function calls), executes the tool, and sends the result back until the model generates a final text response.
  • System Instructions: Crucial for defining the agent's persona (e.g., "Expert Software Engineer") and setting guardrails.
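The loop in step 3 can be sketched with the model call stubbed out, so the control flow is visible without a live API. The response shapes (`requires_action`, `done`), the message roles, and the tool registry here are illustrative assumptions standing in for the real SDK types.

```python
# Sketch of the agent tool loop: send input, execute requested tools
# locally, feed results back, repeat until a final text response.
# Response shapes and roles are illustrative assumptions.

def run_agent(user_input, call_model, tools, max_steps=10):
    """Drive the model until it returns a final text response.

    call_model(messages) -> either
      {"status": "requires_action", "call": {"name": ..., "args": ...}}
    or {"status": "done", "text": ...}
    """
    messages = [{"role": "user", "content": user_input}]
    for _ in range(max_steps):
        response = call_model(messages)
        if response["status"] == "done":
            return response["text"]
        call = response["call"]
        # Execute the requested tool on the local machine.
        tool_result = tools[call["name"]](**call["args"])
        messages.append({"role": "tool", "name": call["name"], "content": tool_result})
    raise RuntimeError("agent exceeded max_steps without finishing")

# Demonstration with a fake model that asks for one tool call, then answers.
def fake_model(messages):
    if not any(m["role"] == "tool" for m in messages):
        return {"status": "requires_action",
                "call": {"name": "add", "args": {"a": 2, "b": 3}}}
    return {"status": "done", "text": str(messages[-1]["content"])}

result = run_agent("What is 2 + 3?", fake_model, {"add": lambda a, b: a + b})
```

The `max_steps` bound is one of the guardrails the system instructions complement: it prevents a confused model from looping on tool calls indefinitely.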

3. Gemini Live API (Real-Time Interaction)

The Live API enables native audio-to-audio communication, bypassing the traditional "transcribe-to-text -> LLM -> text-to-speech" pipeline.

  • Key Features:
    • Multimodality: Can ingest audio, video (up to 1 frame/sec), and text simultaneously.
    • Native Audio: The model processes sound tokens directly, allowing for natural language switching and "barge-in" (interrupting the model).
    • Deployment: Requires a WebSocket connection. For production, the presenters recommend using partner integrations like LiveKit or Pipecat for WebRTC support and observability.
  • Security: Uses ephemeral tokens generated on the server to allow the client to connect directly to the Live API without exposing the master API key.
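The ephemeral-token pattern can be sketched as below: the server mints a short-lived credential that the browser uses to open the WebSocket directly, so the long-lived API key never reaches the client. The minting function, TTL, and field names are illustrative assumptions, not the real Live API SDK surface.

```python
# Sketch of the ephemeral-token pattern for direct client connections.
# Token shape and lifetime are illustrative assumptions.
import secrets
import time

TOKEN_TTL_SECONDS = 60  # assumed short lifetime

def mint_ephemeral_token():
    """Server-side: mint a short-lived token for one client session.

    The client receives only this token, never the master API key.
    """
    return {
        "token": secrets.token_urlsafe(32),
        "expires_at": time.time() + TOKEN_TTL_SECONDS,
    }

def is_valid(token):
    """Server-side check before accepting the client's connection."""
    return time.time() < token["expires_at"]
```

In the real flow, the Gemini backend performs the validation; the point of the sketch is the division of trust: long-lived secrets stay server-side, and the client holds only a credential that expires in seconds.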

4. Real-World Applications & Best Practices

  • Use Cases:
    • Shopify Sidekick: Tech support for store configuration.
    • Hey Ado: Voice companions for the elderly, leveraging multilingual capabilities.
    • Waymo: In-car conversational interfaces.
  • Hallucination Mitigation: The presenters emphasized that while demos may fail, production reliability is achieved through rigorous system instructions, clear guardrails, and grounding via Google Search.
  • Observability: For business-critical applications, a "cascading pipeline" (where each step is logged and observable) is currently preferred over the "black box" nature of native real-time audio models.
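The cascading pipeline the presenters describe can be sketched generically: each stage is a named step whose intermediate output is logged before the next stage runs, which is exactly the visibility a single native audio-to-audio model does not give you. The stage functions here are stand-ins, not real STT/LLM/TTS calls.

```python
# Sketch of a cascading pipeline with a log line per stage, in contrast
# to the opaque native real-time audio path. Stages are stand-in lambdas.
import logging

def run_pipeline(payload, stages):
    """Run named stages in order, logging every intermediate output."""
    for name, fn in stages:
        payload = fn(payload)
        logging.info("stage=%s output=%r", name, payload)
    return payload

# Stand-in stages for transcription -> LLM -> speech synthesis.
stages = [
    ("transcribe", lambda audio: f"text({audio})"),
    ("llm", lambda text: f"reply({text})"),
    ("tts", lambda reply: f"audio({reply})"),
]

out = run_pipeline("mic_input", stages)
```

The trade-off is latency and naturalness: each boundary adds delay and loses paralinguistic signal, but every boundary is also a point where you can inspect, test, and alert, which is why the presenters prefer it for business-critical deployments today.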

5. Synthesis and Takeaways

The workshop highlighted a shift toward stateful, agentic workflows. The Interactions API simplifies development by offloading state management to the server, while the Live API represents the future of low-latency, multimodal interaction. Developers are encouraged to use Agent Skills to provide models with specialized tools rather than relying on the model's base training. While the technology is powerful, developers must balance the "magic" of native audio models with the need for observability and security in production environments.
