Build agents with Gemini API

By Google for Developers

AI Agent Development, Multimodal AI, LLM API Integration

Key Concepts

Gemini API: Google’s platform for accessing DeepMind models.
Gemini 3.1 Flash Live: A multimodal, real-time conversational model supporting low-latency voice, vision, and text.
Interactions API: A unified interface for managing interactions with models and agents, featuring server-side state management.
Managed Agents (Antigravity): Hosted agents with persistent, sandboxed environments for code execution and file management.
Multimodality: The ability of models to process and generate text, audio, video, and code simultaneously.
Ephemeral Tokens: Short-lived security tokens used for secure client-side connections to the Gemini API.

1. Gemini Live API: Real-Time Conversational AI

The Gemini Live API enables low-latency, bidirectional communication over a stateful WebSocket connection, which carries streamed text, audio, and video frames.

Capabilities: The model performs "speech-to-speech" processing, meaning it does not rely solely on text transcription for reasoning. It supports 90 languages and can be interrupted in real time while it is responding.
Tool Use: It integrates with Google Search Grounding for real-time information and supports custom function calls (e.g., generating music via Lyria 3).
Ecosystem: Developers can integrate the API using frameworks like LiveKit, Daily, Pipecat, and Firebase.
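
The bidirectional, interruptible flow described above can be illustrated with a toy, in-process simulation. This is only a sketch: two asyncio queues stand in for the WebSocket, and the message types ("user_input", "model_chunk", "interrupted", "turn_complete", "close") and the function names are assumptions made for illustration, not the Live API's actual wire format.

```python
import asyncio

# Hypothetical reply that the toy model streams back chunk by chunk.
REPLY_CHUNKS = ("Sure, ", "here is ", "a long ", "answer.")


async def toy_model(inbound: asyncio.Queue, outbound: asyncio.Queue) -> None:
    """Stand-in for the remote model; stops speaking when new input arrives."""
    while True:
        msg = await inbound.get()
        if msg["type"] == "close":
            return
        for chunk in REPLY_CHUNKS:
            # New client input while replying counts as an interruption.
            if not inbound.empty():
                await outbound.put({"type": "interrupted"})
                break
            await outbound.put({"type": "model_chunk", "data": chunk})
            await asyncio.sleep(0)  # yield so the client side can run
        else:
            await outbound.put({"type": "turn_complete"})


async def run_demo() -> list:
    """Client side: start a turn, interrupt it, then read the second reply."""
    to_model, from_model = asyncio.Queue(), asyncio.Queue()
    model_task = asyncio.create_task(toy_model(to_model, from_model))
    events = []

    await to_model.put({"type": "user_input", "data": "Tell me a story"})
    events.append(await from_model.get())  # first chunk of the first reply

    # Send new input before the first reply has finished.
    await to_model.put({"type": "user_input", "data": "Actually, keep it short"})
    while events[-1]["type"] != "turn_complete":
        events.append(await from_model.get())

    await to_model.put({"type": "close"})
    await model_task
    return events
```

Calling `asyncio.run(run_demo())` returns the ordered list of events the client received. Both sides share one long-lived session and can send at any time, which is what distinguishes this pattern from a request/response call.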

2. Interactions API: Unified State Management

The Interactions API simplifies the development of multi-turn applications by moving state management from the client to the server.

Server-Side State: Instead of maintaining a massive array of conversation history on the client, developers receive an ID after the first call. Subsequent turns use this ID to maintain context automatically.
Steps Data Model: The API moves away from simple "user/model" turns to a "steps" model. Each action—such as a function call, a thought process, or a result—is treated as an individual, distinct step.
Modality Flexibility: The same API structure is used for text, speech generation (Gemini 3.1 Flash TTS), and image generation (Nano Banana Pro).
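
To make the server-side state and "steps" ideas concrete, here is a minimal in-memory sketch. Everything in it is hypothetical: `ToyInteractionServer`, `Step`, the `previous_id` parameter, and the step kinds are illustrative names, not the Interactions API's real schema. It only shows the pattern in which the client sends new input plus an ID while the server keeps the accumulated steps.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Step:
    kind: str      # assumed kinds: "user_input", "thought", "model_output"
    content: str


@dataclass
class Interaction:
    id: str
    steps: list = field(default_factory=list)


class ToyInteractionServer:
    """In-memory stand-in for a server that stores conversation state."""

    def __init__(self) -> None:
        self._store = {}
        self._ids = count(1)

    def create(self, user_input: str, previous_id: str = None) -> Interaction:
        # Context is looked up by ID on the server, not resent by the client.
        history = list(self._store[previous_id].steps) if previous_id else []
        new_steps = [
            Step("user_input", user_input),
            Step("thought", f"considering {len(history)} prior step(s)"),
            Step("model_output", f"echo: {user_input}"),
        ]
        interaction = Interaction(
            id=f"int-{next(self._ids)}", steps=history + new_steps
        )
        self._store[interaction.id] = interaction
        return interaction


server = ToyInteractionServer()
first = server.create("Plan a trip to Kyoto")
# The second call carries only the new input and the earlier interaction ID.
second = server.create("Make it three days", previous_id=first.id)
```

In this sketch the second interaction holds six steps, three carried over from the first turn and three new ones, even though the client never resent the first message.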

3. Managed Agents and Remote Sandboxing

A significant advancement is the introduction of Managed Agents (Antigravity), which provide a fully remote, persistent sandbox for agents.

Remote Execution: Unlike traditional coding agents that run on the user's local machine, these agents operate in a "small box" on Google’s infrastructure. This allows for asynchronous tasks and easier scaling.
Environment Persistence: Agents can share an environment ID. This allows multiple agents (e.g., a research agent and an app-builder agent) to access the same files and workspace.
Security: To prevent agents from misusing credentials, Google implements a network proxy. The agent is provided with a dummy header; when the agent makes an API call, the system intercepts it and injects the real, secure API key without exposing it to the agent’s code.
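
The credential-injection idea can be sketched as a pure function that a proxy might apply to outbound requests. This is an assumption-laden illustration: the `Authorization` header, the placeholder value, the request dictionary shape, and the per-host secret store are all hypothetical, since the summary does not describe how the proxy is implemented.

```python
# Hypothetical placeholder that the sandboxed agent sees instead of a real key.
PLACEHOLDER = "DUMMY-KEY"


def inject_credentials(request: dict, secrets: dict) -> dict:
    """Return a copy of the request with the placeholder swapped for the real key.

    `request` is assumed to look like {"host": str, "headers": dict}.
    `secrets` maps a host name to its real API key and exists only on the proxy.
    """
    headers = dict(request["headers"])  # copy so the agent's request is untouched
    host = request["host"]
    if headers.get("Authorization") == PLACEHOLDER and host in secrets:
        headers["Authorization"] = f"Bearer {secrets[host]}"
    return {**request, "headers": headers}


proxy_secrets = {"api.example.com": "real-secret-123"}
agent_request = {
    "host": "api.example.com",
    "headers": {"Authorization": PLACEHOLDER},
}
forwarded = inject_credentials(agent_request, proxy_secrets)
```

The agent's own request object still contains only the placeholder. The real key appears solely on the forwarded copy, and requests to hosts without a stored secret pass through unchanged.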

4. Step-by-Step: Building an Agent

Initialization: Use agent.create to define the agent’s name, base model (e.g., Antigravity Preview), and system instructions.
Scaffolding: Define the environment by pulling files from GCS buckets or GitHub repositories.
Tool Integration: Use the gemini-api-cli to install skills. The agent can then be instructed to install dependencies (e.g., pip install pandas) within its sandbox.
Execution: Trigger the agent via a single API call. The agent performs its own reasoning, tool calls, and file generation.
Retrieval: Use the environment ID to download generated files (e.g., HTML dashboards or SVG graphics) from the remote sandbox.
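
The workflow above, together with the shared environment ID from the previous section, can be modeled with a small in-memory stand-in. This is not the Gemini SDK: `ToyAgentPlatform` and all of its methods are hypothetical names, the model name is a placeholder, and the tool-integration step is left out because it depends on the real tooling. The sketch only shows how two agents can hand work to each other through a shared environment.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Agent:
    name: str
    base_model: str
    system_instruction: str


@dataclass
class Environment:
    id: str
    files: dict = field(default_factory=dict)  # path -> file contents


class ToyAgentPlatform:
    """In-memory stand-in for a hosted agent service (hypothetical)."""

    def __init__(self) -> None:
        self._envs = {}
        self._ids = count(1)

    def create_agent(self, name, base_model, system_instruction) -> Agent:
        return Agent(name, base_model, system_instruction)  # Initialization

    def create_environment(self, seed_files: dict) -> str:
        env = Environment(id=f"env-{next(self._ids)}", files=dict(seed_files))
        self._envs[env.id] = env  # Scaffolding: seed the workspace
        return env.id

    def run(self, agent: Agent, environment_id: str, task: str) -> str:
        env = self._envs[environment_id]  # Execution inside the shared workspace
        path = f"out/{agent.name}.txt"
        env.files[path] = (
            f"{agent.name} handled {task!r} with {len(env.files)} file(s) visible"
        )
        return path

    def download(self, environment_id: str, path: str) -> str:
        return self._envs[environment_id].files[path]  # Retrieval by environment ID


platform = ToyAgentPlatform()
env_id = platform.create_environment({"data/sales.csv": "region,total\nEMEA,10\n"})
researcher = platform.create_agent("researcher", "placeholder-model", "Summarize the data.")
builder = platform.create_agent("builder", "placeholder-model", "Build a dashboard.")
research_path = platform.run(researcher, env_id, "analyze sales")
build_path = platform.run(builder, env_id, "build dashboard")
```

Because both agents run against the same environment ID, the builder sees the file the researcher wrote. That is the sense in which context is shared through persistent files as well as through conversation history.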

5. Notable Quotes

"The power of Gemini really lies in its multimodality... Gemini is really, really good at understanding all these different modalities and reasoning over them." — Thor Schaeff
"We no longer have only context shared inside of the history, we can also now share context on the environment via persistent files." — Philipp Schmid

Synthesis

The session highlights a shift toward agent-native development. By combining the Live API for real-time voice, vision, and text interaction with the Interactions API for unified, server-side state, Google is enabling developers to build complex, autonomous agents. Managed, remote sandboxes address two infrastructure and security problems raised in the session: agents no longer depend on the user's local machine, and real API keys stay hidden from agent code. As a result, agents can carry out tasks such as coding, data analysis, and file management entirely in the cloud.
