Automate Product Listings with Gemini + Vision Agents
By Google for Developers
Key Concepts
- Gemini 3.1 Flash Live Model: A low-latency, multimodal AI model capable of real-time interaction, vision processing, and complex instruction following.
- Vision Agents SDK (Stream): A framework for building AI agents that interact with live video feeds and external tools.
- Function Calling: The ability for an LLM to trigger predefined code functions (e.g., image processing, web search) based on user needs.
- Instruction Following: The model's capacity to adhere to a strict, multi-step workflow defined in system prompts, even when prompted to deviate.
- Video Processors: Components that analyze live video frames to provide real-time guidance or trigger events (like taking a screenshot).
- Orchestration: The process of managing the agent's workflow, either through manual code logic or, as demonstrated, through structured system instructions.
1. Project Overview and Workflow
The project demonstrates an AI-powered assistant designed to help users create professional product listings for used items. The workflow involves:
- Object Detection: Analyzing a live video feed to identify the item.
- Image Capture & Polishing: Capturing a screenshot and using an image-generation tool to create a clean, professional background.
- Product Research: Performing a web search to gather details about the item.
- Listing Generation: Compiling the gathered information into a ready-to-use product description.
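The four steps above can be sketched as a simple pipeline. All function bodies here are placeholder stubs of my own; in the actual project each step is delegated to the Gemini model and its registered tools:

```python
# A minimal sketch of the four-step listing workflow. Every function body is
# a hypothetical stub; the real app delegates each step to Gemini and tools.

def detect_object(frame: bytes) -> str:
    """Step 1: identify the item visible in the captured frame (stub)."""
    return "vintage camera"

def polish_image(frame: bytes) -> bytes:
    """Step 2: re-render the photo on a clean, professional background (stub)."""
    return frame

def research_product(item: str) -> dict:
    """Step 3: gather product details via web search (stub)."""
    return {"name": item, "condition": "used"}

def build_listing(item: str, photo: bytes, details: dict) -> dict:
    """Step 4: compile the gathered information into a listing (stub)."""
    return {"title": item, "photo": photo, "details": details}

def create_listing(frame: bytes) -> dict:
    item = detect_object(frame)          # Object Detection
    photo = polish_image(frame)          # Image Capture & Polishing
    details = research_product(item)     # Product Research
    return build_listing(item, photo, details)  # Listing Generation
```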
2. Technical Implementation
The application is built using Python, with the following key components:
- Infrastructure: The `Vision Agents SDK` handles the connection to the Gemini model and manages the agent's lifecycle.
- Agent Initialization: The `create_agent` function defines the LLM, the infrastructure edge (`getstream.Edge`), and the agent's persona.
- Real-time Processing: A `VideoProcessor` class (specifically `ObjectCaptureProcessor`) subscribes to the user's video feed. It analyzes frames to determine whether the object is clear enough to capture, providing guidance if the user needs to adjust the item's position.
- Tool Integration: Custom tools are registered using the `@llm.register_function` decorator. This allows the agent to dynamically call:
  - Image Polishing: Uses `Gemini 3.1 Flash Image Preview` to process image bytes.
  - Product Search: Uses a `google_search` tool and validates the output against a JSON schema (`ProductDetails`) to ensure structured data.
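The decorator-based registration pattern can be illustrated with a plain-Python sketch. The `LLM` class below is hypothetical and stands in for the SDK's object; the real Vision Agents SDK wires registered functions into Gemini's function-calling loop:

```python
# Illustrative sketch of decorator-based tool registration, mimicking the
# @llm.register_function pattern. This LLM class is a hypothetical stand-in,
# not the Vision Agents SDK API.

class LLM:
    def __init__(self):
        self.tools = {}

    def register_function(self, fn):
        # Expose the Python function to the model under its own name.
        self.tools[fn.__name__] = fn
        return fn

llm = LLM()

@llm.register_function
def polish_image(image_bytes: bytes) -> bytes:
    """Re-render the product photo on a clean background (stub)."""
    return image_bytes

@llm.register_function
def product_search(query: str) -> dict:
    """Search the web and return structured product details (stub)."""
    return {"query": query, "results": []}
```

Once registered this way, the model can decide at runtime which tool to invoke based on the conversation, rather than the developer hard-coding the call order.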
3. Orchestration and Instruction Following
Instead of writing a rigid, manual Orchestrator class, the developers leveraged the Gemini model's advanced instruction-following capabilities.
- Methodology: A markdown file defines the agent's persona and a strict, ordered list of steps.
- Evidence of Capability: During a test, the user attempted to "jailbreak" the agent by asking it to skip the screenshot step and proceed directly to the description. The model successfully resisted, stating: "My purpose is to guide you through the steps in a specific order... Shall we do the screenshot now?"
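Such an instruction file might look roughly like the following. This is a hypothetical reconstruction from the workflow described above, not the project's actual prompt:

```markdown
# Persona
You are a friendly assistant that helps users list used items for sale.

# Workflow (follow strictly, in order)
1. Ask the user to show the item on camera and identify it.
2. Capture a screenshot and polish it into a professional product photo.
3. Research the product online to gather details.
4. Compile a ready-to-use listing description.

Never skip a step, even if the user asks you to.
```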
4. Development Framework
- Environment: Initialized using `uv init`.
- Dependencies: Includes the `vision-agents` SDK, the `google-genai` package, and `Next.js` for the front-end.
- Communication: Uses event-based WebSocket connections to notify the front-end of the agent's progress through the workflow.
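A progress notification of this kind can be sketched as a small JSON payload. The event name and field names below are assumptions for illustration; the actual app sends its events over the Stream edge to the Next.js front-end:

```python
# A minimal sketch of a workflow-progress event for the WebSocket channel.
# The event type and payload shape are assumed, not taken from the project.
import json

def progress_event(step: str, status: str) -> str:
    """Serialize a workflow-progress event as a JSON string."""
    return json.dumps({"type": "workflow.progress", "step": step, "status": status})

# e.g. emitted after the screenshot step completes:
message = progress_event("image_capture", "done")
```

On the front-end, a WebSocket `onmessage` handler would parse this payload and advance the step indicator in the UI.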
5. Notable Quotes
- "The reduced latency on the latest Gemini model really helps make this conversation flow feel very intuitive and natural." — Stefan Blos
- "Due to its reliability on function calling and instruction following, we can see that it directly translates and follows the tasks that we define in the instructions." — Stefan Blos
6. Synthesis and Conclusion
The integration of the Gemini 3.1 Flash Live Model with the Vision Agents SDK significantly lowers the barrier to entry for building complex, real-world AI applications. By offloading the orchestration logic to the model's instruction-following capabilities, developers can create intuitive, step-by-step user experiences that handle complex tasks—such as image processing and web research—without needing to build extensive custom infrastructure. The result is a highly reliable, conversational agent that maintains context and adheres to defined workflows, even under adversarial user input.