Automate Product Listings with Gemini + Vision Agents
By Google for Developers
Key Concepts
- Gemini 3.1 Flash Live Model: A low-latency, multimodal AI model capable of real-time interaction, vision processing, and complex instruction following.
- Vision Agents SDK (Stream): A framework for building AI agents that interact with live video feeds and external tools.
- Function Calling: The ability for an LLM to trigger predefined code functions (e.g., image processing, web search) based on user needs.
- Instruction Following: The model's capacity to adhere to a strict, multi-step workflow defined in system prompts, even when prompted to deviate.
- Video Processors: Components that analyze live video frames to provide real-time guidance or trigger events (like taking a screenshot).
- Orchestration: The process of managing the agent's workflow, either through manual code logic or, as demonstrated, through structured system instructions.
1. Project Overview and Workflow
The project demonstrates an AI-powered assistant designed to help users create professional product listings for used items. The workflow involves:
- Object Detection: Analyzing a live video feed to identify the item.
- Image Capture & Polishing: Capturing a screenshot and using an image-generation tool to create a clean, professional background.
- Product Research: Performing a web search to gather details about the item.
- Listing Generation: Compiling the gathered information into a ready-to-use product description.
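The four steps above can be sketched as a simple pipeline. All function bodies here are placeholder stubs of my own; in the actual project each step is delegated to the Gemini model and its registered tools:

```python
# A minimal sketch of the four-step listing workflow. Every function body is
# a hypothetical stub; the real app delegates each step to Gemini and tools.

def detect_object(frame: bytes) -> str:
    """Step 1: identify the item visible in the captured frame (stub)."""
    return "vintage camera"

def polish_image(frame: bytes) -> bytes:
    """Step 2: re-render the photo on a clean, professional background (stub)."""
    return frame

def research_product(item: str) -> dict:
    """Step 3: gather product details via web search (stub)."""
    return {"name": item, "condition": "used"}

def build_listing(item: str, photo: bytes, details: dict) -> dict:
    """Step 4: compile the gathered information into a listing (stub)."""
    return {"title": item, "photo": photo, "details": details}

def create_listing(frame: bytes) -> dict:
    item = detect_object(frame)          # Object Detection
    photo = polish_image(frame)          # Image Capture & Polishing
    details = research_product(item)     # Product Research
    return build_listing(item, photo, details)  # Listing Generation
```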
2. Technical Implementation
The application is built using Python, with the following key components:
- Infrastructure: The `Vision Agents SDK` handles the connection to the Gemini model and manages the agent's lifecycle.
- Agent Initialization: The `create_agent` function defines the LLM, the infrastructure edge (`getstream.Edge`), and the agent's persona.
- Real-time Processing: A `VideoProcessor` class (specifically `ObjectCaptureProcessor`) subscribes to the user's video feed. It analyzes frames to determine whether the object is clear enough to capture, providing guidance if the user needs to adjust the item's position.
- Tool Integration: Custom tools are registered using the `@llm.register_function` decorator. This allows the agent to dynamically call:
  - Image Polishing: Uses `Gemini 3.1 Flash Image Preview` to process image bytes.
  - Product Search: Uses a `google_search` tool and validates the output against a JSON schema (`ProductDetails`) to ensure structured data.
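The decorator-based registration pattern can be illustrated with a plain-Python sketch. The `LLM` class below is hypothetical and stands in for the SDK's object; the real Vision Agents SDK wires registered functions into Gemini's function-calling loop:

```python
# Illustrative sketch of decorator-based tool registration, mimicking the
# @llm.register_function pattern. This LLM class is a hypothetical stand-in,
# not the Vision Agents SDK API.

class LLM:
    def __init__(self):
        self.tools = {}

    def register_function(self, fn):
        # Expose the Python function to the model under its own name.
        self.tools[fn.__name__] = fn
        return fn

llm = LLM()

@llm.register_function
def polish_image(image_bytes: bytes) -> bytes:
    """Re-render the product photo on a clean background (stub)."""
    return image_bytes

@llm.register_function
def product_search(query: str) -> dict:
    """Search the web and return structured product details (stub)."""
    return {"query": query, "results": []}
```

Once registered this way, the model can decide at runtime which tool to invoke based on the conversation, rather than the developer hard-coding the call order.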
3. Orchestration and Instruction Following
Instead of writing a rigid, manual Orchestrator class, the developers leveraged the Gemini model's advanced instruction-following capabilities.
- Methodology: A markdown file defines the agent's persona and a strict, ordered list of steps.
- Evidence of Capability: During a test, the user attempted to "jailbreak" the agent by asking it to skip the screenshot step and proceed directly to the description. The model successfully resisted, stating: "My purpose is to guide you through the steps in a specific order... Shall we do the screenshot now?"
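Such an instruction file might look roughly like the following. This is a hypothetical reconstruction from the workflow described above, not the project's actual prompt:

```markdown
# Persona
You are a friendly assistant that helps users list used items for sale.

# Workflow (follow strictly, in order)
1. Ask the user to show the item on camera and identify it.
2. Capture a screenshot and polish it into a professional product photo.
3. Research the product online to gather details.
4. Compile a ready-to-use listing description.

Never skip a step, even if the user asks you to.
```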
4. Development Framework
- Environment: Initialized using `uv init`.
- Dependencies: Includes the `vision-agents` SDK, the `google-genai` package, and `Next.js` for the front-end.
- Communication: Uses event-based WebSocket connections to notify the front-end of the agent's progress through the workflow.
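A progress notification of this kind can be sketched as a small JSON payload. The event name and field names below are assumptions for illustration; the actual app sends its events over the Stream edge to the Next.js front-end:

```python
# A minimal sketch of a workflow-progress event for the WebSocket channel.
# The event type and payload shape are assumed, not taken from the project.
import json

def progress_event(step: str, status: str) -> str:
    """Serialize a workflow-progress event as a JSON string."""
    return json.dumps({"type": "workflow.progress", "step": step, "status": status})

# e.g. emitted after the screenshot step completes:
message = progress_event("image_capture", "done")
```

On the front-end, a WebSocket `onmessage` handler would parse this payload and advance the step indicator in the UI.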
5. Notable Quotes
- "The reduced latency on the latest Gemini model really helps make this conversation flow feel very intuitive and natural." — Stefan Blos
- "Due to its reliability on function calling and instruction following, we can see that it directly translates and follows the tasks that we define in the instructions." — Stefan Blos
6. Synthesis and Conclusion
The integration of the Gemini 3.1 Flash Live Model with the Vision Agents SDK significantly lowers the barrier to entry for building complex, real-world AI applications. By offloading the orchestration logic to the model's instruction-following capabilities, developers can create intuitive, step-by-step user experiences that handle complex tasks—such as image processing and web research—without needing to build extensive custom infrastructure. The result is a highly reliable, conversational agent that maintains context and adheres to defined workflows, even under adversarial user input.