Vision Models Can't Count. Here's the Fix.

By Prompt Engineering

Key Concepts

  • Vision Language Model (VLM): AI models (like Gemma 4) capable of processing and reasoning about images and text.
  • Image Segmentation: The process of partitioning an image into multiple segments or objects (e.g., Falcon Perception).
  • Agentic Loop: An autonomous system where an AI model plans, executes, and re-evaluates tasks using specific tools.
  • Grounding: The process of linking abstract concepts or language to specific locations or objects within an image.
  • Chain of Perception Decoding: A technique used by the Falcon Perception model to process text and image inputs simultaneously to generate binary masks.
  • Occlusion: When objects are partially hidden or blocked by other objects in an image, making them difficult for standard models to count.
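The "binary mask" output mentioned above is simply a per-pixel yes/no map: 1 where a pixel belongs to the queried object, 0 elsewhere. A minimal illustration in pure Python (the 4x4 mask values are invented for demonstration):

```python
# A binary mask assigns 1 to pixels belonging to the queried object
# and 0 to everything else. Toy 4x4 example (values are illustrative).
mask = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
]

height = len(mask)
width = len(mask[0])
object_pixels = sum(sum(row) for row in mask)

# Fraction of the image covered by the object.
coverage = object_pixels / (height * width)
print(object_pixels, coverage)
```

A full-resolution mask from a model like Falcon Perception works the same way, just at the image's native width and height.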

1. The Problem: Limitations of Standalone VLMs

While Vision Language Models like Google’s Gemma 4 are excellent at scene understanding and general reasoning, they struggle with:

  • Counting: They often fail to accurately count objects, especially in complex or crowded scenes.
  • Spatial Grounding: They cannot pinpoint the exact location or boundaries of specific objects.
  • Occlusion: They struggle to distinguish between overlapping objects.
  • Reasoning Errors: As demonstrated in the fruit-counting example, a VLM might hallucinate or miscount when asked to compare quantities (e.g., "Are there more oranges than apples?").

2. The Solution: The "Gemma Vision Agent" Pipeline

The proposed solution is an agentic pipeline that combines the reasoning power of Gemma 4 with the precise segmentation capabilities of the Falcon Perception model.

Technical Specifications:

  • Gemma 4 (E4B Instruction-following): A multimodal reasoning model, efficient enough for local inference.
  • Falcon Perception: A 0.3 billion (300 million) parameter image segmentation model from the Technology Innovation Institute (TII). It is highly efficient and capable of generating full-resolution binary masks.

3. The Agentic Loop Methodology

The system operates through a structured, multi-step process:

  1. Planning & Routing: The Gemma 4 model acts as a "router." It analyzes the user query to determine if segmentation is required.
  2. Tool Execution: If segmentation is needed, the agent calls the Falcon Perception model to isolate specific objects (e.g., "detect all apples").
  3. Visual Analysis: The segmentation tool returns its results (binary masks, rendered as annotated images or bounding boxes) for the VLM to inspect.
  4. Reasoning: Gemma 4 processes the segmented data to provide a final, accurate answer.
  5. Re-evaluation: If the initial plan fails, the agent can loop back to re-evaluate the scene (limited to 8 steps for safety).
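The five steps above can be sketched as a small control loop. This is a hypothetical skeleton, not the author's actual code: the callback names (`plan_fn`, `segment_fn`, `answer_fn`) and the message format are assumptions; only the 8-step safety cap comes from the summary.

```python
MAX_STEPS = 8  # safety cap on re-evaluation, as noted in the summary

def run_vision_agent(query, image, plan_fn, segment_fn, answer_fn):
    """Hypothetical agentic loop: plan -> execute tool -> reason -> re-evaluate.

    plan_fn(query, observations)   -> {"action": "segment", "target": "apples"}
                                      or {"action": "answer"}
    segment_fn(image, target)      -> segmentation result (masks / boxes)
    answer_fn(query, observations) -> final text answer
    """
    observations = []
    for step in range(MAX_STEPS):
        plan = plan_fn(query, observations)      # 1. planning & routing
        if plan["action"] == "answer":           # model has enough evidence
            return answer_fn(query, observations)
        result = segment_fn(image, plan["target"])   # 2. tool execution
        observations.append((plan["target"], result))  # 3. visual analysis
    # Step budget exhausted: answer with whatever evidence was gathered.
    return answer_fn(query, observations)        # 4. reasoning
```

For "Are there more oranges than apples?", the planner would first request segmentation of oranges, then apples, then answer by comparing the two counts.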

4. Real-World Applications & Examples

  • Object Counting & Comparison: The system successfully differentiates between apples and oranges in a cluttered image, whereas a standalone VLM fails.
  • Complex Scene Analysis: In a street scene, the agent can identify and count cars versus people, even when some individuals are partially occluded.
  • Breed Identification: The agent can segment dogs in an image and then use the VLM to identify their specific breeds.
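Once the segmentation model has produced a binary mask, object counting reduces to counting connected components in that mask. A pure-Python sketch using flood fill (the toy mask is invented for illustration):

```python
from collections import deque

def count_objects(mask):
    """Count connected components (4-connectivity) in a binary mask.

    Each component is treated as one object instance, so counting
    objects reduces to counting components in the mask.
    """
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if mask[y][x] and not seen[y][x]:
                count += 1
                queue = deque([(y, x)])  # flood-fill this component
                seen[y][x] = True
                while queue:
                    cy, cx = queue.popleft()
                    for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count

# Toy mask with two separate blobs (values are illustrative).
toy = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
]
print(count_objects(toy))  # prints 2
```

Note that this simple approach merges touching or mutually occluding objects into one component; per-instance masks from the segmentation model avoid that failure mode.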

5. Notable Quotes

  • "If you combine this with a segmentation model and wrap it around an agentic loop, then you can use the agent to first segment oranges, then apples, count them, and then this model can reason around the items that it has found."
  • "Falcon Perception... is only about 0.3 billion or 300 million parameters. So, you can actually use this for local inference."

6. Implementation Details

  • Hardware: The system is designed for local execution. The author provides support for Nvidia GPUs (DGX Spark) and Apple Silicon (via MLX).
  • Integration: The pipeline uses a "Plan Router" to decide between a simple sequential path (for easy queries) and a complex agentic loop (for open-ended, multi-step queries).
  • Speech Integration: The demo incorporates the Parakeet model (Nvidia) for real-time speech-to-text transcription, allowing for voice-based interaction with the vision agent.
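The Plan Router described above can be sketched as a single classification call to the VLM, whose one-word reply selects the path. The prompt wording and function names here are assumptions, not the author's actual implementation:

```python
ROUTER_PROMPT = (
    "Classify the user query as 'simple' (one direct question about the image) "
    "or 'agentic' (open-ended or multi-step, requiring segmentation tools). "
    "Reply with exactly one word."
)

def route(query, llm_fn):
    """Hypothetical plan router: ask the VLM which path to take.

    llm_fn(system, user) -> str  (a stand-in for a Gemma 4 call)
    Defaults to the agentic path when the reply is unrecognized,
    since the agentic loop can still answer simple queries.
    """
    reply = llm_fn(ROUTER_PROMPT, query).strip().lower()
    return "simple" if reply == "simple" else "agentic"
```

A query like "What color is the car?" would take the sequential path, while "Are there more oranges than apples?" would be routed into the agentic loop.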

7. Synthesis and Conclusion

The "Gemma Vision Agent" demonstrates that combining specialized, lightweight models (Falcon Perception) with powerful reasoning models (Gemma 4) creates a robust, local-first system that overcomes the inherent counting and grounding weaknesses of standalone VLMs. By utilizing an agentic loop, the system moves beyond simple image classification into complex, multi-step visual reasoning. Future iterations are expected to explore real-time video processing and object tracking.