Power the future of robotics with Gemini

By Google for Developers

Key Concepts

Embodied Reasoning (ER): The integration of AI models with physical robots to enable spatial awareness, physics-based reasoning, and decision-making.
Vision-Language Models (VLMs): Architectures that map language and vision into a shared representation space, allowing for "open vocabulary" object detection.
Vision-Language-Action (VLA) Models: Models that map camera pixels and natural language directly to motor control values.
Perceive, Plan, Actuate: The fundamental framework for robotic autonomy.
Semantic Grounding: The ability of a model to link abstract natural language concepts to precise physical coordinates in the real world.
Long Horizon Temporal Reasoning: The ability to analyze sequences of video frames to understand state changes over time.
Swiss Cheese Model of Safety: A multi-layered approach to safety that stacks semantic, physical, and operational safeguards.
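
To make semantic grounding concrete, the sketch below converts a model's pointing reply into pixel coordinates. The reply format (a JSON list of `[y, x]` points normalized to a 0-1000 scale, each with a `label`) is an assumption for illustration and is not stated in the video summary. The `to_pixels` helper and the sample reply are hypothetical.

```python
import json


def to_pixels(points_json: str, width: int, height: int) -> dict[str, tuple[int, int]]:
    """Convert normalized [y, x] points (assumed 0-1000 scale) to pixel (x, y)."""
    result = {}
    for item in json.loads(points_json):
        y_norm, x_norm = item["point"]
        x_px = round(x_norm / 1000 * width)
        y_px = round(y_norm / 1000 * height)
        result[item["label"]] = (x_px, y_px)
    return result


# Hypothetical model reply for the prompt "point to the blue block".
reply = '[{"point": [500, 250], "label": "blue block"}]'
print(to_pixels(reply, width=640, height=480))  # {'blue block': (160, 240)}
```

A real pipeline would then map the pixel coordinates into the robot's workspace frame, which requires camera calibration and is outside this sketch.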

1. The Perceive, Plan, Actuate Framework

Google DeepMind utilizes a three-stage pipeline to transition robots from static, pre-programmed machines to autonomous agents:

Perceive: Using Gemini Robotics ER 1.6, robots move beyond simple object classification to "embodied reasoning." Traditional detectors (e.g., YOLO) are trained on closed-vocabulary datasets and can only recognize a fixed set of labels. ER models instead use open-vocabulary detection, so objects can be identified from free-form descriptions, including their state (e.g., "the tool that looks used").
Plan: The model performs Task Orchestration by breaking high-level human prompts (e.g., "put the blue block in the orange bowl") into a sequence of executable functions. It also handles Micro-planning, generating specific trajectories and waypoints to avoid obstacles.
Actuate: This involves moving the hardware. For complex, unstructured environments, VLA models are used to stream camera frames directly into motor values, allowing for reactive, real-time control.
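
The three stages above can be sketched as a toy pipeline. This is a minimal illustration, not the Gemini Robotics API: `perceive`, `plan`, and `actuate` are hypothetical stand-ins, the "scene" is already labeled rather than detected by a model, and actuation only records the commands it would send.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str
    x: float
    y: float


def perceive(scene: dict[str, tuple[float, float]]) -> list[Detection]:
    # Stand-in for a perception model: the scene is already labeled here.
    return [Detection(label, x, y) for label, (x, y) in scene.items()]


def plan(goal: tuple[str, str], detections: list[Detection]) -> list[tuple[str, float, float]]:
    """Break a 'put <item> in <target>' goal into primitive steps."""
    by_label = {d.label: d for d in detections}
    item, target = by_label[goal[0]], by_label[goal[1]]
    return [
        ("move_to", item.x, item.y),
        ("close_gripper", item.x, item.y),
        ("move_to", target.x, target.y),
        ("open_gripper", target.x, target.y),
    ]


def actuate(steps: list[tuple[str, float, float]], log: list[str]) -> None:
    # Stand-in for hardware control: record each command instead of moving.
    for name, x, y in steps:
        log.append(f"{name}({x}, {y})")


scene = {"blue block": (0.2, 0.4), "orange bowl": (0.7, 0.1)}
log: list[str] = []
actuate(plan(("blue block", "orange bowl"), perceive(scene)), log)
print(log)
```

The point of the structure is that only `plan` depends on the goal, so the same perception and actuation code can serve many different prompts.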

2. Advanced Capabilities and Methodologies

Physical Common Sense: The ER model understands the physics of a scene. It recognizes structural integrity and weight, preventing the robot from attempting impossible tasks (e.g., lifting a table bolted to the floor).
Code Execution for Vision: To overcome issues like image rotation or clutter, the model can generate and execute Python code to crop, rotate, or process images, significantly improving accuracy in tasks like reading analog gauges or identifying serial numbers on chips.
Temporal Reasoning: Built on the Gemini 3 Flash backbone, the system tokenizes video frames in chronological order. This allows the robot to act as its own "temporal supervisor," confirming whether a task (like securing an object with a gripper) succeeded by analyzing the motion delta between frames.
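
As a rough illustration of the last two items, the sketch below crops a region of interest and compares it across two frames. It swaps in a plain per-pixel difference for the model's learned temporal reasoning, so it is only an analogy. The frames, the `crop` and `motion_delta` helpers, and the threshold are all hypothetical.

```python
Frame = list[list[int]]  # grayscale image as rows of 0-255 intensities


def crop(frame: Frame, top: int, left: int, height: int, width: int) -> Frame:
    """Return the sub-image starting at (top, left) with the given size."""
    return [row[left:left + width] for row in frame[top:top + height]]


def motion_delta(before: Frame, after: Frame) -> float:
    """Mean absolute per-pixel difference between two same-sized frames."""
    total = sum(
        abs(a - b)
        for row_before, row_after in zip(before, after)
        for b, a in zip(row_before, row_after)
    )
    return total / (len(before) * len(before[0]))


def object_moved(before: Frame, after: Frame, threshold: float = 10.0) -> bool:
    # The threshold is an arbitrary illustrative value, not a tuned one.
    return motion_delta(before, after) > threshold


# A bright 2x2 object sits on the table, then is lifted out of view.
before = [
    [0, 0, 0, 0],
    [0, 200, 200, 0],
    [0, 200, 200, 0],
    [0, 0, 0, 0],
]
after = [[0] * 4 for _ in range(4)]

roi_before = crop(before, top=1, left=1, height=2, width=2)
roi_after = crop(after, top=1, left=1, height=2, width=2)
print(motion_delta(roi_before, roi_after))  # 200.0
print(object_moved(roi_before, roi_after))  # True
```

Cropping first concentrates the comparison on the region that matters, which is the same motivation the summary gives for letting the model generate image-processing code.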

3. Human-Robot Interaction (HRI)

Gemini Live API: Enables low-latency, bidirectional, natural language communication. This allows robots to act as interactive partners rather than just tools.
Function Calling: The bridge between conversation and action. The model uses visual and auditory context to trigger specific developer-defined functions, allowing for fluid, context-aware responses to human requests.
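
The function-calling bridge can be sketched as a registry of developer-defined functions plus a dispatcher. This is a minimal sketch that assumes the model emits a call as a dictionary with `name` and `args` keys. That shape, the `robot_function` decorator, and the two example functions are hypothetical and do not come from the Gemini Live API.

```python
from typing import Callable

REGISTRY: dict[str, Callable[..., str]] = {}


def robot_function(fn: Callable[..., str]) -> Callable[..., str]:
    """Register a developer-defined function under its own name."""
    REGISTRY[fn.__name__] = fn
    return fn


@robot_function
def wave(hand: str = "right") -> str:
    return f"waving {hand} hand"


@robot_function
def hand_over(item: str) -> str:
    return f"handing over {item}"


def dispatch(call: dict) -> str:
    """Run the registered function named in a model-emitted call."""
    name = call.get("name")
    if name not in REGISTRY:
        # Refuse anything the developer did not explicitly expose.
        return f"unknown function: {name}"
    return REGISTRY[name](**call.get("args", {}))


print(dispatch({"name": "hand_over", "args": {"item": "screwdriver"}}))
print(dispatch({"name": "wave"}))
print(dispatch({"name": "self_destruct"}))
```

Keeping an explicit registry means the model can only trigger behavior the developer has chosen to expose.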

4. Development and Prototyping

AI Studio: A platform for rapid prototyping where developers can test prompts and visual perception without needing to re-flash hardware.
MuJoCo Integration: A simulation engine that allows developers to test logic in a virtual environment before deploying to physical hardware, supporting a "fail-fast, fail-safe" development strategy.
Embodiment-Agnostic Design: The models are designed to work across various hardware configurations, including humanoids, quadrupeds, and bipedal setups.
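
One way to picture simulation-first, embodiment-agnostic development is to write task logic against a small interface and run it on a simulated robot first. The sketch below does not use MuJoCo. `SimulatedArm` is a hypothetical stand-in with a simple reach limit, and the `Embodiment` interface is an assumption made for illustration.

```python
from typing import Protocol


class Embodiment(Protocol):
    def move_to(self, x: float, y: float, z: float) -> bool: ...


class SimulatedArm:
    """Toy simulated arm that rejects targets beyond its reach (in meters)."""

    def __init__(self, reach: float) -> None:
        self.reach = reach
        self.position = (0.0, 0.0, 0.0)

    def move_to(self, x: float, y: float, z: float) -> bool:
        if (x * x + y * y + z * z) ** 0.5 > self.reach:
            return False
        self.position = (x, y, z)
        return True


def run_waypoints(robot: Embodiment, waypoints: list[tuple[float, float, float]]) -> int:
    """Return the number of waypoints reached before the first failure."""
    reached = 0
    for waypoint in waypoints:
        if not robot.move_to(*waypoint):
            break
        reached += 1
    return reached


arm = SimulatedArm(reach=1.0)
path = [(0.1, 0.2, 0.3), (0.5, 0.5, 0.5), (2.0, 0.0, 0.0)]
print(run_waypoints(arm, path))  # 2
print(arm.position)              # (0.5, 0.5, 0.5)
```

Because `run_waypoints` depends only on the interface, a class wrapping real hardware or a physics simulator could replace `SimulatedArm` without changing the task logic.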

5. Safety Research

Safety is treated as a foundational requirement rather than an afterthought.

Asimov Safety Benchmarks: A research-backed framework that evaluates safety using:
- NEISS (National Electronic Injury Surveillance System): Real-world hospital injury data used to train the model on physical risks.
- ISO Standards: Industrial safety constraints to ensure compliance with factory-grade operational requirements.
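
The layered "Swiss cheese" approach named under Key Concepts can be sketched as a stack of independent checks, where an action proceeds only if every layer passes. The phrase list, the force limit, and the workspace rule below are hypothetical placeholders. They are not drawn from NEISS data or from any ISO standard.

```python
from dataclasses import dataclass
from typing import Callable

MAX_FORCE_NEWTONS = 50.0  # illustrative limit, not a real standard's value


@dataclass
class Action:
    description: str
    force_newtons: float
    human_in_workspace: bool


def semantic_check(action: Action) -> bool:
    # Hypothetical phrase list standing in for a model's semantic judgment.
    risky_phrases = ("toward person", "over the edge")
    return not any(phrase in action.description for phrase in risky_phrases)


def physical_check(action: Action) -> bool:
    return action.force_newtons <= MAX_FORCE_NEWTONS


def operational_check(action: Action) -> bool:
    return not action.human_in_workspace


LAYERS: list[tuple[str, Callable[[Action], bool]]] = [
    ("semantic", semantic_check),
    ("physical", physical_check),
    ("operational", operational_check),
]


def blocked_by(action: Action) -> list[str]:
    """Names of every safety layer that rejects the action."""
    return [name for name, check in LAYERS if not check(action)]


def is_allowed(action: Action) -> bool:
    return not blocked_by(action)


safe = Action("place cup on table", force_newtons=12.0, human_in_workspace=False)
risky = Action("swing arm toward person", force_newtons=80.0, human_in_workspace=True)
print(blocked_by(safe))   # []
print(blocked_by(risky))  # ['semantic', 'physical', 'operational']
```

Each layer has gaps on its own. Stacking them means a hazard has to slip through every layer at once before it reaches the hardware.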

6. Synthesis and Conclusion

Replacing traditional, brittle "if-else" scripting with AI-driven embodied reasoning marks a paradigm shift in robotics. By combining high-level semantic reasoning (ER) with real-time reactive control (VLA), Google DeepMind is enabling robots to operate in messy, unpredictable, and previously unseen environments. The core takeaway for developers is the move toward high-level goal definition: the AI handles the intermediate steps of perception, planning, and safety, so humans can focus on the desired outcomes of their robotic systems.
