Stanford Robotics Seminar ENGR319 | Spring 2026 | Integrated Learning and Planning

By Stanford Online

Constraint Optimization, World Models, Diffusion Models, Compositionality

Key Concepts

Neuro-symbolic AI: A paradigm combining neural networks (for perception, feature extraction, and probabilistic modeling) with symbolic reasoning (for planning, constraint satisfaction, and logical structure).
Constraint Optimization: A framework where tasks are defined as satisfying a set of geometric, physical, and task-specific constraints rather than direct policy mapping.
World Models: Internal representations that allow robots to simulate the outcomes of actions before execution.
Diffusion Models: Generative models used here to predict trajectories and object poses by learning energy landscapes.
Compositionality: The ability to stitch together individual skills or constraints to solve complex, long-horizon tasks.
One-Shot/Few-Shot Learning: The ability to acquire new skills from 1–10 demonstrations.
Retriever: A programming model for closed-loop robot agents that supports asynchronous execution and time-explicit typing.

1. The Paradigm Shift: From Data Fitting to Neuro-Symbolic Planning

The speaker argues that current robotics approaches, which treat intelligence as "fitting functions to data," suffer from low data efficiency and limited generalization. Companies such as Physical Intelligence achieve impressive results, but they require massive datasets even for simple tasks. Humans, in contrast, generalize from a single example by using internal world models and planning. The proposed solution is to integrate neural representations (for perception and guidance) with physical models (for stability and dynamics) within a constraint optimization framework.

2. Methodology: Constraint-Based Manipulation

Instead of training an end-to-end policy, the speaker decomposes tasks into constraints:

Rigid Body Dynamics: Handled by physical simulators.
Geometric Constraints: Handled by motion planners.
Task-Relevant Constraints: Learned from data (e.g., grasping poses, contact points).

Step-by-Step Process for One-Shot Learning:

Visual Correspondence: Use pre-trained features (e.g., DINOv2) to identify functional points on a target object based on a single demonstration.
Constraint Formulation: Define the task as a set of contacts (e.g., "mug touches tree").
Verification: Use model-based planning to verify the stability of these contacts, filtering out noisy neural predictions.
Execution: Generate trajectories that satisfy the identified constraints.
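
The first and third steps above can be illustrated with a minimal sketch. It assumes each image patch has already been encoded into a feature vector by a pre-trained model, which is not shown here. The function names, the toy feature values, and the `is_stable` callback are hypothetical stand-ins and are not taken from the talk: the sketch ranks target patches by cosine similarity to the demonstrated functional point, then keeps the best-ranked candidate that a model-based stability check accepts.

```python
import numpy as np

def rank_candidates(demo_feature, target_features):
    """Rank target patches by cosine similarity to the demonstrated point's feature."""
    demo = demo_feature / np.linalg.norm(demo_feature)
    targets = target_features / np.linalg.norm(target_features, axis=1, keepdims=True)
    similarity = targets @ demo
    # Indices sorted from most to least similar.
    return np.argsort(-similarity), similarity

def first_stable(ranked_indices, is_stable):
    """Return the best-ranked candidate that passes the stability check, or None."""
    for idx in ranked_indices:
        if is_stable(int(idx)):
            return int(idx)
    return None

# Toy data: one demo feature and three candidate patches on a new object.
demo_feature = np.array([1.0, 0.0, 0.0])
target_features = np.array([
    [0.0, 1.0, 0.0],
    [2.0, 0.1, 0.0],
    [0.5, 0.5, 0.0],
])
ranked, similarity = rank_candidates(demo_feature, target_features)

# Hypothetical verifier: pretend a physics check rejects candidate 1.
chosen = first_stable(ranked, is_stable=lambda idx: idx != 1)
```

The point of the design is that the neural ranking only proposes candidates; the stability check has the final say, so a confident but physically infeasible match is discarded in favor of the next candidate.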

3. Spatial Reasoning and Compositional Diffusion

To handle complex tasks like table setting, the system uses Compositional Diffusion Models:

Language-to-Graph: A Vision-Language Model (VLM) parses instructions into a spatial relationship graph (e.g., "apple is left of plate").
Energy-Based Diffusion: Each relationship (e.g., "left of") is associated with a dedicated diffusion model that acts as an energy function.
Gradient Composition: At inference time, the system adds the gradients of these energy fields to find an optimal configuration that satisfies all spatial constraints simultaneously.
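
Gradient composition can be sketched with hand-written energy functions. In the system described, each relationship is backed by a learned diffusion model; the analytic energies below (`left_of_grad`, `same_row_grad`), the margin, and the step size are illustrative assumptions that stand in for those learned models. What the sketch keeps is the compositional step: the gradients of all relationship energies are summed and the object positions are moved downhill on the total.

```python
import numpy as np

def left_of_grad(pos, a, b, margin=0.1):
    """Gradient of max(0, x_a - x_b + margin)^2: zero once a is left of b by the margin."""
    grad = np.zeros_like(pos)
    violation = max(0.0, pos[a, 0] - pos[b, 0] + margin)
    grad[a, 0] = 2.0 * violation
    grad[b, 0] = -2.0 * violation
    return grad

def same_row_grad(pos, a, b):
    """Gradient of (y_a - y_b)^2: pulls the two objects onto the same row."""
    grad = np.zeros_like(pos)
    diff = pos[a, 1] - pos[b, 1]
    grad[a, 1] = 2.0 * diff
    grad[b, 1] = -2.0 * diff
    return grad

def compose_and_descend(pos, relations, steps=200, lr=0.1):
    """Sum the gradients of every relation's energy and take gradient steps."""
    pos = pos.copy()
    for _ in range(steps):
        total = sum(grad_fn(pos, a, b) for grad_fn, a, b in relations)
        pos -= lr * total
    return pos

# Objects: 0 = apple, 1 = plate, 2 = fork; each row is an (x, y) position.
start = np.array([[0.5, 0.3], [0.0, 0.0], [-0.5, -0.2]])
relations = [
    (left_of_grad, 0, 1),   # apple is left of plate
    (left_of_grad, 1, 2),   # plate is left of fork
    (same_row_grad, 0, 1),  # apple and plate share a row
]
final = compose_and_descend(start, relations)
```

Because each relation contributes its own gradient, adding or removing an edge in the relationship graph changes only the `relations` list, not the optimizer. A diffusion-based version would replace the plain descent loop with a denoising schedule.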

4. Long-Horizon Planning

For tasks requiring multiple steps (e.g., washing and sorting dishes), the system employs:

Task Skeletons: Using VLMs to segment trajectories into high-level action sequences.
Transition Models: Predicting the future state of objects after an action to check for collisions or feasibility.
Search-Based Planning: If a proposed trajectory leads to a collision (e.g., a book blocking another), the system backtracks and samples a different trajectory from the diffusion model.
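
The interplay of skeleton, transition model, and backtracking can be sketched as a depth-first search. Everything here is a toy assumption rather than the speaker's implementation: `propose` stands in for sampling candidates from a diffusion model, `transition` for a learned transition model, and `is_feasible` for a collision check. The shelf scenario (three slots, with a tall book that only fits slot 0) is invented to force backtracking.

```python
def plan_with_backtracking(state, skeleton, propose, transition, is_feasible):
    """Depth-first search over a task skeleton; returns a list of (step, action) or None."""
    if not skeleton:
        return []
    step, remaining = skeleton[0], skeleton[1:]
    for action in propose(state, step):
        next_state = transition(state, step, action)
        if not is_feasible(next_state):
            continue  # predicted collision: try the next sample
        tail = plan_with_backtracking(next_state, remaining, propose, transition, is_feasible)
        if tail is not None:
            return [(step, action)] + tail
    return None  # every sample failed: the caller backtracks

SLOTS = (0, 1, 2)

def propose(state, step):
    # Stand-in for diffusion samples: every shelf slot is a candidate.
    return SLOTS

def transition(state, step, slot):
    # Predicted next state: the item named by this step now occupies the slot.
    return state + ((step, slot),)

def is_feasible(state):
    slots = [slot for _, slot in state]
    if len(set(slots)) != len(slots):
        return False  # two items in one slot counts as a collision
    # Toy constraint: the tall book only fits in slot 0.
    return all(slot == 0 for item, slot in state if item == "place_tall_book")

skeleton = ["place_book_a", "place_book_b", "place_tall_book"]
plan = plan_with_backtracking((), skeleton, propose, transition, is_feasible)
```

In this toy run the search first commits book A to slot 0, discovers later that the tall book has nowhere to go, and backtracks to a different placement for book A. The early choice looks feasible locally and is only ruled out by simulating the later steps.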

5. Notable Quotes and Perspectives

On Generalization: "Humans can learn from one example and then generalize reliably to different states and different goals."
On Neuro-Symbolic Integration: "The symbolic part is about that graphical structure of that constraint graph... each individual edge is associated with a neural network."
On System Engineering: "Robotics is not just a simple machine learning problem... any practical robot system should be a compositional system."

6. Real-World Applications and Research Findings

Alphabetical Shapes: The model achieved >90% success in hanging 3D-printed letters on a tree, outperforming standard policy learning, which failed on unseen geometries.
Table Setting: The system successfully coordinated two robot arms to set tables based on diverse language instructions, respecting workspace constraints.
Retriever Framework: An open-source programming model that enables asynchronous, closed-loop execution, allowing robots to adjust plans in real time (e.g., searching drawers for an object).

Synthesis and Conclusion

The speaker concludes that while foundation models are powerful, they are not a panacea for robotics. The future of general-purpose physical intelligence lies in neuro-symbolic systems that provide a "lens" into the task, allowing for modularity, interpretability, and data efficiency. By combining the generative power of neural networks with the logical rigor of planning and constraint satisfaction, robots can move beyond simple data-fitting toward systems that can reason, plan, and continuously improve through self-exploration.
