Stanford Robotics Seminar ENGR319 | Spring 2026 | Interactive Autonomy
By Stanford Online
Key Concepts
- Joint Prediction and Planning: The necessity for robots to model and anticipate the reactions of other agents (humans or robots) to their own decisions.
- Potential Games: A class of games in which a Nash equilibrium can be found by minimizing a single "potential function" rather than solving coupled optimal control problems.
- Quantal Response Equilibrium (QRE): A game-theoretic model that accounts for human noise and suboptimality by assuming agents maintain probability distributions over actions.
- Entropic Cost Equilibrium: An extension of the maximum entropy principle to multi-agent settings, allowing for the modeling of bounded rationality.
- Diffusion Policies: Generative models used to capture multimodal interaction behaviors (e.g., choosing between multiple valid ways to coordinate).
- LLM/VLM Coaching: Using Large Language Models or Vision-Language Models as "coaches" to provide curricula, reward shaping, and credit assignment in reinforcement learning (RL) pipelines.
- Credit Assignment: The challenge of determining which agent in a multi-agent system contributed to a team's success or failure.
1. Multi-Agent Interaction and Game Theory
The speaker highlights that multi-agent interaction is inherently difficult because agents are interdependent. A robot cannot optimize its path in isolation; it must account for how others will react.
- Mathematical Formalization: The lab models these interactions as dynamic games. The goal is to reach a Nash Equilibrium, where no agent can improve its outcome by unilaterally deviating from its strategy, given the others' strategies.
- Potential Games: To overcome the computational complexity of solving coupled nonlinear optimal control problems, the lab leverages the structure of potential games. By reducing the multi-agent problem to a single-agent optimization of a potential function (sum of tracking costs + pairwise collision costs), they achieved a 20x speedup in computation.
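A minimal Python sketch of the potential-game reduction described above, not the lab's implementation. It assumes each agent picks a single 2D position, that the tracking cost is the squared distance to a goal, and that the collision cost is a symmetric Gaussian penalty. The cost forms, the constants `W` and `S`, the step size, and all function names are hypothetical choices made for illustration. Because the pairwise cost is symmetric, each agent's gradient of its own cost equals the gradient of the shared potential with respect to that agent's position, so gradient descent on the single potential reaches a point where no agent gains from a small unilateral change.

```python
import math

W, S = 4.0, 1.0  # assumed collision weight and length scale

def pair_cost(p, q):
    """Symmetric collision penalty between two positions (assumed Gaussian form)."""
    d2 = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return W * math.exp(-d2 / (2 * S * S))

def agent_cost(i, xs, goals):
    """Cost agent i would minimize on its own: tracking plus its collision terms."""
    track = (xs[i][0] - goals[i][0]) ** 2 + (xs[i][1] - goals[i][1]) ** 2
    return track + sum(pair_cost(xs[i], xs[j]) for j in range(len(xs)) if j != i)

def potential(xs, goals):
    """Sum of tracking costs plus each pairwise collision cost counted once."""
    track = sum((p[0] - g[0]) ** 2 + (p[1] - g[1]) ** 2 for p, g in zip(xs, goals))
    coll = sum(pair_cost(xs[i], xs[j])
               for i in range(len(xs)) for j in range(i + 1, len(xs)))
    return track + coll

def potential_grad(xs, goals):
    """Gradient of the potential with respect to every agent's position."""
    grads = []
    for i, (p, g) in enumerate(zip(xs, goals)):
        gx, gy = 2 * (p[0] - g[0]), 2 * (p[1] - g[1])
        for j, q in enumerate(xs):
            if j != i:
                k = pair_cost(p, q) / (S * S)
                gx -= k * (p[0] - q[0])
                gy -= k * (p[1] - q[1])
        grads.append((gx, gy))
    return grads

def minimize_potential(xs, goals, step=0.05, iters=2000):
    """Plain gradient descent on the single potential function."""
    xs = [tuple(p) for p in xs]
    for _ in range(iters):
        grads = potential_grad(xs, goals)
        xs = [(p[0] - step * g[0], p[1] - step * g[1]) for p, g in zip(xs, grads)]
    return xs

if __name__ == "__main__":
    goals = [(-0.3, 0.0), (0.3, 0.0)]   # goals close enough to create a conflict
    start = [(-1.0, 0.1), (1.0, -0.1)]
    print(minimize_potential(start, goals))
```

The design point is that one optimization over the joint variable replaces a set of coupled per-agent problems; a real planner would optimize full trajectories under dynamics constraints rather than single positions.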
2. Handling Multimodality and Coordination
Interactions are often multimodal (e.g., two people approaching each other in a hallway can both keep left or both keep right).
- Equilibrium Selection: The speaker notes that simply finding one equilibrium is insufficient; robots must adapt to the specific convention (e.g., yielding left vs. right) being used by humans.
- Diffusion Policies: To handle multiple modes of interaction without a central coordinator, the lab uses decentralized diffusion policies. This allows robots to implicitly coordinate by learning from datasets containing various interaction modes, resulting in successful collaborative tasks like two robots carrying a rigid object.
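The hallway example and the equilibrium-selection point can be made concrete with a small Python sketch. It illustrates only the multiple-equilibria issue, not diffusion policies, and it is not the lab's method. The 0/1 cost table, the action names, and the rule of matching the observed human action are assumptions chosen for illustration.

```python
from itertools import product

ACTIONS = ("left", "right")

def cost(own, other):
    """Assumed hallway cost: 0 if both follow the same convention, 1 otherwise."""
    return 0.0 if own == other else 1.0

def pure_nash_equilibria():
    """Enumerate action pairs where neither agent can lower its cost alone."""
    equilibria = []
    for a, b in product(ACTIONS, repeat=2):
        a_ok = all(cost(a, b) <= cost(alt, b) for alt in ACTIONS)
        b_ok = all(cost(b, a) <= cost(alt, a) for alt in ACTIONS)
        if a_ok and b_ok:
            equilibria.append((a, b))
    return equilibria

def select_robot_action(observed_human_action):
    """Pick the equilibrium consistent with the convention the human is using."""
    for robot, human in pure_nash_equilibria():
        if human == observed_human_action:
            return robot
    raise ValueError("no equilibrium matches the observed action")

if __name__ == "__main__":
    print(pure_nash_equilibria())
    print(select_robot_action("right"))
```

Both conventions are valid equilibria of this game, so a solver that returns only one of them can still collide with a human who follows the other.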
3. Learning from Demonstrations (IRL)
The speaker argues that learning from a human in isolation is insufficient for understanding social interaction.
- Inverse Reinforcement Learning (IRL): The lab uses IRL to infer the underlying cost functions of humans. By observing interactions rather than isolated behaviors, they gain better insights into human objectives.
- Entropic Cost Equilibrium: By incorporating a weight parameter ($\beta$) for rationality, they can model human suboptimality and noise, leading to significantly higher prediction accuracy for pedestrian motion compared to standard imitation learning.
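As a rough Python illustration of how a rationality weight shapes predicted behavior, the sketch below computes a maximum-entropy (softmax) distribution over a finite set of actions for a single agent. It shows one agent's noisy response only, not the full multi-agent equilibrium. The convention that larger $\beta$ means behavior closer to exact cost minimization, the function name, and the example costs are assumptions, not details from the talk.

```python
import math

def quantal_response(costs, beta):
    """Return P(a) proportional to exp(-beta * cost(a)) over a finite action set.

    beta = 0 gives a uniform distribution (pure noise); large beta concentrates
    probability on the lowest-cost action (near-rational behavior).
    """
    lowest = min(costs)  # shift costs for numerical stability
    weights = [math.exp(-beta * (c - lowest)) for c in costs]
    total = sum(weights)
    return [w / total for w in weights]

if __name__ == "__main__":
    action_costs = [1.0, 2.0, 3.0]  # hypothetical costs of three candidate motions
    for beta in (0.0, 1.0, 50.0):
        print(beta, quantal_response(action_costs, beta))
```

In an inverse setting, both the cost parameters and $\beta$ would be fit to observed interactions so that the predicted distribution matches how often people actually take each action.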
4. Coaching Robots with Foundation Models
A major portion of the talk focuses on using LLMs/VLMs as "coaches" to improve Multi-Agent Reinforcement Learning (MARL).
- Curriculum Generation: LLMs break down complex tasks (e.g., humanoid running) into a sequence of simpler subtasks, which is more effective than one-shot reward shaping.
- LLM/VLM Critics: The lab uses LLMs as critics to solve the credit assignment problem. By providing the LLM with training curves or visual representations of trajectories, the model identifies which agent contributed to a success or failure, allowing for much faster and more stable learning than traditional algorithms like QMIX or MAPPO.
- Implementation: These models are used during the training phase (zero-shot or few-shot) to provide feedback, not during real-time execution, thus avoiding high inference latency.
5. Perception and Continual Learning
- Distributed Optimization: The lab applies ADMM (Alternating Direction Method of Multipliers) and consensus algorithms to allow multiple robots to map environments collaboratively using modern techniques like NeRFs and Gaussian Splats.
- Continual Learning: The speaker frames continual learning as a multi-agent consensus problem where the "past self" and "current self" must reach an agreement on model parameters. This prevents "catastrophic forgetting" while allowing the robot to adapt to changes in the environment (e.g., moving furniture).
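The consensus view of continual learning can be sketched in Python with the standard global-consensus form of ADMM on a toy problem. This is an illustration under strong assumptions, not the lab's algorithm: the model has a single scalar parameter, and the "past self" and "current self" each hold a quadratic loss `0.5 * a * (theta - t) ** 2`, where the curvature `a` and target `t` are hypothetical stand-ins for how strongly each dataset constrains the parameter.

```python
def consensus_admm(curvatures, targets, rho=1.0, iters=200):
    """Global-consensus ADMM for scalar quadratic losses 0.5 * a * (x - t)^2.

    Each agent keeps a local copy x_i and a scaled dual variable u_i; z is the
    shared consensus value that all local copies are driven toward.
    """
    n = len(targets)
    x = [0.0] * n
    u = [0.0] * n
    z = 0.0
    for _ in range(iters):
        # Local step: closed-form minimizer of f_i(x) + (rho / 2) * (x - z + u_i)^2.
        x = [(a * t + rho * (z - ui)) / (a + rho)
             for a, t, ui in zip(curvatures, targets, u)]
        # Consensus step: average of the local copies plus their duals.
        z = sum(xi + ui for xi, ui in zip(x, u)) / n
        # Dual step: accumulate each agent's remaining disagreement with z.
        u = [ui + xi - z for ui, xi in zip(u, x)]
    return z, x

if __name__ == "__main__":
    # "Past self" prefers 0.0 with curvature 1; "current self" prefers 4.0 with curvature 3.
    z, local_copies = consensus_admm([1.0, 3.0], [0.0, 4.0])
    print(z, local_copies)
```

For these quadratic losses the agreed value is the curvature-weighted average of the two targets, so neither the old nor the new data is discarded; the same update pattern extends to several robots that each hold part of a shared map.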
Synthesis and Conclusion
The core takeaway is that multi-agent coordination is a structural problem that can be simplified by leveraging game-theoretic properties (potential games) and modern foundation models. By treating LLMs as coaches that provide curricula and credit assignment, the lab has successfully enabled robots to perform complex, collaborative tasks that were previously considered "hopeless" with standard RL. The speaker emphasizes that while these methods are highly effective for high-level decision-making, they are not yet intended for high-precision, low-latency control tasks. Future work will continue to explore the balance between human-in-the-loop data collection and automated coaching.