Reinforcement Learning for Agents - Will Brown, ML Researcher at Morgan Stanley

By AI Engineer


Key Concepts

  • Reinforcement Learning (RL) for Agents
  • Agents vs. Pipelines/Workflows
  • Exploration and Exploitation in RL
  • GRPO Algorithm
  • Rubric Engineering
  • Multi-Step Environments
  • AI Engineering in the RL Era

Introduction

Will Brown from Morgan Stanley discusses the potential of reinforcement learning (RL) for developing more autonomous and capable agents. He contrasts current LLM applications, which primarily function as chatbots and reasoners, with the vision of true agents capable of taking actions and achieving complex goals. The talk explores the limitations of current approaches, the promise of RL, and the challenges and opportunities in building an ecosystem for agentic RL.

Current State of LLMs and Agents

  • LLMs as Chatbots and Reasoners: Current LLMs excel at question answering and interactive problem-solving, corresponding to the first two levels (chatbots and reasoners) of OpenAI's five-level framework.
  • Agents vs. Pipelines: A distinction is made between agents, which possess a higher degree of autonomy, and pipelines (workflows), which require significant manual engineering to define decision trees and prompt refinements.
  • Limited Autonomy: Most current "agents" are essentially pipelines with tight feedback loops, requiring frequent user interaction. Truly autonomous agents capable of extended independent operation (e.g., Devin, Operator, OpenAI's Deep Research) are still relatively rare.

The Promise of Reinforcement Learning

  • Traditional RL Definition: An agent interacts with an environment with a goal, learning to improve its performance over time through repeated interaction and feedback.
  • Addressing Performance Gaps: RL offers a potential solution for improving agent performance beyond the limitations of prompt tuning, particularly when models struggle to achieve desired success rates.
  • DeepSeek's R1 Model: The DeepSeek R1 paper demonstrated the power of RL in training models for complex reasoning tasks. The model learned long chains of thought as an emergent strategy through reinforcement learning, rather than being explicitly programmed.
  • Exploration and Exploitation: The core principle of RL involves exploring different actions and exploiting those that yield positive rewards.
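
To make the explore/exploit trade-off concrete, the toy sketch below shows an epsilon-greedy bandit: most of the time the agent exploits the action with the best estimated value, but with some small probability it explores a random one and updates its estimate from the observed reward. This is a textbook illustration of the principle, not an example from the talk.

```python
import random

# Toy epsilon-greedy bandit: a textbook illustration of exploration vs.
# exploitation, not something specific to LLM training.

def epsilon_greedy(q_values: dict[str, float], epsilon: float = 0.1) -> str:
    """With probability epsilon explore a random action; otherwise exploit
    the action with the highest estimated value."""
    if random.random() < epsilon:
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)

def update_estimate(q_values: dict[str, float], action: str, reward: float,
                    lr: float = 0.1) -> None:
    """Nudge the value estimate toward the reward that was actually observed."""
    q_values[action] += lr * (reward - q_values[action])
```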

GRPO Algorithm Explained

  • Conceptual Simplicity: The GRPO algorithm, used by DeepSeek, is presented as a conceptually simple approach to RL.
  • Process: For a given prompt, the model generates multiple completions, scores each one, and then adjusts its behavior to favor those with higher scores (a minimal sketch follows this list).
  • Single-Turn Focus: GRPO, in its initial application, is primarily used in single-turn reasoning scenarios, not yet fully extended to multi-step agentic environments.
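
The sketch below illustrates the group-relative scoring at the heart of this loop: sample a group of completions for one prompt, score them, and normalize each score against the group average. It is a minimal illustration; `sample_completion` and `reward_fn` are hypothetical placeholders, and the full algorithm also applies a clipped policy-gradient loss and a KL penalty that are omitted here.

```python
import statistics

# Minimal sketch of GRPO's group-relative scoring. `sample_completion` and
# `reward_fn` are hypothetical placeholders; the full algorithm also applies
# a clipped policy-gradient loss and a KL penalty, omitted here.

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Score each completion relative to its group: (reward - mean) / std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

def grpo_scoring_step(prompt: str, sample_completion, reward_fn, group_size: int = 8):
    # 1. Generate a group of completions for the same prompt.
    completions = [sample_completion(prompt) for _ in range(group_size)]
    # 2. Score each completion (e.g., with a rubric of reward functions).
    rewards = [reward_fn(prompt, c) for c in completions]
    # 3. Completions above the group average get positive advantages and are
    #    reinforced; those below get negative advantages and are discouraged.
    advantages = group_relative_advantages(rewards)
    return list(zip(completions, rewards, advantages))
```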

Challenges and Opportunities in Agentic RL

  • Extending RL to Autonomous Systems: The key challenge lies in extending RL techniques to create more powerful, agentic, and autonomous systems.
  • OpenAI's Deep Research: OpenAI's Deep Research project, which utilizes end-to-end reinforcement learning for complex research tasks, demonstrates the potential of RL for achieving greater autonomy.
  • Limitations of Current RL Agents: Current RL-powered agents, while impressive, are not yet capable of handling all tasks or solving all types of problems. They excel at specific skills but may struggle with out-of-distribution tasks or highly manual calculations.
  • Infrastructure Gaps: The infrastructure for building and training RL agents is still nascent. There is a need for better API services, open-source tools, and best practices.
  • Unknowns: Key questions remain regarding the cost of RL training, the required model size, the generalizability of learned skills, and the design of effective rewards and environments.

Rubric Engineering: A Key Element

  • Inspiration from R1 Replication: A personal project replicating R1-style GRPO training on a small language model sparked significant interest due to its simplicity and ease of modification.
  • Definition: Rubric engineering involves designing the reward functions that guide the model's learning process (a minimal example follows this list).
  • Beyond Simple Evaluation: Rubrics can include rewards for specific behaviors, such as following a particular format or using the correct language, in addition to the ultimate task outcome.
  • Creative Exploration: There is significant opportunity for creativity in designing rubrics, including using LLMs to design them, autotuning them, and incorporating LLM judges into the scoring system.
  • Reward Hacking: Caution is advised regarding reward hacking, where models may find unintended ways to maximize their reward without actually learning the desired task.
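
A minimal example of what such a rubric might look like in code: a handful of small reward functions whose scores are summed, so the model can earn partial credit for formatting before it ever produces a correct answer. The `<think>`/`<answer>` tags and the 0.5/1.0 weights are assumptions for illustration, not the rubric from the talk.

```python
import re

# Hypothetical rubric: small reward functions whose scores are summed.
# The <think>/<answer> tags and weights are illustrative assumptions,
# not the rubric used in the talk.

def format_reward(completion: str) -> float:
    """Partial credit for following a <think>...</think><answer>...</answer> format."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 0.5 if re.search(pattern, completion, re.DOTALL) else 0.0

def correctness_reward(completion: str, target: str) -> float:
    """Full credit only when the extracted answer matches the target."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    answer = match.group(1).strip() if match else ""
    return 1.0 if answer == target.strip() else 0.0

def rubric_score(completion: str, target: str) -> float:
    # Summing partial rewards gives the model learning signal for intermediate
    # behaviors (formatting, language) before it can solve the task, but every
    # extra reward term is also a potential target for reward hacking.
    return format_reward(completion) + correctness_reward(completion, target)
```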

Open-Source Framework for Multi-Step Environments

  • Leveraging Existing Agent Frameworks: The goal is to create a framework that allows developers to integrate RL into existing agent frameworks and environments.
  • Simplified Interaction Protocol: The framework lets developers define an interaction protocol between the model and the environment without needing to manage the underlying model weights or tokens (a hypothetical sketch of such a protocol follows this list).
  • Automated Training: Once the environment is defined, the framework automates the RL training process, allowing the model to learn and improve over time based on the defined rewards.
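
As a rough illustration of that interaction protocol, the hypothetical sketch below defines only two hooks: how the environment responds to the model's latest message and when an episode is finished. Everything else (weights, tokens, sampling, the training loop) would be handled by the framework. The class and method names are assumptions, not the API of any particular library.

```python
from dataclasses import dataclass

# Hypothetical sketch of the interaction protocol such a framework might ask
# developers to define. This is not the actual API of any specific library;
# the framework itself would own the model weights, tokens, and training loop.

@dataclass
class Message:
    role: str      # "user", "assistant", or "tool"
    content: str

@dataclass
class MultiStepEnv:
    max_turns: int = 5

    def env_response(self, history: list[Message]) -> Message:
        """Return the environment's reply to the model's latest message,
        e.g. a tool result, search output, or error message."""
        last = history[-1].content
        return Message(role="tool", content=f"observation for: {last[:40]}")

    def is_completed(self, history: list[Message]) -> bool:
        """End the episode when the model emits a final answer or the turn budget runs out."""
        assistant_turns = sum(1 for m in history if m.role == "assistant")
        return assistant_turns >= self.max_turns or "<answer>" in history[-1].content
```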

AI Engineering in the RL Era

  • Skills vs. Knowledge: It is difficult to instill a skill in a model through prompting alone. RL allows models to acquire skills through trial and error.
  • Fine-Tuning Remains Relevant: Fine-tuning, particularly with RL, remains an important technique for improving model performance.
  • Transferable Skills: The skills and knowledge gained from traditional AI engineering (e.g., prompt engineering, evaluation) are directly applicable to building RL-powered agents.
  • Evals and Prompts vs. Environments and Rubrics: The challenge of building environments and rubrics for RL is similar to the challenge of building evals and prompts for traditional LLM applications.
  • Continued Importance of Existing Tools: Good monitoring tools and a robust ecosystem of companies and platforms will be essential for supporting the development of RL-powered agents.

Conclusion

The talk concludes by emphasizing that while the integration of RL into AI engineering is still in its early stages, it holds significant promise for unlocking true autonomous agents. The skills and tools developed in traditional AI engineering will be crucial for navigating the challenges and opportunities of this emerging field. The future may involve a greater emphasis on reinforcement learning to achieve truly autonomous agents and organizations powered by language models.
