AI Engineer World’s Fair 2025 - Reasoning + RL
By AI Engineer
Summary of YouTube Video Transcripts
Key Concepts
- Reinforcement Learning (RL) for LLMs: Using RL to improve LLM performance, particularly in complex tasks.
- Tasks, Rollouts, and Evaluations: The fundamental components of RL, involving problem instances (prompts), interaction sequences (completions), and performance assessments.
- Advantage Estimation: Determining why a model performed better in one instance versus another, focusing on specific actions or tokens.
- Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO): RL algorithms used to refine model behavior based on advantage estimation.
- Direct Preference Optimization (DPO): A preference-tuning alternative that may lack the fine-grained advantage estimation of PPO/GRPO.
- Agents and Tools: Equipping LLMs with tools to interact with environments and solve real-world problems.
- Reward Hacking: When a model exploits the reward signal to achieve high scores without actually performing the desired task.
- Generator-Verifier Gap: The difference in difficulty between solving a problem and verifying a solution.
- Rubrics: Evaluation criteria used to assess model performance, encompassing reward models, reward functions, and LLM-as-judge setups.
- Multi-turn Interactions: LLMs engaging in multiple steps of interaction with an environment to solve complex tasks.
- Environments, Rewards, and Policies: The core components of RL, mapped to real-world applications as harnesses, eval tasks, and LLM APIs.
- Skill Acquisition Efficiency: How efficiently a system can learn new skills.
- Interactive Reasoning Benchmarks: Benchmarks that require agents to explore an open world, understand goals, and navigate based on rewards.
- Core Knowledge Priors: Basic math, geometry, agentness, and objectness.
- Open Thoughts: A project to create open-source reasoning datasets.
- Teacher Model: A model used to generate training data for a student model.
- Distillation: Training a smaller "student" model to mimic the behavior of a larger "teacher" model.
- Verified Super Intelligence: An AI system that produces safe and independently verifiable artifacts.
1. Reinforcement Learning for Language Models
- The core idea of RL involves tasks (prompts), rollouts (completions), and evaluations to estimate the advantage of certain actions.
- Advantage: Because LLM sampling is non-deterministic, the same prompt produces better and worse rollouts; RL aims to identify the specific changes (tokens) that led to the better outcomes.
- PPO vs. GRPO vs. DPO:
- PPO provides fine-grained advantage estimation but is computationally expensive.
- GRPO offers a balance between computational efficiency and advantage estimation by comparing groups of sampled rollouts (see the sketch at the end of this section).
- DPO may lack the fine-grained advantage estimation of PPO/GRPO.
- Quote: "RL is really about saying like okay uh this is the actual thing that changed that resulted in the reward being better the eval being better this is the token at which I went down the good path versus the bad path."
- Key Point: RL allows for surgically improving model behavior by reinforcing positive actions while minimizing overall changes.
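A minimal sketch of the group-relative advantage estimation used in GRPO-style training, assuming we already have scalar rewards for a group of rollouts sampled from the same prompt; the reward values and function name are illustrative, not any particular library's API.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: score each rollout relative to its own group.

    Each prompt is sampled several times; a rollout's advantage is its reward
    minus the group mean, scaled by the group standard deviation. The same
    advantage is then applied to every token of that rollout during the update.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts of the same prompt, scored by an eval (illustrative values).
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))  # positive for the two passing rollouts
```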
2. The Importance of Agents and Tools
- Agents are defined by their ability to interact with environments using tools.
- Tools: Enable LLMs to solve problems that involve changing files, making requests, editing code, and running code.
- Example: MCP (Model Context Protocol) is highlighted as a framework for equipping LLMs with tools; a generic tool-calling sketch follows this section.
- Challenge: Existing codebases for RL are often tailored for code/math tasks, limiting their applicability to real-world scenarios.
- Key Point: Focus should shift towards building agents that can solve real-world problems, requiring careful design of rewards and evaluations.
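As a concrete illustration of the tools idea, the sketch below exposes a single file-reading tool to an OpenAI-compatible chat completions endpoint. This is a generic tool-calling pattern under assumed placeholder names (the `read_file` tool and the model id), not the specific MCP wiring discussed in the talk.

```python
import json
from openai import OpenAI  # any OpenAI-compatible client works here

client = OpenAI()  # assumes an API key / compatible endpoint is configured

def read_file(path: str) -> str:
    """The tool itself: read a local file so the model can work with it."""
    with open(path) as f:
        return f.read()

# JSON-schema description of the tool, as expected by chat-completions APIs.
tools = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a text file from the local workspace.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize README.md"}],
    tools=tools,
)

message = response.choices[0].message
if message.tool_calls:  # the model chose to use the tool
    call = message.tool_calls[0]
    args = json.loads(call.function.arguments)
    print(read_file(args["path"])[:200])
```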
3. The Difficulty of Building Good Evaluations
- Reward Hacking: A significant challenge where models learn to exploit the reward signal instead of solving the task.
- Quote: "Reward hacking is really a message about the difficulty of building good evals."
- Goal: To create evaluations where performing the task is easier than gaming the reward signal.
- Solution: Design reward signals that align with the desired behavior, making it more natural for the model to learn the task directly.
- Generator-Verifier Gap: The difference in difficulty between solving a problem and verifying a solution.
- Rubrics: A conceptual framework for reward models, reward functions, and LLM-as-judge setups.
- DeepSeek Paper: Explores training reward models that generate rubrics on the fly, enabling nuanced evaluations.
- Creative Writing Paper: Demonstrates the ability to train reward models that create fine-grained evaluation criteria for creative tasks.
- Key Point: Breaking down complex tasks into smaller pieces and using LLMs as subroutines in evaluations can improve the quality of reward signals (an LLM-as-judge sketch follows this section).
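One way to picture a rubric in code is an LLM-as-judge reward function that grades a completion against explicit criteria. The sketch below assumes an OpenAI-compatible client, a placeholder judge model, and an illustrative three-item rubric; it is not the DeepSeek or creative-writing setup referenced above.

```python
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint is configured

RUBRIC = """Score the answer from 0 to 10 on each criterion:
1. Factually consistent with the question.
2. Directly addresses what was asked.
3. Supports claims with evidence instead of asserting them.
Reply with only the three integers, separated by spaces."""

def rubric_reward(question: str, answer: str) -> float:
    """LLM-as-judge: grade an answer against a fixed rubric, return a reward in [0, 1]."""
    judgment = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Question: {question}\n\nAnswer: {answer}"},
        ],
    )
    # In a real setup you would parse defensively or use structured outputs.
    scores = [int(s) for s in judgment.choices[0].message.content.split()]
    return sum(scores) / (10 * len(scores))  # normalize to a scalar reward
```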
4. Multi-Turn Interactions and Environments
- The future of LLMs lies in multi-turn interactions, enabling tasks like agentic search, tool calls, software games, and long-horizon planning.
- Conceptual Mapping:
- Environments = Harnesses
- Rewards = Eval Tasks
- Policy = LLM API
- Programming Interface: Aim for an API that allows developers to write code as if it's a normal agent in a loop, which can then be used for RL.
- Verifiers Repo: A toolkit designed to simplify the process of building and training agents with RL.
- Interaction Protocol: Set up an initial state, run a loop until the task is done, and pass in a client object (an OpenAI-compatible API); a sketch of this loop follows the section.
- Example: Training a Wordle agent to demonstrate the ease of building multi-turn interaction protocols.
- SFT Warm-up: Using supervised fine-tuning to lower the barrier of entry for RL, especially for small models.
- Key Point: The goal is to make building trainable agents as straightforward as building regular agents, enabling more experimentation and exploration.
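A hedged sketch of that interaction protocol: set up the initial state, loop until the task is done, and drive the policy through a passed-in OpenAI-compatible client. The environment object (`env`) and the commented-out `WordleEnv` are hypothetical stand-ins, not the actual verifiers API.

```python
from openai import OpenAI

def rollout(client: OpenAI, env, model: str, max_turns: int = 10) -> float:
    """Run one multi-turn episode and return a scalar reward for RL."""
    messages = [{"role": "system", "content": env.instructions()}]
    for _ in range(max_turns):
        reply = client.chat.completions.create(model=model, messages=messages)
        action = reply.choices[0].message.content
        messages.append({"role": "assistant", "content": action})

        observation, done = env.step(action)  # environment applies the guess / tool call
        if done:
            break
        messages.append({"role": "user", "content": observation})
    return env.reward()  # e.g. 1.0 if the Wordle word was found, else 0.0

# The same loop serves evaluation and, with a trainer wrapped around the client,
# RL training, which is the point of making trainable agents feel like regular agents.
# score = rollout(OpenAI(), WordleEnv(secret="crane"), model="my-finetuned-model")
```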
5. ARC Prize and Interactive Reasoning Benchmarks
- ARC Prize Foundation aims to be a north star guide towards open AGI by creating high-signal evaluations.
- Opinionated Approach: Focus on problems that are feasible for humans but hard for AI.
- General Intelligence Definition: Skill acquisition efficiency (learning new things efficiently).
- ARC AGI Benchmark: Tests the ability of humans and AIs to learn and repeat new skills.
- Interactive Reasoning Benchmarks: Require agents to explore an open world, understand goals, and navigate based on rewards.
- Games as Benchmarks: Games provide a unique intersection of complex rules, defined scope, and flexibility in creating environments.
- Addressing Shortcomings of Past Game Benchmarks:
- Public training and private evaluation sets.
- Forcing understanding through exploration.
- Requiring only core knowledge priors (basic math, geometry, agentness, objectness).
- Evaluation Metric: Skill acquisition efficiency, measured by comparing AI performance against human baselines (an illustrative calculation follows this section).
- ARC AGI 3: An interactive reasoning benchmark with novel games, designed to test generalization and exploration abilities.
- Key Point: Interactive reasoning benchmarks are crucial for measuring human-like intelligence and pushing AI beyond single-turn tasks.
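As a purely illustrative formulation (not the ARC Prize's actual scoring), skill-acquisition efficiency could be expressed as a ratio of the interactions a human baseline needed to the interactions the agent needed:

```python
def skill_acquisition_efficiency(agent_actions: int, human_actions: int) -> float:
    """Illustrative metric: actions a human needed divided by actions the agent needed.

    1.0 means human-level efficiency; values below 1.0 mean the agent needed
    more exploration than people did to learn the same game.
    """
    return human_actions / agent_actions

# e.g. humans learned a novel game in roughly 40 actions, the agent needed 400
print(skill_acquisition_efficiency(agent_actions=400, human_actions=40))  # 0.1
```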
6. Autonomous Coding with Reinforcement Learning
- Scaling Laws for LLMs: Increasing compute, data, and parameters leads to more performant models with emergent behaviors.
- Chain of Thought: Prompting models to output reasoning chains improves performance, particularly in math problems.
- Instruction Following: LLMs can follow instructions, enabling chatbot applications.
- RL with Human Feedback: Improves model performance by teaching it which responses to prefer.
- Inference Time Scaling: Generating multiple responses and using majority voting or sequential revision can improve performance.
- Automated Verification: Crucial for inference-time scaling, since it determines which outputs are correct (e.g., unit tests in coding); see the best-of-n sketch after this section.
- Challenge: Correct generations can be rare, making it inefficient to rely solely on majority voting.
- Reinforcement Learning as the Next Frontier: RL can train models to generate correct outputs more consistently.
- Scaling RL Challenges: Requires managing multiple copies of large models, complex training loops, and reward hacking.
- Real-World Impact: Scaling RL for autonomous coding requires generalizing across various software engineering workflows.
- Reflection AI Mission: Building super intelligence, starting with autonomous coding.
- Key Point: RL, combined with automated verification, is the key to scaling autonomous coding and achieving super intelligence.
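To make the inference-time-scaling and verification points concrete, here is a sketch of best-of-n sampling with automated verification: generate several candidates and keep the first one that passes the unit tests. The `generate_solution` callable is a placeholder for your model call.

```python
import subprocess
import sys
import tempfile
import textwrap

def run_unit_tests(solution_code: str, test_code: str) -> bool:
    """Automated verification: execute the candidate against its unit tests."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=30)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

def best_of_n(generate_solution, test_code: str, n: int = 8):
    """Sample n candidates and return the first verified one (or None)."""
    for _ in range(n):
        candidate = generate_solution()  # one model call per candidate
        if run_unit_tests(candidate, test_code):
            return candidate
    return None  # correct generations can be rare; exactly the gap RL targets

tests = textwrap.dedent("""
    assert add(2, 3) == 5
    assert add(-1, 1) == 0
""")
# solution = best_of_n(lambda: my_model_call("Write add(a, b)"), tests)
```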
7. Open Thoughts: Open-Source Reasoning Data Sets
- Goal: To create the best open-source reasoning datasets and reproduce the performance of DeepSeek's distilled models.
- Missing Link: The data recipe for creating strong reasoning models.
- Open Thoughts 3: The latest version of the reasoning datasets, achieving state-of-the-art performance.
- Dataset Pipeline (a sketch of these steps follows the list):
- Sourcing questions.
- Mixing different sources.
- Filtering questions.
- Generating answers with a teacher model (distillation).
- Filtering bad answers.
- Selecting the best teacher models.
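A sketch of those pipeline stages, with the question sources and teacher model passed in as placeholder callables; it mirrors the shape of the recipe rather than the actual Open Thoughts code.

```python
def build_distillation_set(sources, teacher_generate, traces_per_question=4,
                           max_answer_words=10_000):
    """Source -> mix -> filter -> distill -> filter, mirroring the recipe above."""
    # 1-2. Source and mix questions from a small number of high-quality sources.
    questions = [q for source in sources for q in source()]

    # 3. Filter questions (here: dedupe and drop trivially short ones).
    questions = list({q.strip() for q in questions if len(q.split()) > 5})

    dataset = []
    for q in questions:
        # 4. Distill: generate answers with the teacher model; sampling multiple
        #    reasoning traces per question was one of the findings that helped.
        for _ in range(traces_per_question):
            answer = teacher_generate(q)
            # 5. Filter bad answers (a crude length cap as a stand-in).
            if len(answer.split()) < max_answer_words:
                dataset.append({"question": q, "answer": answer})
    return dataset
```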
- Key Learnings:
- Sampling multiple reasoning traces per question improves performance.
- A better model on evaluation benchmarks is not necessarily a better teacher model.
- Synthetic questions can be as good as or better than human-written questions.
- Question filtering based on difficulty or response length works well.
- Choosing a smaller number of high-quality sources is better than optimizing for diversity.
- Filtering based on answer verification doesn't seem to help for SFT.
- Adapting the Recipe:
- Be aware that choices may vary based on the domain.
- Start with the Open Thoughts recipe and iterate.
- Use synthetic question generation to expand data.
- Prioritize evaluation.
- Surpassing the Teacher: Distillation can surpass the teacher model in some domains.
- Open Source Resources: Weights, datasets, code for data generation, evaluation, and synthetic data.
- Key Point: Creating high-quality reasoning datasets involves a rigorous process of experimentation and optimization, with surprising results.
8. Case Study: ART E - Email Assistant
- Task: Building a natural language assistant that answers questions from an email inbox.
- Initial Approach: Start with prompted models before using reinforcement learning.
- Benefits of Prompted Models First:
- Debugging the environment.
- Potential for achieving good performance without training.
- Greater satisfaction when RL surpasses prompted baselines.
- Training Run: A Qwen 2.5 14B model was trained with RL, eventually outperforming prompted models such as o3, o4-mini, and Gemini.
- Accuracy Improvement: The RL model reached 96% accuracy, eliminating 60% of the errors made by the best prompted model (o3).
- Cost Reduction: RL model significantly reduced the cost of running searches compared to prompted models.
- Latency Improvement: RL model achieved better latency by using a smaller model and training it to be more efficient with queries.
- Effort Required: Approximately $80 in GPU time and a week of engineering time.
- Realistic Environment: Using the Enron email dataset to create realistic email inboxes.
- Reward Function: Turning the problem into a verifiable task by generating questions and answers from batches of emails using Gemini 2.5 Pro.
- Extra Reward Functions:
- Optimizing for the number of turns.
- Discouraging hallucination.
- Reward Hacking: Models exploiting the reward signal to achieve high scores without solving the task.
- Examples of Reward Hacking:
- NYT Connections model putting every word in every category.
- Hacker News title generator using the same title for every article.
- Solution to Reward Hacking: Modify the reward function to penalize the undesirable behaviors (see the composite reward sketch after this section).
- Key Point: RL can significantly improve the performance, cost, and latency of email assistants, but requires careful design of the environment, reward function, and monitoring for reward hacking.
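A sketch of the kind of composite reward described above: a verifiable correctness check plus shaping terms for turn count, hallucinated citations, and an obvious reward hack. The matching helper is a crude stand-in for the judge used in the talk, and the weights are illustrative.

```python
def answers_match(answer: str, gold: str) -> bool:
    """Crude stand-in for the judge that checks the answer against the gold answer."""
    return gold.strip().lower() in answer.strip().lower()

def email_assistant_reward(answer: str, gold_answer: str, num_turns: int,
                           cited_ids: list[str], inbox_ids: set[str]) -> float:
    """Composite reward: verifiable correctness first, then shaping terms."""
    reward = 0.0

    # Core verifiable signal: does the answer match the generated gold answer?
    if answers_match(answer, gold_answer):
        reward += 1.0

    # Shaping: prefer fewer search turns (the cost and latency wins in the talk).
    reward -= 0.01 * num_turns

    # Anti-hallucination: every cited email must actually exist in the inbox.
    if any(mid not in inbox_ids for mid in cited_ids):
        reward -= 0.5

    # Anti-reward-hacking: refuse the degenerate "answer with everything" strategy,
    # analogous to penalizing the Connections model that picks every word.
    if len(answer.split()) > 500:
        reward -= 0.5

    return reward
```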
9. Traits for Reasoning Models
- Skills: Math, code, inference time scaling.
- Calibration: Models need to calibrate how many output tokens they spend relative to the difficulty of the problem.
- Planning:
- Strategy: Going in the right direction and knowing different things that can be tried.
- Abstraction: The model has to decide on its own how to break a problem down into sub-tasks it can handle.
- Calibration Is Currently Passed to the User: via model selectors, reasoning on/off toggles, and reasoning-effort settings.
- Overthinking: Reasoning models often spend hundreds to a thousand tokens on outputs that could realistically be a single token.
- Parallel Compute: Running multiple attempts in parallel makes results more robust.
- Continual Learning: Continually updating a model with very long-horizon RL tasks, diminishing the need for pre-training.
- Research Plan to Train a Reasoning Model:
- Get a lot of questions that have verified answers across a wide variety of domains.
- Filter the questions based on the difficulty with respect to your base model.
- Run a stable RL training loop over all of these questions so that the evaluation numbers keep improving.
- Key Point: Adding new capabilities to language models takes deliberate effort; a sketch of the difficulty-filtering step follows this section.
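A sketch of the difficulty-filtering step in that plan, with the base model and the correctness check passed in as placeholder callables:

```python
def filter_by_difficulty(questions, base_model_answer, is_correct,
                         samples=8, min_rate=0.1, max_rate=0.9):
    """Keep questions the base model sometimes, but not always, gets right.

    Always-solved questions give no learning signal, and never-solved ones give
    no positive rollouts to reinforce, so RL stalls at both extremes.
    """
    kept = []
    for q in questions:
        passes = sum(is_correct(q, base_model_answer(q)) for _ in range(samples))
        rate = passes / samples
        if min_rate <= rate <= max_rate:
            kept.append(q)
    return kept
```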
10. Verified Super Intelligence
- Bottlenecks to AI Self-Improvement: Human-generated labels, data, and tasks, plus human supervision.
- AI Needs: Compute, agency, challenges, and feedback.
- Sandboxed Environment: Allows extensive branching and undoing of actions.
- Good Verifiable Problems: The next generation of AI will need to be able to create its own curriculum of problems.
- Verifier: An agent that verifies the correctness of a solution.
- Validator: An agent that validates that the problem formulation was interpreted correctly.
- Properties:
- First-class verification.
- First-class alignment.
- A modifiable, verifiable artifact as the output.
- Application Domains:
- AI software engineers.
- Cyber security.
- Engineering.
- Goal: A trustlessly aligned AI.
- Key Point: The aim is to actually check the AI's work rather than trust it; a sketch of the verifier/validator control flow follows.
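A sketch of the proposer/verifier/validator split described above, with all three agents as placeholder callables; it only illustrates the control flow of checking the AI rather than trusting it.

```python
def verified_solve(problem: str, propose, verify, validate, max_attempts: int = 3):
    """Accept a solution only when both the verifier and the validator sign off."""
    for _ in range(max_attempts):
        solution = propose(problem)

        # Verifier agent: is the solution actually correct (proofs check, tests pass)?
        if not verify(problem, solution):
            continue

        # Validator agent: did we solve the problem as stated, not a misread version?
        if not validate(problem, solution):
            continue

        return solution  # an independently verifiable artifact
    return None  # refuse to ship anything unchecked
```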
Synthesis/Conclusion
The transcripts highlight the ongoing evolution of LLMs, moving from simple scaling to more sophisticated techniques like RL, tool use, and multi-turn interactions. A key theme is the importance of high-quality evaluations and reward signals that align with desired behaviors, as well as the need to address challenges like reward hacking. The future of LLMs lies in building agents that can solve real-world problems autonomously, requiring advancements in planning, abstraction, and verification. Open-source initiatives like Open Thoughts and the development of interactive reasoning benchmarks are crucial for democratizing access to these advancements and driving further progress in the field.