Stanford CS224R Deep Reinforcement Learning | Spring 2025 | Lecture 3: Policy Gradients

Reinforcement Learning: Policy Gradients & Online Algorithms

Key Concepts:

Reinforcement Learning (RL): Learning optimal behavior through trial and error to maximize cumulative rewards.
Policy (πθ): A function defining the agent’s behavior, parameterized by θ.
Reward Function (r): A signal indicating the desirability of a state-action pair.
Trajectory (τ): A sequence of states and actions experienced by the agent.
Policy Gradients: A class of RL algorithms that directly optimize the policy parameters.
Online Algorithm: An algorithm that learns and updates its policy using new data collected from its current policy.
Offline Algorithm: An algorithm that learns from a fixed dataset without interacting with the environment.
Imitation Learning: Learning a policy by mimicking expert demonstrations.
Baseline: A quantity (commonly the average reward) subtracted from the reward to reduce the variance of policy gradient estimates without changing their expectation.
Importance Sampling: A method for estimating expectations under one distribution using samples from another.
On-Policy: An algorithm that requires data collected from the current policy for updates.
Off-Policy: An algorithm that can learn from data collected from different policies.

1. Introduction & Recap

The lecture builds upon previous discussions of reinforcement learning, framing it as maximizing the expected sum of rewards. The core components – states, actions, trajectories, reward functions, and policies – were revisited. The goal is to move beyond imitation learning (which is limited by expert performance) to algorithms that can improve through practice via online reinforcement learning. Policy gradients are introduced as the foundation for many advanced RL algorithms, including those used in robotics and language modeling.

2. Online Reinforcement Learning & Policy Gradient Overview

Online RL involves a cyclical process: initializing a policy, collecting data by running the policy in the environment, and then using that data to improve the policy. This process is repeated iteratively. The first online RL algorithm discussed is policy gradients. The key goals are to understand the intuition, implementation, and appropriate use cases for policy gradients.
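The cycle described above can be sketched in a few lines. This is only an illustration under stated assumptions: the two-armed toy environment (arm 1 pays reward 1, arm 0 pays 0), the helper names `collect`, `improve`, and `online_rl`, and the crude "nudge toward the better arm" update are all hypothetical placeholders, not the lecture's algorithm. The policy gradient update replaces the placeholder `improve` step.

```python
import random

def collect(p, rng, n=20):
    """Run the policy n times: choose arm 1 with probability p, observe a reward."""
    batch = []
    for _ in range(n):
        action = 1 if rng.random() < p else 0
        reward = 1.0 if action == 1 else 0.0  # toy environment (assumption)
        batch.append((action, reward))
    return batch

def improve(p, batch, step=0.1):
    """Placeholder update: nudge p toward the arm with the higher mean reward."""
    means = []
    for arm in (0, 1):
        rewards = [r for a, r in batch if a == arm]
        means.append(sum(rewards) / len(rewards) if rewards else 0.0)
    if means[1] > means[0]:
        p = min(0.99, p + step)
    elif means[0] > means[1]:
        p = max(0.01, p - step)
    return p

def online_rl(num_iterations=10, seed=0):
    rng = random.Random(seed)
    p = 0.5                      # 1. initialize the policy
    for _ in range(num_iterations):
        batch = collect(p, rng)  # 2. collect data by running the current policy
        p = improve(p, batch)    # 3. improve the policy using that fresh data
    return p                     # probability of choosing arm 1 after training
```

Note that each batch is gathered with the current value of `p` and then discarded, which is what makes the loop "online".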

3. Mathematical Formulation & Gradient Ascent

The objective is to maximize the expected rewards, mathematically expressed as an expectation over the probability distribution of trajectories, which is a function of the policy (πθ), environment dynamics, and reward function. Directly evaluating this expectation is intractable, so it's estimated by running the policy multiple times and averaging the observed rewards.
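As a rough illustration of that sampling-based estimate, the sketch below averages the returns of many rollouts. The toy setup is an assumption made for the example: a stateless policy that picks "right" with probability `p_right` at each of `horizon` steps and earns reward 1 for each "right". The function names are hypothetical.

```python
import random

def sample_return(p_right, rng, horizon=5):
    """Roll out one trajectory and return its sum of rewards, r(tau)."""
    total = 0.0
    for _ in range(horizon):
        action = 1 if rng.random() < p_right else 0
        total += float(action)  # reward 1 for 'right', 0 otherwise (assumption)
    return total

def estimate_objective(p_right, num_trajectories=10000, seed=0):
    """Monte Carlo estimate of J(theta): the average return over sampled trajectories."""
    rng = random.Random(seed)
    returns = [sample_return(p_right, rng) for _ in range(num_trajectories)]
    return sum(returns) / len(returns)
```

In this toy case the exact value is `horizon * p_right`, so the estimate can be checked against it; in a real environment no such closed form is available, which is why sampling is used.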

To improve the policy, gradient ascent is used. The challenge lies in calculating the gradient of the objective function with respect to the policy parameters (θ). The initial formulation involves an integral over trajectories, which is then manipulated using the "log trick": the identity ∇θ πθ(τ) = πθ(τ) ∇θ log πθ(τ), which follows from d/dx log(x) = 1/x. This turns the integral into an expectation under πθ that can be estimated with samples, and leads to the core policy gradient formula:

∇θ J(θ) = Eτ~πθ [ ∇θ log πθ(τ) * r(τ) ]

Where:

∇θ J(θ): The gradient of the objective function with respect to the policy parameters.
Eτ~πθ: The expectation over trajectories sampled from the policy πθ.
∇θ log πθ(τ): The gradient of the log probability of the trajectory under the policy.
r(τ): The sum of rewards along the trajectory.
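
The steps between the objective and this formula can be written out as follows. This is a sketch of the standard derivation; the horizon T and the symbols p(s_1) and p(s_{t+1} | s_t, a_t) for the initial-state distribution and the environment dynamics are notation assumed here, not taken from the lecture summary.

```latex
\begin{align*}
J(\theta) &= \mathbb{E}_{\tau \sim \pi_\theta}\left[ r(\tau) \right]
           = \int \pi_\theta(\tau)\, r(\tau)\, d\tau \\
\nabla_\theta J(\theta)
          &= \int \nabla_\theta \pi_\theta(\tau)\, r(\tau)\, d\tau \\
          &= \int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\, d\tau
             \qquad \text{(log trick)} \\
          &= \mathbb{E}_{\tau \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(\tau)\, r(\tau) \right] \\[6pt]
\log \pi_\theta(\tau)
          &= \log p(s_1) + \sum_{t=1}^{T} \log \pi_\theta(a_t \mid s_t)
             + \sum_{t=1}^{T-1} \log p(s_{t+1} \mid s_t, a_t) \\
\nabla_\theta \log \pi_\theta(\tau)
          &= \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)
\end{align*}
```

The last line holds because the initial-state and dynamics terms do not depend on θ, which is why the gradient can be estimated without knowing the environment dynamics.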

4. Implementation & Intuition

The policy gradient can be implemented by collecting trajectories, computing the gradient of the log probability of actions taken in those trajectories, and then summing these gradients weighted by the trajectory rewards. This is often done using automatic differentiation libraries like PyTorch.
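A minimal sketch of that procedure is shown below. To stay dependency-free it uses a hand-derived gradient for a softmax policy instead of an autodiff library, which is a simplification made here and not necessarily how the lecture implements it. The stateless toy environment (action 1 pays reward 1, action 0 pays 0) and all function names are assumptions for illustration.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_prob(logits, action):
    """Gradient of log softmax(logits)[action] w.r.t. the logits: one_hot - probs."""
    probs = softmax(logits)
    return [(1.0 if k == action else 0.0) - probs[k] for k in range(len(logits))]

def collect_trajectory(logits, rng, horizon=3):
    """Sample actions from the policy in a toy stateless environment (assumption)."""
    probs = softmax(logits)
    actions, rewards = [], []
    for _ in range(horizon):
        a = rng.choices(range(len(probs)), weights=probs)[0]
        actions.append(a)
        rewards.append(1.0 if a == 1 else 0.0)
    return actions, rewards

def policy_gradient_step(logits, rng, batch_size=32, lr=0.1):
    grad = [0.0] * len(logits)
    for _ in range(batch_size):
        actions, rewards = collect_trajectory(logits, rng)
        trajectory_return = sum(rewards)          # r(tau)
        for a in actions:                         # sum over t of grad log pi(a_t | s_t)
            g = grad_log_prob(logits, a)
            for k in range(len(grad)):
                grad[k] += g[k] * trajectory_return / batch_size
    return [x + lr * g for x, g in zip(logits, grad)]  # gradient ascent on J

def train(num_steps=200, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0]                           # start from a uniform policy
    for _ in range(num_steps):
        logits = policy_gradient_step(logits, rng)
    return softmax(logits)
```

Calling `train()` returns the action probabilities after training; in this toy problem the probability of the rewarded action should end up close to 1. With an autodiff library, the same update is typically obtained by differentiating the sum of log probabilities weighted by the trajectory return.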

The intuition behind the gradient is that it increases the likelihood of actions that led to high rewards and decreases the likelihood of actions that led to low rewards. The ∇θ log πθ(τ) term is the same gradient that appears in the imitation learning (maximum likelihood) objective, and the r(τ) term weights it by the trajectory's reward, so the update effectively imitates the agent's own successful behavior.

5. Baseline for Variance Reduction

A significant issue with the basic policy gradient is its high variance. To address this, a baseline b is subtracted from the reward, so that r(τ) is replaced by r(τ) - b. A common baseline is the average reward. Mathematically, this doesn't change the expected gradient, because the expectation of ∇θ log πθ(τ) multiplied by a constant is zero, but it reduces the variance of the gradient estimate, leading to more stable learning.
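The effect can be checked numerically with a small sketch. The assumptions are loud ones: a one-step problem with two actions, a policy parameterized by a single logit so that the score is `a - p`, hypothetical rewards of 10 and 11, and a baseline computed as the exact average reward under the policy (in practice it would be estimated from the batch).

```python
import random
import statistics

def gradient_samples(p, rewards, baseline, num_samples, seed=0):
    """Per-sample gradient estimates for a two-action policy with P(a=1) = p.

    The parameter is the logit, so d/d(logit) log pi(a) = a - p.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        a = 1 if rng.random() < p else 0
        score = a - p
        samples.append(score * (rewards[a] - baseline))
    return samples

p = 0.3
rewards = {0: 10.0, 1: 11.0}  # both rewards are large and positive (assumption)
avg_reward = p * rewards[1] + (1 - p) * rewards[0]

no_baseline = gradient_samples(p, rewards, baseline=0.0, num_samples=20000)
with_baseline = gradient_samples(p, rewards, baseline=avg_reward, num_samples=20000)

mean_plain = statistics.mean(no_baseline)
var_plain = statistics.variance(no_baseline)
mean_base = statistics.mean(with_baseline)
var_base = statistics.variance(with_baseline)
```

Both means estimate the same true gradient, p(1 - p)(11 - 10) = 0.21, but the per-sample variance with the baseline is far smaller, because the large shared offset in the rewards no longer multiplies the score.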

6. Off-Policy Policy Gradients & Importance Sampling

The standard policy gradient is "on-policy," meaning it requires new data collected from the current policy for each update. This is sample-inefficient, because data gathered before an update cannot be reused after it. To address this, importance sampling is introduced. This allows the algorithm to reuse data collected from previous policies (π) to estimate the gradient for the current policy (πθ).

The key idea is to weight the rewards from the old policy by the ratio of the new policy's probability to the old policy's probability:

∇θ J(θ) = Eτ~π [ ∇θ log πθ(τ) * (πθ(τ) / π(τ)) * r(τ) ]

However, this ratio can become unstable if the policies are very different, especially over long trajectories. A practical solution is to consider expectations over individual time steps rather than entire trajectories.
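Both points can be illustrated with a small sketch under stated assumptions: a one-step, two-action problem where action 1 pays reward 1, policies described by their probability of choosing action 1, and a single logit parameter so that the score of the new policy is `a - p_new`. The function names are hypothetical.

```python
import random

def off_policy_gradient(p_old, p_new, num_samples=50000, seed=0):
    """Estimate the gradient for the new policy from actions sampled by the old one."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        a = 1 if rng.random() < p_old else 0        # data comes from the old policy
        prob_old = p_old if a == 1 else 1 - p_old
        prob_new = p_new if a == 1 else 1 - p_new
        reward = 1.0 if a == 1 else 0.0             # toy reward (assumption)
        score = a - p_new                           # grad of log pi_new w.r.t. its logit
        total += (prob_new / prob_old) * score * reward
    return total / num_samples

def trajectory_ratio(actions, p_old, p_new):
    """Ratio of trajectory probabilities: the product of per-step action ratios.

    Initial-state and dynamics terms are shared by both policies and cancel.
    """
    ratio = 1.0
    for a in actions:
        prob_old = p_old if a == 1 else 1 - p_old
        prob_new = p_new if a == 1 else 1 - p_new
        ratio *= prob_new / prob_old
    return ratio
```

With `p_old=0.5` and `p_new=0.8`, the estimate should be close to the on-policy gradient p_new(1 - p_new) = 0.16. The second function shows why long trajectories are a problem: the ratio is a product of one factor per step, so for a 50-step trajectory of action 1 it is 1.6 raised to the 50th power, which is why per-time-step weighting is preferred in practice.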

7. Algorithm Summary & Considerations

Online Policy Gradient Algorithm: Initialize policy, collect data, estimate gradient, update policy, repeat.
Off-Policy Policy Gradient Algorithm: Same as above, but can reuse data from previous policies using importance sampling.
Key Challenges: High variance, sensitivity to reward scaling, and the need for dense rewards.
Future Directions: Actor-critic methods (to be discussed in the next lecture) and techniques like Proximal Policy Optimization (PPO) to address these challenges.

Notable Quotes:

"You can think of this as a formalization of trial and error where you're actually going to try to do different things and if you see good stuff happening then you'll try to do more of that and if you make a mistake then you'll try to stop doing that."
"Policy gradient is still noisy and is best with large batches and dense rewards."

Technical Terms:

Stochastic: Involving randomness.
Trajectory: A sequence of states and actions.
Surrogate Objective: An alternative objective function that has the same gradient as the original.
Autodiff: Automatic differentiation, a technique for computing gradients of complex functions.
Markovian State: A state that contains all the information necessary to predict the future.

Logical Connections:

The lecture progresses logically from a recap of RL fundamentals to the derivation and implementation of policy gradients. The introduction of the baseline and importance sampling are presented as solutions to specific problems encountered with the basic policy gradient algorithm (high variance and on-policy learning). The discussion consistently connects the mathematical formulation to the intuitive understanding of the algorithm.

Data & Research Findings:

While no specific research findings were presented, the lecture highlighted the practical application of policy gradients in areas like legged robotics and language modeling. The discussion of variance reduction and importance sampling reflects established techniques in the RL literature.

Conclusion:

This lecture provided a comprehensive introduction to policy gradients, a foundational algorithm in online reinforcement learning. The derivation of the gradient, the discussion of variance reduction techniques, and the introduction of importance sampling provide a solid understanding of the core concepts and challenges associated with this approach. The lecture sets the stage for future discussions of more advanced RL algorithms like actor-critic methods and PPO.
