Stanford CS221 | Autumn 2025 | Lecture 9: Policy Gradient

Key Concepts

Reinforcement Learning (RL) Setting: An agent interacts with an environment (defined by an MDP) to maximize cumulative discounted rewards.
Value Functions: $V^\pi(s)$ (expected utility of policy $\pi$ from state $s$), $Q^\pi(s, a)$ (expected utility of taking action $a$ then following $\pi$), and their optimal counterparts ($V^*$, $Q^*$).
Model-Based vs. Model-Free: Model-based methods estimate the MDP (transitions/rewards) first; model-free methods estimate values or policies directly.
On-Policy vs. Off-Policy: On-policy (e.g., SARSA) estimates the value of the policy being followed; off-policy (e.g., Q-learning) estimates the value of the optimal policy regardless of the exploration policy used to collect the data.
Bootstrapping: Updating estimates based on other estimates (e.g., $Q(s, a) \leftarrow r + \gamma Q(s', a')$) rather than waiting for the full rollout.
Function Approximation: Using parameterized models (e.g., Neural Networks) to estimate $Q$ or $\pi$ when state spaces are too large for tabular lookup.
Policy Gradient (REINFORCE): Directly optimizing policy parameters $\theta$ to maximize expected utility using the gradient of the log-probability of actions weighted by trajectory utility.
Variance Reduction: Techniques like baselines ($B(s)$) to reduce the noise in gradient estimates without introducing bias.
Actor-Critic: Hybrid methods that use an "actor" (policy) and a "critic" (value function) to balance bias and variance.

1. Reinforcement Learning Framework

The agent interacts with an environment via get_action and incorporate_feedback. The environment is a Markov Decision Process (MDP). The agent does not see the MDP directly; it only observes state-action-reward sequences. The goal is to maximize the expected utility, defined as the discounted sum of rewards over a rollout.
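As a minimal sketch of this interaction loop, the Python below pairs a toy agent with a toy environment and computes the discounted utility of one rollout. The method names get_action and incorporate_feedback come from the notes, but their argument lists, the RandomAgent class, and the four-state chain environment are assumptions made purely for illustration.

```python
import random
from typing import List


class RandomAgent:
    """Hypothetical agent exposing the two methods named in the notes."""

    def __init__(self, actions, seed=0):
        self.actions = list(actions)
        self.rng = random.Random(seed)
        self.history = []  # observed (state, action, reward, next_state) tuples

    def get_action(self, state):
        # A learning agent would consult its policy; this one acts uniformly at random.
        return self.rng.choice(self.actions)

    def incorporate_feedback(self, state, action, reward, next_state):
        # A learning agent would update its estimates here; this one only records.
        self.history.append((state, action, reward, next_state))


def step(state, action):
    """Toy chain MDP (an assumption): states 0..3, state 3 is terminal."""
    next_state = min(state + 1, 3) if action == "right" else max(state - 1, 0)
    reward = 1.0 if next_state == 3 else 0.0
    return next_state, reward


def rollout(agent, max_steps=50) -> List[float]:
    """Run one episode; the agent sees only states, actions, and rewards."""
    state, rewards = 0, []
    for _ in range(max_steps):
        action = agent.get_action(state)
        next_state, reward = step(state, action)
        agent.incorporate_feedback(state, action, reward, next_state)
        rewards.append(reward)
        state = next_state
        if state == 3:
            break
    return rewards


def discounted_utility(rewards: List[float], gamma: float) -> float:
    """Return r_1 + gamma * r_2 + gamma^2 * r_3 + ... for one rollout."""
    utility = 0.0
    # Accumulate from the last reward backwards: one multiply-add per step.
    for r in reversed(rewards):
        utility = r + gamma * utility
    return utility


agent = RandomAgent(["left", "right"])
rewards = rollout(agent)
utility = discounted_utility(rewards, gamma=0.9)
```

Note that the agent never calls step itself; the environment dynamics stay hidden behind the loop, which mirrors the point that the agent does not see the MDP directly.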

2. Value-Based Methods

Tabular Setting: Uses a lookup table for every state-action pair.
Function Approximation: Necessary for high-dimensional inputs (images, text). The Q-function is parameterized as $Q_\theta(s, a)$.
Methodology:
1. Map $(s, a)$ to a feature vector.
2. Pass through a model (Linear or MLP) to get a scalar $Q$-value.
3. Define a Squared Loss between the prediction and a Target.
4. Update $\theta$ via Stochastic Gradient Descent (SGD).
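The four steps above can be sketched for a linear model as follows. This is an illustrative sketch, not the lecture's code: the one-hot feature map, the function names, and the factor of 0.5 in the loss (chosen so the gradient has no stray constant) are all assumptions.

```python
import numpy as np


def features(state: int, action: int, num_states: int, num_actions: int) -> np.ndarray:
    """Step 1: one-hot feature vector for (state, action); an illustrative choice."""
    phi = np.zeros(num_states * num_actions)
    phi[state * num_actions + action] = 1.0
    return phi


def q_value(theta: np.ndarray, phi: np.ndarray) -> float:
    """Step 2: linear model, Q_theta(s, a) = theta . phi(s, a)."""
    return float(theta @ phi)


def sgd_step(theta: np.ndarray, phi: np.ndarray, target: float, lr: float) -> np.ndarray:
    """Steps 3-4: one SGD step on the loss 0.5 * (Q_theta(s, a) - target)^2."""
    prediction = q_value(theta, phi)
    # The target is treated as a constant, so the gradient flows only through the prediction.
    gradient = (prediction - target) * phi
    return theta - lr * gradient


theta = np.zeros(6)                      # 3 states x 2 actions
phi = features(state=1, action=0, num_states=3, num_actions=2)
for _ in range(100):
    theta = sgd_step(theta, phi, target=2.0, lr=0.5)
```

With one-hot features this reduces to the tabular update; swapping in a richer feature map or an MLP changes only steps 1 and 2.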
Target Calculation:
- Monte Carlo: Uses the full rollout utility (no bias, high variance).
- SARSA/Q-Learning: Uses bootstrapping (lower variance, potential bias). Q-learning uses $\max_a Q(s', a)$ as the target, making it off-policy.
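The three targets differ only in how they estimate the utility from the next step onward. A hedged sketch, with function names and the done flag (marking a terminal transition) introduced here for illustration:

```python
from typing import Iterable, List


def monte_carlo_target(rewards_from_t: List[float], gamma: float) -> float:
    """Full-rollout utility from time t: no bootstrapping, so no bias from Q."""
    target = 0.0
    for r in reversed(rewards_from_t):
        target = r + gamma * target
    return target


def sarsa_target(reward: float, gamma: float, q_next_taken: float, done: bool) -> float:
    """On-policy: bootstrap from Q(s', a') for the action a' actually taken next."""
    return reward if done else reward + gamma * q_next_taken


def q_learning_target(reward: float, gamma: float, q_next_all: Iterable[float], done: bool) -> float:
    """Off-policy: bootstrap from max_a Q(s', a), whatever action is taken next."""
    return reward if done else reward + gamma * max(q_next_all)


# Hypothetical current estimates of Q(s', .) for two actions.
q_next = {"left": 1.0, "right": 3.0}
sarsa = sarsa_target(0.5, 0.9, q_next["left"], done=False)          # next action was "left"
q_learn = q_learning_target(0.5, 0.9, q_next.values(), done=False)  # uses the best action
```

In this example the Q-learning target exceeds the SARSA target because the exploration policy happened to pick the lower-valued action in the next state.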

3. Policy-Based Methods (Policy Gradient)

Instead of estimating values, these methods directly optimize the policy $\pi_\theta(a|s)$, treating it as a probabilistic classifier.

Imitation Learning: If expert demonstrations are available, this is simple supervised learning (maximizing log-probability of expert actions).
REINFORCE Algorithm:
- Objective: Maximize $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} [U(\tau)]$.
- Gradient: $\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} [\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot U(\tau)]$.
- Implementation: Sample a trajectory, compute the gradient of the log-probability of actions taken, and weight it by the trajectory's utility.
Key Insight: If $U(\tau)$ is binary (success/failure), this is equivalent to imitation learning on successful trajectories.
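A minimal sketch of the REINFORCE estimate, assuming a tabular softmax policy with one logit per (state, action) pair. The one-state, two-action problem, the learning rate, and all function names are assumptions chosen to keep the example small; they are not from the lecture.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - np.max(logits)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


def grad_log_policy(theta: np.ndarray, state: int, action: int) -> np.ndarray:
    """Gradient of log pi_theta(action | state) for a tabular softmax policy."""
    grad = np.zeros_like(theta)
    grad[state] = -softmax(theta[state])
    grad[state, action] += 1.0
    return grad


def reinforce_gradient(theta: np.ndarray, trajectory, utility: float) -> np.ndarray:
    """Single-sample estimate: sum_t grad log pi(a_t | s_t), weighted by U(tau)."""
    grad = np.zeros_like(theta)
    for state, action in trajectory:
        grad += grad_log_policy(theta, state, action)
    return grad * utility


def run_bandit(num_steps: int = 2000, lr: float = 0.1, seed: int = 0) -> np.ndarray:
    """Toy problem (an assumption): one state; action 1 pays 1, action 0 pays 0."""
    rng = np.random.default_rng(seed)
    theta = np.zeros((1, 2))
    for _ in range(num_steps):
        action = rng.choice(2, p=softmax(theta[0]))
        utility = float(action == 1)
        # Gradient ascent on expected utility.
        theta += lr * reinforce_gradient(theta, [(0, action)], utility)
    return softmax(theta[0])


final_probs = run_bandit()
```

Because the utility here is binary, the update is nonzero only on successful rollouts, where it is exactly a log-likelihood step on the action taken. That is the imitation-learning reading of the key insight above.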

4. Variance Reduction and Baselines

The gradient estimate in REINFORCE can have high variance.

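Control Variates: Subtracting a baseline $B(s)$ (a function that does not depend on action $a$) from the utility $U(\tau)$ can reduce variance without introducing bias. The estimate stays unbiased because $\mathbb{E}_{a \sim \pi_\theta(\cdot|s)}[\nabla_\theta \log \pi_\theta(a|s) \cdot B(s)] = B(s) \nabla_\theta \sum_a \pi_\theta(a|s) = B(s) \nabla_\theta 1 = 0$.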
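The zero-expectation property of the baseline term can be checked numerically. The sketch below assumes a softmax policy over three actions in a single state and sums over actions exactly rather than sampling; the logits and the baseline value are arbitrary illustrative numbers.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - np.max(logits)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


def expected_baseline_term(logits: np.ndarray, baseline: float) -> np.ndarray:
    """Compute E_{a ~ pi}[grad log pi(a | s) * B(s)] exactly by summing over actions."""
    probs = softmax(logits)
    total = np.zeros_like(logits)
    for action, p in enumerate(probs):
        # For a softmax policy, grad_logits log pi(a) = one_hot(a) - probs.
        grad_log = -probs.copy()
        grad_log[action] += 1.0
        total += p * grad_log * baseline
    return total


term = expected_baseline_term(np.array([0.3, -1.2, 2.0]), baseline=5.0)
```

The result is zero up to floating-point error for any baseline value, which is why subtracting $B(s)$ leaves the expected gradient unchanged.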
Actor-Critic: Combines policy gradients with value function estimation. The "critic" provides a learned baseline or value estimate to stabilize the "actor's" updates.
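One common way to realize this idea is a one-step update in which the critic's TD error replaces the baseline-corrected utility. The sketch below assumes a tabular softmax actor and a tabular state-value critic; this particular variant, the function name, and the hyperparameters are assumptions and are not necessarily the form used in the lecture.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - np.max(logits)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


def actor_critic_step(theta, values, state, action, reward, next_state, done,
                      gamma=0.9, actor_lr=0.5, critic_lr=0.5):
    """Update actor logits `theta` and critic `values` in place; return the TD error."""
    bootstrap = 0.0 if done else values[next_state]
    # The TD error plays the role of (utility - baseline), with V(state) as the baseline.
    td_error = reward + gamma * bootstrap - values[state]
    # Compute the policy gradient before touching any parameters.
    grad_log = -softmax(theta[state])
    grad_log[action] += 1.0
    # Critic: move V(state) toward the bootstrapped target.
    values[state] += critic_lr * td_error
    # Actor: policy-gradient step weighted by the critic's TD error.
    theta[state] += actor_lr * td_error * grad_log
    return td_error


theta = np.zeros((2, 2))   # 2 states x 2 actions
values = np.zeros(2)
td = actor_critic_step(theta, values, state=0, action=1, reward=1.0,
                       next_state=1, done=True)
```

Bootstrapping from the critic lowers variance relative to using the full rollout utility, at the cost of bias whenever the critic's estimates are inaccurate.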

5. Synthesis and Conclusion

The lecture establishes a unified view of RL algorithms:

Model-Based: Estimate MDP $\rightarrow$ Value Iteration.
Value-Based: Estimate $Q$ $\rightarrow$ Derive $\pi$ via $\arg\max$.
Policy-Based: Directly estimate $\pi$ $\rightarrow$ Gradient ascent on expected utility.

All of these methods share a common structure: Rollout $\rightarrow$ Loss Calculation $\rightarrow$ Gradient Update. The choice among them depends on how much bias and variance the application can tolerate and on the nature of the state space. Tabular methods come with convergence guarantees; function approximation gives up those guarantees but is the standard for modern, complex environments.