Stanford CS230 | Autumn 2025 | Lecture 5: Deep Reinforcement Learning
Key Concepts
- Deep Reinforcement Learning (DRL): The combination of Deep Learning and Reinforcement Learning.
- Reinforcement Learning (RL): A machine learning paradigm focused on an agent learning to make sequences of good decisions through experience.
- Agent: The entity that interacts with the environment.
- Environment: The external system with which the agent interacts.
- State (S): The current configuration of the environment.
- Observation (O): What the agent perceives from the environment, which may not be the full state.
- Action (A): A decision made by the agent.
- Reward (R): A scalar feedback signal from the environment indicating the desirability of an action.
- Return: The cumulative discounted reward over a sequence of actions.
- Discount Factor (γ): A value between 0 and 1 that weighs future rewards less than immediate rewards.
- Transition: The process of moving from one state to another due to an action.
- Q-Table: A lookup table storing the expected future reward (Q-value) for taking a specific action in a specific state.
- Q-Learning: An RL algorithm that learns Q-values.
- Bellman Optimality Equation: A fundamental equation in RL that defines the optimal Q-function.
- Policy: A function that maps states to actions, defining the agent's strategy.
- Deep Q-Network (DQN): A neural network used to approximate the Q-function, enabling DRL.
- Experience Replay: A technique to store and reuse past experiences, improving data efficiency and reducing correlation.
- Exploration vs. Exploitation: The trade-off between trying new actions to discover better strategies (exploration) and using known good actions (exploitation).
- Epsilon-Greedy: A common strategy for balancing exploration and exploitation by taking a random action with a small probability (epsilon).
- Reinforcement Learning from Human Feedback (RLHF): A technique to align language models with human preferences.
- Supervised Fine-Tuning (SFT): Fine-tuning a pre-trained model on human-written demonstrations of desired behavior.
- Reward Model: A separate model trained to predict human preferences for given responses.
- Proximal Policy Optimization (PPO): A policy-based RL algorithm that learns policies directly.
Deep Reinforcement Learning: The Marriage of Deep Learning and Reinforcement Learning
Deep Reinforcement Learning (DRL) is presented as the fusion of Deep Learning and Reinforcement Learning. The lecture aims to explain how Reinforcement Learning works and how neural networks can be integrated to build intelligent RL agents.
Motivation and Successes of RL
Reinforcement Learning has achieved remarkable success in various domains, often exceeding human performance. Key examples include:
- Human-Level Control Through Deep Reinforcement Learning (DeepMind): A single algorithm (the DQN) learned to play 49 Atari games from raw pixels, many at a superhuman level.
- AlphaGo (DeepMind): An algorithm that defeated and surpassed human performance in the complex game of Go.
- Strategy Games (e.g., StarCraft, Dota): RL has been applied to complex multi-agent games requiring long-term and short-term strategic thinking, as well as collaborative play.
- Reinforcement Learning from Human Feedback (RLHF): A recent development (2022) that significantly improved language model alignment with human preferences, contributing to the leap from GPT-2 to ChatGPT.
The Limitations of Supervised Learning for Complex Tasks
The lecture highlights why traditional supervised learning, which relies on labeled data, is insufficient for tasks like game playing.
- The Game of Go Example:
- Data Collection: Using historical games from professional players as labeled data (board state as input, next board state as output) is problematic.
- Shortcomings:
- Incomplete State Space: It's impossible to capture all possible board states from historical data.
- Ill-defined Ground Truth: Even expert moves might not be optimal, and human performance varies. The "best" move is not always clear.
- Lack of Strategy Understanding: Supervised learning only learns to mimic specific moves, not the underlying long-term strategy.
Reinforcement Learning: Learning by Experience
RL addresses these limitations by focusing on learning through experience rather than explicit examples. The core idea is to make good sequences of decisions automatically.
Core RL Vocabulary and Concepts
- Agent and Environment: The agent interacts with the environment.
- State (S) and Observation (O): The environment has a state, but the agent might only receive an observation, which can be partial (e.g., "fog of war" in games like StarCraft or League of Legends).
- Action (A): The agent performs actions (e.g., placing a stone in Go).
- Reward (R): A scalar feedback signal received after an action.
- Transition: The process of moving from state $S_t$ to $S_{t+1}$ after taking action $A_t$.
- Return: The cumulative discounted reward, defined as $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$, where $\gamma$ is the discount factor and $R_{t+k+1}$ is the reward received $k+1$ steps after time $t$ (a short numeric example follows this list).
- Discount Factor (γ): A value between 0 and 1 that makes future rewards less valuable than immediate ones, reflecting concepts like inflation or energy decay.
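As a quick illustration (added here, not from the lecture), the return can be computed backwards through an episode:

```python
# Discounted return: G_t = R_{t+1} + gamma * G_{t+1}, accumulated backwards.
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for r in reversed(rewards):  # start from the final reward
        g = r + gamma * g
    return g

# Rewards 1, 1, 10 with gamma = 0.9: 1 + 0.9*1 + 0.9**2 * 10 = 10.0
print(discounted_return([1, 1, 10]))
```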
Illustrative Example: The Recycling Game
A simple five-state recycling game is used to illustrate RL concepts:
- States: Initial (State 2), Garbage (State 1), Empty (State 3), Chocolate Packaging (State 4), Recycle Bin (State 5).
- Actions: Left, Right.
- Rewards: +2 for garbage, +1 for chocolate, +10 for recycle bin.
- Terminal States: Garbage (State 1) and Recycle Bin (State 5).
- Constraint: A garbage collector arrives in 3 minutes, and each state transition takes 1 minute, preventing infinite loops of collecting chocolate.
Calculating Long-Term Return and Strategy
- Q-Table: A matrix of states x actions, where each entry $Q(S, A)$ represents the expected discounted return of taking action $A$ in state $S$. The goal is to learn this table.
- Backtracking Algorithm: Used to compute discounted returns by traversing a tree representation of the environment.
- For a terminal state, the return is the immediate reward.
- For non-terminal states, the return is the immediate reward plus the discounted maximum future return from the next state.
- Bellman Optimality Equation: $Q^*(S, A) = R + \gamma \max_{A'} Q^*(S', A')$. The optimal Q-value of a state-action pair is the immediate reward plus the discounted maximum Q-value achievable from the next state (a worked value-iteration sketch for the recycling game follows this list).
- Policy: Derived from the Q-table by selecting the action with the highest Q-value for a given state ($\text{argmax}_A Q^*(S, A)$).
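To make the Q-table and the Bellman backup concrete, here is a minimal value-iteration sketch for a simplified recycling game. This is a hedged illustration: rewards are assumed to be received on entering a state, and the lecture's 3-minute limit is ignored, since modeling it would require adding the remaining time to the state.

```python
import numpy as np

LEFT, RIGHT = 0, 1
reward_on_enter = {1: 2.0, 4: 1.0, 5: 10.0}  # garbage, chocolate, recycle bin
terminal = {1, 5}
gamma = 0.9

Q = np.zeros((6, 2))  # row s = state 1..5 (row 0 unused), column = action

for _ in range(100):  # repeat Bellman backups until values stop changing
    for s in (2, 3, 4):  # non-terminal states on the chain 1-2-3-4-5
        for a, s_next in ((LEFT, s - 1), (RIGHT, s + 1)):
            r = reward_on_enter.get(s_next, 0.0)
            future = 0.0 if s_next in terminal else gamma * Q[s_next].max()
            Q[s, a] = r + future  # Q*(S,A) = R + gamma * max_A' Q*(S',A')

# Greedy policy: argmax over actions. From state 2, right (0 + 0.9*10 = 9)
# beats left (+2), so the agent heads for the recycle bin.
for s in (2, 3, 4):
    print(s, Q[s], ["left", "right"][int(Q[s].argmax())])
```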
The Challenge of Large State Spaces and the Rise of Deep Q-Learning
The Q-table approach becomes intractable for games with vast state spaces (e.g., Go, Chess). This is where Deep Learning comes in.
- Deep Q-Network (DQN): A neural network is used to approximate the Q-function. Instead of a lookup table, the network takes the state (and potentially action) as input and outputs Q-values for each action.
- Training a DQN:
- No Explicit Labels: Traditional supervised learning labels are unavailable.
- Leveraging the Bellman Equation: The Bellman equation provides a target for training. The loss function aims to minimize the difference between the predicted Q-value and a target value derived from the Bellman equation: $Y = R + \gamma \max_{A'} Q(S', A')$.
- Target Calculation: This involves two forward passes: one on the current state $S$ to get the predicted Q-values, and another on the next state $S'$ to compute the target.
- Fixed Target: To simplify differentiation, the target Q-values are treated as constants during backpropagation, so no gradient flows through the target pass (see the sketch after this list).
- Discounting: The discount factor is crucial for making decisions that prioritize shorter, more rewarding paths.
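A minimal PyTorch sketch of one DQN update along these lines (the network size and variable names are assumptions, not the lecture's code); note the fixed target computed under `no_grad` and the terminal-state mask, both discussed in the techniques below:

```python
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def train_step(s, a, r, s_next, done):
    """One DQN update on a batch: s (B,4) float, a (B,) long,
    r (B,) float, s_next (B,4) float, done (B,) float in {0,1}."""
    # Forward pass 1: predicted Q(s, a) for the actions actually taken.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    # Forward pass 2: Bellman target Y = R + gamma * max_a' Q(s', a'),
    # held FIXED (no gradient flows through this pass).
    with torch.no_grad():
        q_next = q_net(s_next).max(dim=1).values
        target = r + gamma * q_next * (1.0 - done)  # terminal: target = r
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```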
Practical Techniques for Training RL Agents
Several techniques are essential for effective RL training:
- Preprocessing:
- Input Representation: For games like Breakout, the input is the screen pixels.
- Dimensionality Reduction: Converting to grayscale, cropping irrelevant parts of the screen (e.g., score, background).
- History: Including a history of past frames (e.g., 4 frames) is crucial to infer motion and direction (e.g., ball trajectory).
- Architecture: Convolutional Neural Networks (CNNs) are well-suited for processing pixel-based inputs (a minimal preprocessing sketch follows).
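A minimal sketch of such a pipeline (crop boundaries and output sizes are illustrative assumptions, roughly following the Atari DQN setup):

```python
import numpy as np

def preprocess(frame_rgb):               # frame_rgb: (210, 160, 3) uint8
    gray = frame_rgb.mean(axis=2)        # grayscale: (210, 160)
    cropped = gray[34:194, :]            # drop score bar and bottom: (160, 160)
    small = cropped[::2, ::2] / 255.0    # downsample, scale to [0, 1]: (80, 80)
    return small.astype(np.float32)

def stack_frames(last_four):             # list of the 4 most recent frames
    return np.stack(last_four, axis=0)   # (4, 80, 80) CNN input: motion is inferable
```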
- Experience Replay:
- Problem: Consecutive experiences within an episode are highly correlated, leading to inefficient training.
- Solution: Store transitions (S, A, R, S') in a replay memory. Sample mini-batches randomly from this memory for training.
- Benefits: Improves data efficiency, reduces correlation, and allows experiences to be reused (sketched below).
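A minimal replay-memory sketch (capacity and batch size are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))  # store one transition

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the correlation between
        # consecutive transitions within an episode.
        batch = random.sample(self.buffer, batch_size)
        return map(list, zip(*batch))  # lists of states, actions, rewards, ...

    def __len__(self):
        return len(self.buffer)
```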
- Handling Terminal States:
- When an episode ends in a terminal state, the target is simply the immediate reward, not the Bellman equation.
- Exploration vs. Exploitation (Epsilon-Greedy):
- Problem: Agents can get stuck in local optima if they only exploit known good actions.
- Solution: With probability $\epsilon$, take a random action (exploration); otherwise, take the action with the highest Q-value (exploitation), as sketched below.
- Benefit: Allows the agent to discover potentially better strategies.
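A minimal epsilon-greedy selector (in practice $\epsilon$ is often annealed, e.g. from 1.0 toward 0.1, over the course of training):

```python
import random

def select_action(q_values, epsilon):
    """q_values: per-action Q estimates for the current state."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit
```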
- Quantifying Performance:
- Reward Monitoring: Observing increasing rewards during training.
- Loss Function: Monitoring if the Bellman equation is respected.
- Competitive Self-Play: Having agents play against each other to identify superior models.
Advanced RL Topics and Challenges
- Montezuma's Revenge: A game that highlights the challenge of delayed rewards and the need for intuition or prior knowledge; random exploration is highly unlikely to stumble on the long sequence of actions needed to reach the goal.
- Imitation Learning: Learning from human demonstrations to bootstrap the learning process.
- Policy-Based Methods (e.g., PPO): Learn the policy directly, which is better suited for continuous action spaces (e.g., steering angles in autonomous driving), where discretizing actions is not ideal. PPO also learns a stochastic policy, outputting a probability distribution over actions (its objective is shown after this list).
- Reward Design: The choice of reward structure significantly impacts the agent's strategy. Intermediate rewards can guide learning, but end-to-end rewards might lead to more optimal, albeit harder-to-train, solutions.
- Multi-Agent RL: Games like Dota 2 (OpenAI Five) and StarCraft II (AlphaStar) involve collaboration and partial observations (fog of war), adding significant complexity.
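For reference, the clipped surrogate objective from the PPO paper (Schulman et al., 2017; the formula is added here, not quoted from the lecture) is
$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \text{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\right)\right], \quad r_t(\theta) = \frac{\pi_\theta(A_t \mid S_t)}{\pi_{\theta_{\text{old}}}(A_t \mid S_t)},$$
where $\hat{A}_t$ is an estimate of the advantage of action $A_t$; clipping the probability ratio $r_t(\theta)$ prevents destructively large policy updates.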
Reinforcement Learning from Human Feedback (RLHF)
RLHF is a crucial technique for aligning large language models (LLMs) with human preferences, addressing limitations of standard pre-training.
Limitations of Pre-trained Language Models
- Data Not Reflective of Helpfulness: Online text data (e.g., Wikipedia, Reddit) is not optimized for answering questions or being helpful.
- Lack of Concept of "Good": Models don't inherently understand politeness, helpfulness, or safety. They can generate factually correct but unhelpful or even harmful responses.
Improving Language Models with Human Input
Two main approaches are used:
- Supervised Fine-Tuning (SFT):
- Process: Collect human-written prompt-response pairs. Fine-tune a pre-trained LLM on this dataset to imitate desired behavior.
- Shortcomings:
- Costly Data Collection: Requires significant human effort to create high-quality demonstrations.
- Limited Generalization: May not generalize well to unseen prompts as it's still imitation-based.
- Reinforcement Learning from Human Feedback (RLHF):
- Goal: Optimize for human preferences, not just imitation.
- Steps:
- Train a Reward Model (RM):
- Sample multiple responses to a prompt from the SFT model.
- Ask human labelers to rank these responses based on preference.
- Train a separate model (the Reward Model) to predict these human preferences. This model takes a prompt and response and outputs a scalar reward score.
- The RM is initialized from the SFT model, with its output layer modified to predict a scalar reward.
- The RM is trained using a loss function that encourages higher scores for preferred responses (a sketch of this pairwise loss follows this list).
- RLHF Training:
- The SFT model acts as the agent.
- The environment is the space of prompts and generated text.
- The state is the prompt plus the tokens generated so far.
- The action is choosing the next token.
- The reward is estimated by the trained Reward Model at the end of a generated sequence.
- The LLM is fine-tuned using RL (similar to Q-learning principles) to maximize the expected reward from the RM.
- Benefits:
- Scalability: The RM can evaluate responses much faster than humans.
- Preference Optimization: Directly optimizes for what humans prefer, leading to more aligned and helpful outputs.
- Efficiency: Asking for preferences is generally easier and faster for humans than writing full responses.
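A minimal sketch of the pairwise loss that trains such a reward model (following the InstructGPT-style formulation; `reward_model` and its call signature are assumptions for illustration):

```python
import torch.nn.functional as F

def rm_loss(reward_model, prompt, preferred, rejected):
    """Push the scalar score of the human-preferred response above
    the score of the rejected one: loss = -log sigmoid(r_w - r_l)."""
    r_w = reward_model(prompt, preferred)  # scalar score, preferred response
    r_l = reward_model(prompt, rejected)   # scalar score, rejected response
    return -F.logsigmoid(r_w - r_l).mean()
```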
RLHF in Practice
- Sparse Rewards: The reward is typically given at the end of a generated sequence, making it a sparse reward episodic task, similar to games where rewards are only given upon winning or losing.
- Sequential Decision Making: The LLM needs to make a sequence of good token choices to produce a high-scoring, preferred response.
- Outcome: The process transforms a pre-trained LLM into an SFT model, and then further refines it using RLHF into a significantly better, human-aligned model.
The lecture concludes by emphasizing that while RL is powerful, human minds are often more efficient, and RL still has limitations, as discussed in a linked video by Andrej Karpathy.