Stanford CS224R Deep Reinforcement Learning | Spring 2025 | Lecture 1: Class Intro
By Stanford Online
Deep Reinforcement Learning - CS224R: Course Introduction & Foundations
Key Concepts:
- Reinforcement Learning (RL): Learning to make sequential decisions in an environment to maximize a cumulative reward.
- Markov Decision Process (MDP): A mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision maker.
- Policy (π): A function mapping states (or observations) to actions.
- State (S): A complete description of the environment at a given time.
- Observation (O): A potentially incomplete or noisy representation of the state.
- Reward (R): A scalar signal indicating the immediate value of an action in a given state.
- Trajectory: A sequence of states, actions, and rewards.
- Value Function (Vπ(s)): The expected cumulative reward starting from state s and following policy π.
- Q-function (Qπ(s, a)): The expected cumulative reward starting from state s, taking action a, and then following policy π.
- Discount Factor (γ): A value between 0 and 1 that determines the importance of future rewards.
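The last three definitions fit together in the standard formalization below; the notation is reconstructed from the definitions above rather than copied from the lecture slides:

```latex
% Expected discounted return from state s under policy \pi:
V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\Big|\, s_0 = s\right]

% Same quantity when the first action is fixed to a:
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\Big|\, s_0 = s,\ a_0 = a\right]

% The two are related by averaging over the policy's action choice:
V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[Q^{\pi}(s, a)\right]
```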
I. Course Overview & Goals
The course, CS224R, focuses on Deep Reinforcement Learning (DRL). The instructor, Chelsea Finn, highlighted the course's emphasis on solutions that scale with deep neural networks, with minimal focus on non-deep-learning approaches. Core learning goals for the first lecture were to define DRL, understand how behavior is represented, and formulate RL problems. The course will cover imitation learning, model-free/model-based RL, offline/online RL, multi-task/meta RL, and, specifically, RL applications for Large Language Models (LLMs) and robotics. Implementation of algorithms and practical examples in robot control and language models will be prioritized.
II. Reinforcement Learning vs. Supervised Learning
A key distinction was drawn between RL and supervised learning. Supervised learning relies on labeled data (X, Y) sampled independently and identically distributed (IID) to learn a mapping function. RL, by contrast, focuses on learning behavior, i.e., mapping states to actions (π(s)), through indirect feedback (rewards) obtained by interacting with an environment. The data in RL is not IID; it is generated by the policy being learned, creating a feedback loop (see the sketch below). This makes RL suitable for problems where direct supervision is unavailable or insufficient. Behavior encompasses a wide range of applications, including motor control (robotics), chatbots, game playing, autonomous driving, and web interaction.
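To make that feedback loop concrete, here is a minimal Python sketch of the interaction loop, assuming a simplified Gym-style environment interface (`env.reset()`, `env.step(action)`) and a `policy` callable; both names are illustrative stand-ins, not code from the lecture:

```python
def collect_trajectory(env, policy, max_steps=200):
    """Roll out one trajectory by letting the policy act in the environment."""
    obs = env.reset()                      # initial observation
    trajectory = []
    for _ in range(max_steps):
        action = policy(obs)               # the learned policy chooses the action...
        obs_next, reward, done = env.step(action)  # ...which determines what data we see next
        trajectory.append((obs, action, reward))   # hence samples are not IID
        obs = obs_next
        if done:
            break
    return trajectory
```

Because the policy generates the very data it will later be trained on, improving the policy changes the data distribution, which is exactly the feedback loop absent from supervised learning.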
III. Formulating the Reinforcement Learning Problem
The core of RL involves representing experience as data. This is achieved through:
- State (S): The complete information about the environment.
- Observation (O): A potentially incomplete view of the state. If the observation doesn't fully capture the state, past observations become crucial.
- Action (A): The decision made by the agent.
- Trajectory: A sequence of states/observations and actions.
- Reward (R): A scalar signal indicating the desirability of a state-action pair.
The environment's dynamics are described by the transition probability P(S_{t+1} | S_t, A_t), the probability of transitioning to the next state given the current state and action. A crucial property is the Markov property: the next state depends only on the current state and action, not on the entire history. The goal is to maximize the expected cumulative reward under the distribution over trajectories induced by the policy. A discount factor (γ) can be applied to prioritize immediate rewards over future ones.
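In standard notation (reconstructed here, not taken verbatim from the slides), the distribution over trajectories induced by a policy π and the resulting objective are:

```latex
% Probability of a trajectory \tau = (s_0, a_0, s_1, a_1, \ldots) under policy \pi:
p_{\pi}(\tau) = p(s_0) \prod_{t=0}^{T-1} \pi(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)

% RL objective: maximize the expected (discounted) cumulative reward:
\max_{\pi} \; \mathbb{E}_{\tau \sim p_{\pi}}\left[\sum_{t=0}^{T} \gamma^{t} R(s_t, a_t)\right]
```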
Example Formulations:
- Robotics: State = RGB images, joint positions/velocities; Action = commanded joint positions; Reward = 1 if the towel is on the hook, 0 otherwise.
- Chatbot: Observation = user's message; Action = chatbot's response; Reward = +1 for upvote, -1 for downvote, 0 for no feedback.
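These two reward specifications are simple enough to sketch in code. The following Python is illustrative only; the success detector `towel_on_hook` and the feedback labels are hypothetical stand-ins, not code from the course:

```python
from typing import Optional

def towel_on_hook(state: dict) -> bool:
    # Hypothetical success detector; in a real system this might be a
    # camera-based classifier. Here we simply read a flag for illustration.
    return bool(state.get("towel_on_hook", False))

def robot_reward(state: dict) -> float:
    """Sparse binary reward: 1 only when the towel hangs on the hook."""
    return 1.0 if towel_on_hook(state) else 0.0

def chatbot_reward(feedback: Optional[str]) -> float:
    """Map user feedback to a scalar reward; no feedback yields 0."""
    return {"upvote": 1.0, "downvote": -1.0}.get(feedback, 0.0)
```

Note how sparse both signals are: most state-action pairs receive a reward of 0, a common source of difficulty in RL.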
IV. Why Study Deep Reinforcement Learning?
Several compelling reasons were presented:
- Beyond Supervised Learning: RL addresses scenarios where AI predictions have consequences, requiring optimization considering those consequences (e.g., Spotify recommendations).
- Direct Supervision Not Always Available: RL excels in situations with indirect feedback (e.g., coding assistants receiving "good job" or "poor job" feedback).
- Real-World Applications: DRL powers performant AI systems in robotics, game playing, and increasingly, language models.
- Learning from Experience: This is considered fundamental to intelligence, enabling discovery and adaptation beyond mimicking data.
- Open Research Problems: DRL presents numerous challenging and exciting research opportunities.
- Financial Incentive: The field offers the potential for high compensation (one cited example: $1 million total compensation at OpenAI).
Real-World Examples:
- Legged Robots: DRL enables robots to learn complex locomotion skills, transferring from simulation to the real world (e.g., "puffy robots," dancing humanoids).
- Robotic Manipulation: Robots can learn to perform tasks like laundry folding and origami.
- Game Playing (Go): DeepMind’s AlphaGo defeated the world champion, discovering novel strategies (Move 37).
- Language Models: Modern LLMs utilize RL for post-training, improving reasoning capabilities.
- Chip Design: DRL has been used to design chips for Google’s TPUs.
- Traffic Control: Optimizing traffic flow by strategically inserting autonomous vehicles.
- Generative Models: Improving image generation based on prompts.
V. Policies and Value Functions
A policy (π) maps states (or observations) to actions, often implemented as a neural network. The policy generates trajectories through interaction with the environment. To evaluate a policy, value functions (Vπ(s)) and Q-functions (Qπ(s, a)) are used. Vπ(s) estimates the expected cumulative reward starting from state s and following policy π, while Qπ(s, a) estimates the expected cumulative reward starting from state s, taking action a, and then following policy π.
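As a concrete illustration of a policy implemented as a neural network, here is a minimal PyTorch sketch: a small MLP mapping an observation vector to a distribution over discrete actions. The architecture and sizes are illustrative choices, not taken from the course:

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    """MLP policy: observation vector -> distribution over discrete actions."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.distributions.Categorical:
        logits = self.net(obs)  # unnormalized action preferences
        return torch.distributions.Categorical(logits=logits)

policy = Policy(obs_dim=4, n_actions=2)
dist = policy(torch.zeros(4))   # π(a | s) for a dummy observation
action = dist.sample()          # acting amounts to sampling from the policy
```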
VI. Algorithms & Future Topics
The course will cover a range of DRL algorithms, including:
- Imitation Learning: Mimicking expert behavior.
- Policy Gradients: Directly optimizing the policy.
- Actor-Critic Methods: Combining policy optimization with value function estimation.
- Value-Based Methods: Estimating the optimal value function.
- Model-Based Methods: Learning a model of the environment's dynamics.
The choice of algorithm depends on factors like data availability, supervision type, and action space characteristics.
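As a preview of the policy-gradient family listed above, the classic REINFORCE estimator (standard notation, not reproduced from the lecture) directly differentiates the expected return J of a parameterized policy π_θ:

```latex
% R(\tau) denotes the (discounted) return of trajectory \tau:
\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim p_{\pi_{\theta}}}
  \left[\left(\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\right) R(\tau)\right]
```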
Notable Quote:
“Learning from experience seems kind of fundamental to intelligence. And so just from a kind of a philosophical level in terms of building intelligent machines, figuring out how to actually develop algorithms that can learn from their own experience seems pretty fundamental to that.” – Chelsea Finn, Instructor.