MIT 6.S191 (2024): Reinforcement Learning
By Alexander Amini
Key Concepts
- Agent: An entity that takes actions within an environment.
- Environment: The world in which the agent operates and interacts.
- State: The current observation or situation of the agent within the environment.
- Action: A decision or command that the agent sends to the environment.
- Reward: Feedback from the environment that measures the success or failure of the agent's actions.
- Discount Factor (Gamma): A value between 0 and 1 that reduces the impact of future rewards, prioritizing immediate rewards.
- Q-function: A function that estimates the expected total future reward for taking a specific action in a given state.
- Policy Function: A function that determines the optimal action to take in a given state.
- Value Learning: Algorithms that learn the Q-function to indirectly determine the optimal policy.
- Policy Learning (Policy Gradients): Algorithms that directly learn the policy function.
- Rollout: Executing a policy in the environment to observe the resulting states, actions, and rewards.
- Stochastic Policy: A policy that outputs a probability distribution over actions, allowing for exploration.
- Deterministic Policy: A policy that always returns the same action for a given state, e.g., the action with the highest Q-value.
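To make the interaction between these pieces concrete, here is a minimal sketch of the agent-environment loop. The `env` and `agent` objects and their methods (`reset`, `step`, `act`) are hypothetical, Gym-style placeholders for illustration, not code from the lecture:
```python
# Minimal agent-environment interaction loop (hypothetical Gym-style interface).
def run_episode(env, agent, gamma=0.99):
    state = env.reset()                 # initial observation s_0 from the environment
    total_reward, discount = 0.0, 1.0
    done = False
    while not done:
        action = agent.act(state)               # the policy picks an action a_t
        state, reward, done = env.step(action)  # environment returns s_(t+1), r_t, done flag
        total_reward += discount * reward       # accumulate the discounted reward
        discount *= gamma                       # gamma in [0, 1) downweights later rewards
    return total_reward
```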
Supervised Learning vs. Unsupervised Learning vs. Reinforcement Learning
- Supervised Learning: Learns a mapping from input data (X) to output labels (Y) using labeled datasets. Example: Classifying images of apples.
- Unsupervised Learning: Learns the underlying structure of data (X) without labels. Example: Clustering images of similar objects together.
- Reinforcement Learning: Learns to maximize future rewards by interacting with an environment through states and actions. Example: An agent learning to eat an apple for nutrition.
Reinforcement Learning Terminology and Concepts
- Agent: The entity that interacts with the environment. Examples: a drone delivering packages, Mario in a Super Mario game, a self-driving car.
- Environment: The world in which the agent exists and interacts.
- State (s): The observation of the environment presented to the agent.
- Action (a): A command the agent sends to the environment. The set of all possible actions is denoted by 'A'. Example: moving forward, backward, left, or right.
- Reward (r): Feedback from the environment indicating the success or failure of an action. Can be immediate (e.g., touching a gold coin) or delayed.
- Total Reward (R(t)): The sum of all rewards the agent collects from time t onward.
- Discounted Reward: Future rewards are multiplied by increasing powers of a discount factor (gamma), so rewards received sooner count more than rewards received later and the sum stays well-behaved over long horizons (see the short sketch after this list).
- Q-function (Q(s, a)): Estimates the expected total future reward (return) for taking action 'a' in state 's'.
- Policy (π(s)): A function that determines the optimal action to take in a given state 's'. It can be derived from the Q-function by selecting the action that maximizes the Q-value.
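As a small worked example of the last two definitions, the sketch below computes a discounted return and derives a greedy policy from a Q-function. The tabular dictionary `Q` is a simplification for illustration; deep RL replaces it with a neural network:
```python
def discounted_return(rewards, gamma=0.99):
    """R_t = r_t + gamma * r_(t+1) + gamma^2 * r_(t+2) + ... for a finite rollout."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def greedy_policy(Q, state, actions):
    """Derive a policy from a Q-function: pick the action with the highest Q-value."""
    return max(actions, key=lambda a: Q[(state, a)])

# Three rewards of 1.0 with gamma = 0.9: 1 + 0.9 + 0.81 = 2.71
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))
```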
Value Learning: Deep Q-Networks (DQN)
- Goal: Learn the Q-function to determine the optimal policy.
- Example: Atari Breakout: The agent (paddle) moves left or right to bounce a ball and break blocks. The Q-function estimates the expected reward for each action in a given state.
- Network Architecture:
- Input: State (e.g., pixels on the screen).
- Output: Q-values for each possible action.
- Training Process:
- The agent observes the current state.
- The agent uses the Q-network to predict Q-values for each possible action.
- The agent selects the action with the highest Q-value (or occasionally explores by trying a different action).
- The agent executes the action and receives a reward and the next state from the environment.
- The agent calculates the target Q-value using the Bellman equation:
Target Q = Reward + Discount Factor * max(Q(next state, all actions))
- The agent updates the Q-network to minimize the difference between the predicted Q-value and the target Q-value using a loss function (e.g., mean squared error); see the sketch after this section.
- Q-Loss: The mean squared error between the predicted Q-value and the target Q-value.
- DeepMind's Atari Results: Deep Q-Networks (DQN) surpassed human performance in over 50% of Atari games using only pixel input and no human supervision.
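The training loop above reduces to one repeated update: predict Q-values, form the Bellman target, and minimize the Q-loss. Below is a minimal sketch of that update in PyTorch; the network sizes, state dimension, and hyperparameters are illustrative assumptions, not the lecture's code (a real Atari DQN would also use a convolutional network over pixel frames, an experience-replay buffer, and a separate target network):
```python
import torch
import torch.nn as nn

# Q-network: maps a state vector to one Q-value per discrete action (2 actions here).
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def dqn_update(states, actions, rewards, next_states, dones):
    """One gradient step on the Q-loss: MSE between predicted Q and the Bellman target."""
    # Predicted Q-values for the actions that were actually taken.
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # Bellman target: r + gamma * max_a' Q(s', a'); no bootstrapping at episode end.
    with torch.no_grad():
        q_next = q_net(next_states).max(dim=1).values
        q_target = rewards + gamma * q_next * (1.0 - dones)
    loss = nn.functional.mse_loss(q_pred, q_target)   # the Q-loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```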
Downsides of Q-Learning
- Complexity: Works best with small, discrete action spaces; it does not scale naturally to large or continuous action spaces.
- Flexibility: Cannot directly learn stochastic policies; policies are deterministic (always choosing the action with the highest Q-value).
Policy Learning (Policy Gradients)
- Goal: Directly learn the policy function (π(s)) that maps states to actions.
- Approach: Optimize the policy function to maximize the expected reward.
- Network Architecture:
- Input: State.
- Output: A probability distribution over possible actions.
- Action Selection: Sample an action from the probability distribution output by the policy network. This allows for exploration and stochastic policies.
- Advantages over Q-Learning:
- Can handle continuous action spaces.
- Can learn stochastic policies.
- Continuous Action Spaces: Instead of predicting probabilities for discrete actions, the network learns the parameters (e.g., mean and standard deviation) of a probability distribution (e.g., Gaussian) over continuous actions.
- Example: Self-Driving Car:
- Agent: The car.
- Environment: The road.
- State: Sensory information (cameras, LiDAR, radar).
- Action: Steering wheel angle.
- Reward: Distance driven without crashing.
- Training Process:
- Initialize the agent in the environment.
- Execute a rollout of the policy, recording state-action pairs.
- Calculate the total discounted reward for the rollout.
- Update the policy to increase the probability of actions that led to high rewards and decrease the probability of actions that led to low rewards.
- Loss Function:
Loss = -log(P(action | state)) * Total Discounted Reward
- Policy Gradient: The gradient of this loss with respect to the policy network's weights; following it increases the probability of actions that led to high returns (see the sketch below).
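Putting the continuous-action idea and the loss together, here is a minimal REINFORCE-style sketch in PyTorch with a Gaussian policy head (e.g., over a steering angle). The architecture, state dimension, and single-batch update are illustrative assumptions, not the lecture's code:
```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Policy network: maps a state to the mean (and a learned std) of a Gaussian
    over one continuous action, e.g., a steering angle."""
    def __init__(self, state_dim=8):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        self.mean = nn.Linear(64, 1)
        self.log_std = nn.Parameter(torch.zeros(1))

    def dist(self, states):
        return torch.distributions.Normal(self.mean(self.body(states)),
                                          self.log_std.exp())

policy = GaussianPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(states, actions, discounted_returns):
    """Increase log-probability of actions in proportion to the return they earned."""
    log_probs = policy.dist(states).log_prob(actions).squeeze(-1)
    loss = -(log_probs * discounted_returns).mean()   # Loss = -log P(a|s) * R
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
In practice, subtracting a baseline (for example, the mean return of the rollout) from `discounted_returns` reduces the variance of this update without changing its expected direction.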
Applying Policy Gradients in Real Life
- Challenges: Crashing the car during training is not feasible.
- Solution: Use a hyper-photorealistic simulator to train the policy in a safe environment.
- Sim-to-Real Transfer: Train the policy in simulation and then deploy it on a real car.
Advanced Applications: Game of Go
- Complexity: The game of Go has a massive state space, making it challenging for AI.
- Approach:
- Supervised Learning (Imitation Learning): Train a neural network to mimic human gameplay using recorded games played by expert human Go players.
- Reinforcement Learning (Self-Play): Let the pre-trained network play against copies of itself. The agent that wins a game has its actions reinforced, while the agent that loses has its actions penalized (see the sketch after this list).
- AlphaGo Zero: A variation that starts from a completely random network and learns to play Go entirely through self-play, achieving superhuman performance.
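A rough sketch of how the self-play reward assignment could look, building on the policy-gradient loss above. The `play_game` helper and the +1/-1 returns are hypothetical illustrations of the idea, not AlphaGo's actual training code:
```python
import torch

def self_play_update(policy, optimizer, play_game):
    """One self-play round: reinforce the winner's moves, penalize the loser's.

    `play_game(policy)` is a hypothetical helper that plays the policy against a
    copy of itself and returns the log-probabilities of each player's chosen
    moves plus which player won ("a" or "b").
    """
    log_probs_a, log_probs_b, winner = play_game(policy)
    return_a = 1.0 if winner == "a" else -1.0
    return_b = -return_a
    # Same policy-gradient loss as before: -log P(move | board) * return.
    loss = -(torch.stack(log_probs_a).sum() * return_a +
             torch.stack(log_probs_b).sum() * return_b)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```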
Conclusion
The lecture provides a comprehensive overview of reinforcement learning, covering the fundamental concepts, key algorithms (Q-learning and policy gradients), and advanced applications. It highlights the differences between value learning and policy learning, the advantages and disadvantages of each approach, and the challenges of applying reinforcement learning in real-world scenarios. The lecture emphasizes the importance of exploration, the use of simulators for safe training, and the potential for AI to surpass human performance in complex tasks.