Stanford CS221 | Autumn 2025 | Lecture 11: Games II
By Stanford Online
Key Concepts
- Minimax Principle: A decision-making rule for two-player zero-sum games where the agent maximizes utility and the opponent minimizes it.
- Alpha-Beta Pruning: An optimization technique for minimax search that discards branches that cannot influence the final decision.
- Temporal Difference (TD) Learning: A reinforcement learning method that updates value estimates toward a bootstrapped target — the observed reward plus the discounted value estimate of the next state.
- Self-Play: A training paradigm where an agent plays against itself to improve its policy.
- Simultaneous Games: Games where players move at the same time (e.g., Rock-Paper-Scissors).
- Mixed Strategy: A strategy where a player chooses actions according to a probability distribution.
- Minimax Theorem (von Neumann): States that for finite two-player zero-sum games, there exists an optimal mixed strategy equilibrium where the order of play does not change the game value.
- Nash Equilibrium: A state in a game where no player can increase their utility by unilaterally changing their strategy.
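To make the first two concepts concrete, here is a minimal sketch of minimax search with alpha-beta pruning over a hypothetical game tree encoded as nested lists (leaves are utilities); the tree and its shape are illustrative, not from the lecture:

```python
def minimax(node, alpha, beta, maximizing):
    """Minimax with alpha-beta pruning over a nested-list game tree.

    Leaves are utilities (numbers); internal nodes are lists of children.
    A branch is pruned as soon as alpha >= beta, because its value can
    no longer influence the decision propagated to the root.
    """
    if not isinstance(node, list):  # leaf: return its utility
        return node
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, minimax(child, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:  # the minimizer will never allow this branch
                break
        return value
    else:
        value = float("inf")
        for child in node:
            value = min(value, minimax(child, alpha, beta, True))
            beta = min(beta, value)
            if alpha >= beta:  # the maximizer already has something better
                break
        return value

# Illustrative tree: the leaf 9 and 7 subtree is pruned, root value is 3.
tree = [[3, 5], [2, [9, 7]]]
print(minimax(tree, float("-inf"), float("inf"), True))  # → 3
```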
1. Reinforcement Learning for Games
The lecture transitions from hand-coded heuristics (like those used in early chess programs) to learning evaluation functions using TD Learning.
- Value Functions: $V^\pi(s)$ represents the expected utility of state $s$ under policy $\pi$. $Q^\pi(s, a)$ represents the utility of taking action $a$ in state $s$ and following $\pi$ thereafter.
- TD Learning Mechanism: Unlike SARSA (which estimates $Q$ values using the action actually taken), TD learning estimates $V$ values directly. On observing a transition from $s$ to $s'$ with reward $r$:
- Target: $r + \gamma V(s')$
- Update: $V(s) \leftarrow V(s) + \eta \left[ r + \gamma V(s') - V(s) \right]$, nudging $V(s)$ toward the target with learning rate $\eta$.
- Function Approximation: In complex games, the state space is far too large for tabular representations. Deep neural networks (value networks) are used to approximate $V(s)$ from feature vectors of the game state.
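The TD update above can be sketched for the simplest function approximator, a linear value function $V(s) = \mathbf{w} \cdot \phi(s)$; the feature vectors and hyperparameters here are hypothetical, chosen only to illustrate the gradient step:

```python
import numpy as np

def td_update(w, phi_s, phi_sp, reward, gamma=1.0, eta=0.1, terminal=False):
    """One TD(0) update for a linear value function V(s) = w . phi(s).

    The target is r + gamma * V(s'); w moves down the gradient of the
    squared error between the prediction V(s) and that target.
    """
    v_s = w @ phi_s
    v_sp = 0.0 if terminal else w @ phi_sp
    target = reward + gamma * v_sp
    # d/dw of 0.5 * (V(s) - target)^2 is (V(s) - target) * phi(s)
    return w - eta * (v_s - target) * phi_s

# Hypothetical 3-dimensional features for two successive states.
w = np.zeros(3)
phi_s = np.array([1.0, 0.0, 1.0])
phi_sp = np.array([0.0, 1.0, 1.0])
w = td_update(w, phi_s, phi_sp, reward=1.0)
print(w)  # → [0.1 0.  0.1]
```

With zero initial weights the prediction is 0 and the target is 1, so the weights move by $\eta \cdot 1 \cdot \phi(s)$ — exactly the "nudge toward the target" described above.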
2. Real-World Applications & Case Studies
- Checkers (1959): Arthur Samuel used self-play and linear evaluation functions on 9KB of memory, achieving amateur-level play.
- TD-Gammon (1992): Gerald Tesauro utilized neural networks and self-play without intermediate rewards, reaching human expert level and discovering new strategic insights.
- AlphaGo Zero (2017): Used deep reinforcement learning and Monte Carlo Tree Search (MCTS) to master Go. It relied on raw board positions rather than hand-engineered features, surpassing human performance.
3. Simultaneous Games
Unlike turn-based games, simultaneous games cannot be modeled with a standard game tree.
- Two-Finger Morra Example: A game where players simultaneously show 1 or 2 fingers. The payoff matrix determines the utility.
- The Advantage of Going Second: In pure strategies, the second player has an information advantage. However, the Minimax Theorem proves that if players use mixed strategies (probability distributions over actions), the value of the game is identical regardless of who moves first.
- Optimal Strategy: By setting the expected values of different actions equal to each other, one can solve for the optimal probability $p$ for a mixed strategy.
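The equate-expected-values trick can be sketched for any 2×2 zero-sum game. The payoff matrix below is a hypothetical two-finger Morra assignment (the summary does not give the actual numbers): matching fingers pay the row player, mismatches pay the column player.

```python
from fractions import Fraction

def optimal_mix(A):
    """Optimal probability p of playing row 0 in a 2x2 zero-sum game.

    A[i][j] is the row player's payoff when the row player plays i and
    the column player plays j. Setting the row player's expected payoff
    equal across the opponent's two columns and solving for p gives:
        p = (A11 - A10) / (A00 - A01 - A10 + A11)
    """
    a, b = A[0]
    c, d = A[1]
    p = Fraction(d - c, a - b - c + d)
    value = p * a + (1 - p) * c  # same against either opposing column
    return p, value

# Hypothetical payoffs: both-1 pays +2, both-2 pays +4, mismatch pays -3.
A = [[2, -3], [-3, 4]]
p, v = optimal_mix(A)
print(p, v)  # → 7/12 -1/12
```

Because the mix makes both of the opponent's pure responses equally good, the opponent cannot exploit it — which is why the resulting game value holds regardless of who commits first.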
4. Non-Zero Sum Games
When the sum of utilities is not zero, the Minimax theorem does not apply.
- Prisoner’s Dilemma: A classic example where individual rational choices (testifying) lead to a suboptimal outcome for both players compared to mutual cooperation.
- Nash Equilibrium: A stable state where no player has an incentive to deviate. Unlike zero-sum games, non-zero sum games may have multiple Nash equilibria, and they are not necessarily globally optimal.
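The Prisoner's Dilemma equilibrium can be checked by brute force over the pure strategy profiles. The payoff numbers below are a standard textbook assignment (the summary gives none); higher is better, so utilities are negated sentence lengths:

```python
from itertools import product

# Hypothetical payoffs; actions are 0 = cooperate (stay silent), 1 = testify.
PAYOFFS = {
    (0, 0): (-1, -1),   # both silent: light sentences
    (0, 1): (-5,  0),   # only player 2 testifies
    (1, 0): ( 0, -5),   # only player 1 testifies
    (1, 1): (-3, -3),   # both testify
}

def is_nash(a1, a2):
    """A profile is a pure Nash equilibrium if neither player gains by
    unilaterally switching their own action."""
    u1, u2 = PAYOFFS[(a1, a2)]
    best1 = max(PAYOFFS[(b, a2)][0] for b in (0, 1))
    best2 = max(PAYOFFS[(a1, b)][1] for b in (0, 1))
    return u1 == best1 and u2 == best2

equilibria = [prof for prof in product((0, 1), repeat=2) if is_nash(*prof)]
print(equilibria)  # → [(1, 1)]
```

Mutual testifying is the only equilibrium even though mutual cooperation gives both players higher utility — the gap between stability and global optimality that the section describes.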
5. Synthesis and Takeaways
- Why RL for Games? Even when the rules (MDP) are known, the state space is often too vast for traditional value iteration. RL with function approximation allows agents to generalize across states.
- Self-Play: By using the same value function for both the agent (maximizing) and the opponent (minimizing), agents can learn to play optimally against any strategy.
- Strategic Stability: In zero-sum games, Minimax strategies provide a "safety net" against any opponent. In non-zero sum games, Nash equilibria provide a framework for understanding stable, albeit sometimes suboptimal, outcomes.
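The self-play idea — one value function serving both sides — can be sketched with a toy move-selection rule; the states and $V$ table here are hypothetical stand-ins for a learned evaluation:

```python
def select_move(successors, V, maximizing):
    """Shared-value-function self-play: both players consult the same V,
    the agent maximizing it over successor states, the opponent
    minimizing it."""
    best = max if maximizing else min
    return best(successors, key=V)

# Toy illustration: states are integers, V is a hypothetical learned table.
V = {1: 0.2, 2: 0.9, 3: 0.4}.get
print(select_move([1, 2, 3], V, maximizing=True))   # → 2 (agent's pick)
print(select_move([1, 2, 3], V, maximizing=False))  # → 1 (opponent's pick)
```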
"The Minimax theorem says that for every simultaneous two-player zero-sum game... going first and going second is the same." — a paraphrase of John von Neumann's minimax theorem (1928)