Stanford AA228 Decision Making Under Uncertainty | Autumn 2025 | Offline Belief State Planning

By Stanford Online


Key Concepts

  • Approximate Offline POMDP Solutions: Methods for finding near-optimal policies in Partially Observable Markov Decision Processes (POMDPs) by performing all computation before deployment.
  • Upper & Lower Bounds: Techniques to estimate the true optimal value function from above (QMDP) and below (PBVI), providing guarantees on solution quality.
  • Alpha Vectors: A vector representation of value functions and policies, giving the expected utility of each state under a given action or conditional plan; the value of a belief is its best dot product against the set of alpha vectors.
  • Belief Point Selection: Crucial heuristics for choosing representative beliefs to efficiently approximate the value function.
  • Trade-offs: Balancing computational cost, accuracy, and the ability to incentivize information gathering in POMDP solution methods.

Introduction & The Challenge of POMDPs

The lecture series continues with a focus on approximating solutions to Partially Observable Markov Decision Processes (POMDPs). Exact solutions are intractable even for moderately sized problems because the number of conditional plans grows super-exponentially with the time horizon: a simple two-action, two-observation POMDP admits roughly 2^1023 ≈ 10^308 conditional plans for a time horizon of 10. This necessitates approximate offline methods, where all computation is performed before deployment, accepting a degree of sub-optimality in exchange for scalability. A progression of skills is highlighted: design (current course), optimize (spring course), and validate (winter course, AA228V/CS238V).
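For reference, the count follows the standard formula for the number of h-step conditional plans with |A| actions and |O| observations; the arithmetic below is a reconstruction of that figure rather than a quote from the lecture:

$$|A|^{\frac{|O|^{h}-1}{|O|-1}} \;=\; 2^{\frac{2^{10}-1}{2-1}} \;=\; 2^{1023} \;\approx\; 10^{308}.$$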

Revisiting MDP Solutions & QMDP

The discussion revisits applying Markov Decision Process (MDP) solutions to POMDPs, specifically utilizing the Q-function (state-action value function). QMDP weights the fully observable Q-values by the belief state, effectively interpolating between them: the value of taking action a in belief b is Q(b, a) = Σ_s b(s) Q_MDP(s, a). QMDP is used in systems like the ACAS X aircraft collision avoidance system and provides an upper bound on the true value function. However, it is criticized for not incentivizing information-gathering actions, because it implicitly assumes full state observability at the next time step. QMDP can be re-expressed using alpha vectors, where each action's alpha vector gives the expected utility of being in a specific state, taking that action, and then following the optimal (fully observable) policy thereafter.
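A minimal sketch of this idea in Python (NumPy), assuming tabular arrays T[a, s, s'] and R[s, a]; the names and iteration count are illustrative rather than the lecture's notation:

```python
import numpy as np

def qmdp_alpha_vectors(T, R, gamma, iters=100):
    """QMDP: solve the underlying MDP, then use Q(s, a) as alpha vectors.

    T: shape (A, S, S), T[a, s, s'] = P(s' | s, a)   (assumed layout)
    R: shape (S, A), immediate reward
    Returns alpha vectors of shape (A, S): alphas[a][s] = Q_MDP(s, a).
    """
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = Q.max(axis=1)                              # assume full observability at the next step
        Q = R + gamma * np.einsum("ast,t->sa", T, V)   # Bellman backup on states
    return Q.T

def qmdp_action(alphas, b):
    """Pick the action whose alpha vector has the largest dot product with belief b."""
    return int(np.argmax(alphas @ b))                  # Q(b, a) = sum_s b(s) Q_MDP(s, a)
```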

Point-Based Value Iteration (PBVI) – Lower Bounds

To improve upon QMDP, Point-Based Value Iteration (PBVI) is introduced. PBVI maintains a set of alpha vectors (Γ) associated with a set of belief points (B). The alpha vectors are initialized to a lower bound on the value function, computed either with a best-action, worst-state bound or by assuming a policy that takes the same action forever. The core of PBVI is the backup operation, which iterates through the belief points and attempts to improve the current policy at each one.
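A small sketch of the best-action, worst-state initialization, under the same tabular assumptions as above:

```python
import numpy as np

def baws_lower_bound(R, gamma):
    """Best-action, worst-state bound: repeat the single best worst-case action forever.

    R: shape (S, A). Returns one alpha vector (shape (S,)) that underestimates V*.
    """
    r_baws = R.min(axis=0).max()                  # worst state for each action, then best action
    return np.full(R.shape[0], r_baws / (1.0 - gamma))

# PBVI typically starts with Gamma = [baws_lower_bound(R, gamma)] and improves from there.
```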

The PBVI Backup Process

The PBVI backup process involves iterating through each belief point, considering all possible actions and observations, updating the belief based on the action and observation (using methods from Chapter 19), determining the alpha vector that would dominate at the new belief, and replacing the current alpha vector with the new, dominating one. The goal is to iteratively refine the alpha vectors to provide a more accurate lower bound on the optimal value function.
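A condensed sketch of a single point-based backup at one belief b, assuming tabular T[a, s, s'], observation model O[a, s', o], rewards R[s, a], and a list Gamma of alpha vectors; the helper names are illustrative:

```python
import numpy as np

def update_belief(b, a, o, T, O):
    """Bayesian belief update: b'(s') is proportional to O[a, s', o] * sum_s T[a, s, s'] * b(s)."""
    bp = O[a, :, o] * (b @ T[a])                  # b @ T[a] sums over the current state s
    total = bp.sum()
    return bp / total if total > 0 else np.full_like(bp, 1.0 / len(bp))

def pbvi_backup(b, Gamma, T, O, R, gamma):
    """One point-based backup at belief b; returns the new dominating alpha vector."""
    S, A = R.shape
    n_obs = O.shape[2]
    best_alpha, best_value = None, -np.inf
    for a in range(A):
        alpha_a = R[:, a].astype(float)
        for o in range(n_obs):
            bp = update_belief(b, a, o, T, O)
            # alpha vector in Gamma that dominates at the updated belief
            alpha_ao = max(Gamma, key=lambda alph: alph @ bp)
            # add the expected discounted continuation value for each current state s
            alpha_a += gamma * (T[a] * O[a, :, o]) @ alpha_ao
        if alpha_a @ b > best_value:
            best_alpha, best_value = alpha_a, alpha_a @ b
    return best_alpha
```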

Belief Point Selection Heuristics

The choice of belief points is crucial for efficient planning. Two heuristic methods are detailed: Random Belief Expansion and Exploratory Belief Expansion. Random Belief Expansion randomly selects an action from a belief, simulates the resulting observation, updates the belief, and adds the new belief to the set. Exploratory Belief Expansion considers all possible actions, calculates the resulting beliefs, and selects the belief furthest from the current set. These methods are acknowledged as heuristics, not guaranteed to find the optimal set of belief points, but domain knowledge can improve their performance.
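A rough sketch of exploratory belief expansion under the same tabular assumptions, reusing update_belief from the backup sketch above; the L1 distance and the single sampled observation per action are assumptions where the summary leaves details unspecified:

```python
import numpy as np

rng = np.random.default_rng(0)

def exploratory_expand(B, T, O):
    """From each belief, try every action, simulate one step, and keep the successor
    belief that is furthest (L1 distance) from the current belief set."""
    new_beliefs = []
    for b in B:
        best_bp, best_dist = None, -1.0
        for a in range(T.shape[0]):
            # simulate one step: sample state, successor state, then observation
            s = rng.choice(len(b), p=b)
            sp = rng.choice(len(b), p=T[a, s])
            o = rng.choice(O.shape[2], p=O[a, sp])
            bp = update_belief(b, a, o, T, O)
            dist = min(np.abs(bp - bb).sum() for bb in B)   # distance to nearest existing belief
            if dist > best_dist:
                best_bp, best_dist = bp, dist
        new_beliefs.append(best_bp)
    return B + new_beliefs
```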

Randomized PBVI & Trade-offs

Randomized PBVI aims to reduce computational cost by randomly selecting a belief point for backup in each iteration and dropping from the sweep any beliefs whose values are already improved by the newly added alpha vector. This introduces a trade-off between accuracy and computation. The segment emphasizes the importance of focusing on reachable beliefs to avoid wasting resources on beliefs that can never arise. On the upper-bound side, QMDP gives a bound on the optimal value function and the Fast Informed Bound (FIB) tightens it at a higher computational cost; lower-bound estimates, by contrast, are often used to initialize algorithms like PBVI.
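A sketch of one randomized sweep in the spirit of the description above (Perseus-style randomized point-based value iteration), reusing pbvi_backup from earlier; the improvement test and stopping rule are assumptions rather than the lecture's exact algorithm:

```python
import numpy as np

def randomized_pbvi_sweep(B, Gamma, T, O, R, gamma, rng=np.random.default_rng(0)):
    """One randomized sweep: back up randomly chosen beliefs until every belief in B
    has been improved (or left no worse), keeping one alpha vector per backup."""
    remaining = list(range(len(B)))
    new_Gamma = []
    while remaining:
        i = remaining[rng.integers(len(remaining))]
        alpha = pbvi_backup(B[i], Gamma, T, O, R, gamma)
        old_value = lambda b: max(a @ b for a in Gamma)      # value under the previous alpha set
        if alpha @ B[i] < old_value(B[i]):
            # keep the old dominating vector if the backup did not improve this belief
            alpha = max(Gamma, key=lambda a: a @ B[i])
        new_Gamma.append(alpha)
        # drop every belief already improved by the new vector (prunes redundant backups)
        remaining = [j for j in remaining if (alpha @ B[j]) < old_value(B[j])]
    return new_Gamma
```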

Real-World Examples & Applications

The concepts are illustrated with examples like the crying baby problem (demonstrating convergence of alpha vectors) and one-on-one basketball (highlighting the irrelevance of unreachable beliefs). A hypothetical scenario involving safety-critical aircraft collision avoidance underscores the importance of accurate value function estimation even for rare, high-consequence beliefs.

Conclusion

The lecture highlights the challenges of solving POMDPs and the importance of approximate offline methods. Techniques like QMDP and PBVI offer ways to obtain upper and lower bounds on the optimal value function, with a crucial trade-off between computational cost and accuracy. Effective belief point selection is paramount for efficient planning, and understanding the limitations of these heuristics is essential for applying them to real-world problems. The overall takeaway is that while finding exact solutions to POMDPs is often impossible, these approximate methods provide valuable tools for making informed decisions in uncertain environments.
