Stanford AA228V | Validation of Safety Critical Systems | Explainability
Key Concepts
- Explainability & Interpretability: Methods to understand why AI models make specific decisions, crucial for safety-critical systems.
- Shapley Values: A game-theoretic approach for attributing a model's outcome to the contributions of individual features or time steps.
- Policy Visualization: Mapping state spaces to identify "dead zones" or regions where a model lacks training data.
- Spurious Correlations: When models rely on irrelevant patterns (e.g., background colors or timestamps) rather than robust features (the "Clever Hans" effect).
- Saliency Maps & Integrated Gradients: Techniques to visualize which input pixels or features most influence a model's prediction.
- Mechanistic Interpretability: The frontier of AI research aimed at reverse-engineering the internal "circuits" of neural networks (e.g., using sparse autoencoders) to understand their reasoning.
- Causal vs. Bayesian Networks: Distinguishing between statistical correlation (Bayesian) and causal mechanisms (Causal) to predict outcomes under intervention.
1. Project 3 Results: Reachability Analysis
The lecture began by reviewing the leaderboard for Project 3, which focused on reachability analysis.
- Small Systems: Winners utilized box over-approximation and PCA-aligned rectangles to bound the reachable states (a toy sketch of both follows this list).
- Medium Systems: Top performers employed Taylor expansion (first-order and Hessian-based second-order derivatives) to linearize systems.
- Large Systems: The "AI squared" verification technique was the dominant approach. Success was attributed to focusing partitioning on downstream states rather than initial states.
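As a rough illustration of the small-system approach, the sketch below fits an axis-aligned box and a PCA-aligned rectangle around sampled next states to over-approximate a reachable set. It is a toy example, not a winning submission: the dynamics, sample count, and noise model are made up.

```python
import numpy as np

def axis_aligned_box(states):
    """Axis-aligned bounding box of sampled states: per-dimension min/max."""
    return states.min(axis=0), states.max(axis=0)

def pca_aligned_box(states):
    """Bounding box in the PCA-aligned frame: rotate onto the principal axes,
    box the rotated points, and keep the rotation for later checks."""
    mean = states.mean(axis=0)
    centered = states - mean
    _, _, vt = np.linalg.svd(centered, full_matrices=False)  # principal axes
    projected = centered @ vt.T
    return mean, vt, projected.min(axis=0), projected.max(axis=0)

# Toy setup: propagate sampled initial states through hypothetical linear dynamics.
rng = np.random.default_rng(0)
x0 = rng.uniform(-0.1, 0.1, size=(1000, 2))          # sampled initial set
A = np.array([[1.0, 0.1], [0.0, 0.95]])              # placeholder dynamics matrix
x1 = x0 @ A.T + 0.01 * rng.standard_normal((1000, 2))

lo, hi = axis_aligned_box(x1)
print("axis-aligned box:", lo, hi)
mean, axes, plo, phi = pca_aligned_box(x1)
print("PCA-aligned extents:", plo, phi)
```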
2. The Three Pillars of Explainability
The lecturer framed interpretability through three critical questions for stakeholders (e.g., regulators, investors):
- Why did the failure happen? (Attribution)
- How can we mitigate it? (Actionable improvement)
- How can we guarantee it won't happen again? (Verification)
3. Methodologies for Explainability
A. Feature Attribution (Time Series)
- Leave-one-out Analysis: Re-simulating trajectories by zeroing out noise at specific time steps to identify the "faulty" moment.
- Shapley Values: Used to handle redundant or synergistic features. By averaging a feature's marginal contribution over all possible subsets of the remaining features, one can quantify its impact; a Monte Carlo estimate over random orderings is sketched below. Limitation: Exact computation requires all $N!$ feature orderings (equivalently, $2^N$ subsets), which is computationally infeasible for high-dimensional spaces.
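A minimal sketch of the idea, using a toy scalar system and a placeholder `rollout_cost` metric (neither comes from the lecture): the Shapley value of a time step is estimated by averaging its marginal effect on the cost over random orderings, where "removing" a time step means zeroing its disturbance before re-simulating.

```python
import numpy as np

def rollout_cost(noise, keep):
    """Toy trajectory cost: simulate a scalar system, zeroing disturbances at
    time steps not in `keep`, and return the peak |state| (placeholder metric)."""
    x, worst = 0.0, 0.0
    for t, w in enumerate(noise):
        x = 0.9 * x + (w if keep[t] else 0.0)
        worst = max(worst, abs(x))
    return worst

def shapley_estimate(noise, t, n_samples=500, rng=None):
    """Monte Carlo Shapley value of time step t: average marginal change in
    cost when t is added to the random subset of steps that precede it."""
    rng = rng or np.random.default_rng(0)
    T, total = len(noise), 0.0
    for _ in range(n_samples):
        order = rng.permutation(T)
        pos = np.where(order == t)[0][0]
        keep = np.zeros(T, dtype=bool)
        keep[order[:pos]] = True            # steps preceding t in this ordering
        without = rollout_cost(noise, keep)
        keep[t] = True
        with_t = rollout_cost(noise, keep)
        total += with_t - without
    return total / n_samples

noise = np.array([0.0, 0.0, 1.5, 0.1, 0.0])   # a large disturbance at t=2
print([round(shapley_estimate(noise, t), 3) for t in range(len(noise))])
```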
B. Policy Visualization
- By plotting the state space, engineers can identify "dead zones" where the model behaves erratically. This often reveals that the model was trained on a limited distribution (e.g., behavioral cloning of an expert) and fails when pushed into edge cases.
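One common way to produce such a plot, sketched below with a hypothetical `policy` function and state bounds (not the lecture's system), is to evaluate the policy on a dense grid over two state dimensions and image the selected action; untrained or erratically behaving regions then stand out visually.

```python
import numpy as np
import matplotlib.pyplot as plt

def policy(state):
    """Placeholder policy: any trained model's action function could go here."""
    pos, vel = state
    return np.tanh(-1.5 * pos - 0.8 * vel)   # toy continuous action

# Evaluate the policy on a grid over two state dimensions.
pos = np.linspace(-2, 2, 200)
vel = np.linspace(-3, 3, 200)
P, V = np.meshgrid(pos, vel)
actions = np.vectorize(lambda p, v: policy((p, v)))(P, V)

plt.pcolormesh(P, V, actions, shading="auto", cmap="coolwarm")
plt.colorbar(label="action")
plt.xlabel("position")
plt.ylabel("velocity")
plt.title("Policy over the state space")
plt.show()
```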
C. Vision Model Interpretability
- Saliency Maps: Computing the gradient of the loss with respect to input pixels. Limitation: Often produces noisy, uninterpretable results due to numerical issues in softmax layers.
- Integrated Gradients: Interpolating from a baseline (e.g., a black image) to the input and accumulating gradients along the path, providing a more robust feature importance map (a minimal sketch follows this list).
- Grad-CAM: Differentiating through semantic layers (feature maps) rather than pixels to localize concepts (e.g., identifying that a model looks at a dog's head vs. a cat's tail).
- Sanity Checks: The lecturer emphasized that many interpretability methods fail "randomization tests," where the explanation remains unchanged even if the model's weights are randomized.
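A minimal sketch of integrated gradients, assuming a PyTorch classifier; the `model`, image shape, and target class below are placeholders rather than the lecture's example.

```python
import torch

def integrated_gradients(model, x, target, steps=50):
    """Integrated gradients: average the input gradient along a straight path
    from a black baseline to x, then scale by (x - baseline)."""
    baseline = torch.zeros_like(x)
    grads = torch.zeros_like(x)
    for alpha in torch.linspace(0.0, 1.0, steps):
        xi = (baseline + alpha * (x - baseline)).clone().requires_grad_(True)
        score = model(xi.unsqueeze(0))[0, target]
        score.backward()
        grads += xi.grad
    return (x - baseline) * grads / steps

# Toy usage with a tiny, untrained network standing in for a real vision model.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
image = torch.rand(3, 32, 32)
attribution = integrated_gradients(model, image, target=3)
print(attribution.shape)   # per-pixel importance, same shape as the input
```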
4. Mechanistic Interpretability (The Frontier)
The lecture transitioned to the challenge of understanding internal model logic, specifically in LLMs.
- The Problem of Spurious Internal Logic: Even if sensitive features (e.g., ethnicity) are removed from inputs, models may reconstruct them via correlations with other features (e.g., zip code).
- Sparse Autoencoders (SAEs): A method to decompose high-dimensional embeddings into sparse, interpretable "directions" or basis vectors. By training an encoder-decoder with an L1 penalty, researchers can isolate specific concepts (e.g., "Golden Gate Bridge"); a minimal training sketch follows this list.
- Circuit Tracing: By identifying these conceptual nodes, researchers can build causal graphs to trace how an LLM arrives at a specific output. This allows for interventions—such as zeroing out an "ethnicity" feature at runtime to ensure fairness.
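A minimal sketch of training such a sparse autoencoder, assuming model activations have already been collected as vectors; the dimensions, data, and L1 weight below are placeholders, not values from the lecture.

```python
import torch

class SparseAutoencoder(torch.nn.Module):
    """Overcomplete autoencoder: a ReLU code much wider than the embedding,
    trained so each activation is reconstructed from a few active directions."""
    def __init__(self, d_model=256, d_code=2048):
        super().__init__()
        self.encoder = torch.nn.Linear(d_model, d_code)
        self.decoder = torch.nn.Linear(d_code, d_model)

    def forward(self, x):
        code = torch.relu(self.encoder(x))
        return self.decoder(code), code

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
activations = torch.randn(4096, 256)          # stand-in for collected LLM embeddings
l1_weight = 1e-3                              # sparsity pressure on the code

for step in range(200):
    batch = activations[torch.randint(0, len(activations), (64,))]
    recon, code = sae(batch)
    loss = torch.mean((recon - batch) ** 2) + l1_weight * code.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# A learned "direction" is a decoder column; zeroing its code unit at runtime
# is the kind of intervention described above (e.g., suppressing a concept).
print(sae.decoder.weight[:, 0].shape)
```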
5. Synthesis and Conclusion
The lecture concluded that while simple attribution methods (Shapley, Saliency) are useful for low-dimensional systems, they are insufficient for modern, complex models. The field is shifting toward mechanistic interpretability, which treats neural networks as systems to be reverse-engineered. The ultimate goal is to move from purely observational statistical models to causal models that allow for robust interventions and formal verification, ensuring that AI systems are not just accurate, but explainable and safe.