Stanford CS230 | Autumn 2025 | Lecture 10: What’s Going On Inside My Model?

Key Concepts

Neural Network Interpretability: Moving beyond “black box” models to understand the reasoning behind their decisions, crucial for trust, debugging, and safety.
CNN Interpretability Techniques: Methods like saliency maps, integrated gradients, occlusion sensitivity, CAM, and deconvolution for visualizing feature importance in convolutional networks.
Transformer Interpretability Techniques: Visualizing attention patterns and embeddings to understand relationships between tokens and semantic meaning in large language models.
Model Diagnostics: Systematic troubleshooting of models through analysis of training curves, gradient norms, scaling laws, and data characteristics.
Data Integrity: Ensuring the validity of benchmarks by identifying and mitigating data contamination and addressing data distribution imbalances.

Understanding Neural Network Interpretability & Frontier Models

This lecture focuses on understanding the inner workings of neural networks, moving from established interpretability techniques for Convolutional Neural Networks (CNNs) to the challenges posed by modern “frontier” models (large language and vision models). The core theme is moving beyond “black box” models to understand why they make certain decisions, which is particularly important for building trust and debugging issues. The lecture takes a case study approach, exploring methods for understanding model behavior in both practical and research contexts. The running example is a 200-billion-parameter model whose reasoning is getting worse, which is failing safety evaluations, and which shows latency spikes.

Interpreting Convolutional Neural Networks (CNNs)

Interpreting CNNs involves understanding the relationship between input and output. Techniques include saliency maps (derivatives of the pre-softmax class score with respect to the input image), integrated gradients (accumulating gradients along a path from a baseline image to the input), and occlusion sensitivity (masking portions of the input and observing the impact on the output). Class Activation Maps (CAM) modify the CNN architecture, replacing the final fully connected layers with global average pooling followed by a single linear layer, so that each feature map's contribution to a class prediction can be visualized. Class model visualization uses gradient ascent to generate inputs that maximize a specific neuron activation or class score, while dataset search identifies the inputs that most strongly activate a particular feature map.
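The three input-attribution methods above can be compared on a toy example. The following is a minimal sketch, not the lecture's implementation: it assumes a hypothetical four-pixel "image" and a made-up quadratic scoring function (`WEIGHTS`, `score`, and `grad_score` are illustrative names) whose gradient is known in closed form, so no deep learning framework is needed. In a real CNN the gradient would come from backpropagation.

```python
from typing import List

Vector = List[float]

# Hypothetical toy "model": a pre-softmax class score over a flattened
# 4-pixel image. The weights are made up for illustration.
WEIGHTS: Vector = [0.5, -1.0, 2.0, 0.0]


def score(x: Vector) -> float:
    """Pre-softmax class score: sum_i w_i * x_i^2."""
    return sum(w * xi * xi for w, xi in zip(WEIGHTS, x))


def grad_score(x: Vector) -> Vector:
    """Analytic gradient of the score with respect to each input pixel."""
    return [2.0 * w * xi for w, xi in zip(WEIGHTS, x)]


def saliency(x: Vector) -> Vector:
    """Saliency map: absolute gradient of the score at the input."""
    return [abs(g) for g in grad_score(x)]


def integrated_gradients(x: Vector, baseline: Vector, steps: int = 50) -> Vector:
    """Average the gradient along the straight path baseline -> x
    (midpoint Riemann sum), then scale by (x - baseline)."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_score(point)):
            total[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


def occlusion_sensitivity(x: Vector, fill: float = 0.0) -> Vector:
    """Score drop when each pixel is replaced by a fill value, one at a time."""
    base = score(x)
    drops = []
    for i in range(len(x)):
        occluded = list(x)
        occluded[i] = fill
        drops.append(base - score(occluded))
    return drops


x = [1.0, 2.0, 3.0, 4.0]
baseline = [0.0, 0.0, 0.0, 0.0]
sal = saliency(x)
ig = integrated_gradients(x, baseline)
occ = occlusion_sensitivity(x)
print("saliency:", sal)
print("integrated gradients:", ig)
print("occlusion drops:", occ)
```

One useful sanity check is that the integrated-gradient attributions sum to `score(x) - score(baseline)`, a property usually called completeness. Saliency, by contrast, only reports local sensitivity and discards the sign of each pixel's contribution.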

Deconvolution (or transposed convolution/sub-pixel convolution) is a key technique for reverse engineering a network: it reconstructs the input pattern that led to a specific activation. The process has four steps. First, an input is sent forward through the network. Second, the highest activation in a chosen feature map is identified and all other activations are zeroed out. Third, the network is run in reverse: unpooling restores each activation map using “switches” that recorded the location of each max value during forward propagation, and the learned filters are applied in transposed form. Finally, the result pinpoints the input pixels responsible for that activation.
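The unpooling step can be sketched in isolation. This is an illustrative example under simplifying assumptions: a single made-up 4x4 feature map, non-overlapping 2x2 max pooling, and dimensions divisible by the pool size. It covers only the switch-based unpooling and the "keep the strongest activation" step, not the transposed filters of a full deconvolution network. All function names are hypothetical.

```python
from typing import List, Tuple

Grid = List[List[float]]
Switches = List[List[Tuple[int, int]]]


def max_pool_with_switches(fmap: Grid, size: int = 2) -> Tuple[Grid, Switches]:
    """Non-overlapping max pooling that also records where each max came from."""
    pooled, switches = [], []
    for r in range(0, len(fmap), size):
        pooled_row, switch_row = [], []
        for c in range(0, len(fmap[0]), size):
            window = [
                (fmap[i][j], (i, j))
                for i in range(r, r + size)
                for j in range(c, c + size)
            ]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            switch_row.append(position)
        pooled.append(pooled_row)
        switches.append(switch_row)
    return pooled, switches


def keep_strongest(pooled: Grid) -> Grid:
    """Zero out every activation except the single largest one."""
    flat = [(v, r, c) for r, row in enumerate(pooled) for c, v in enumerate(row)]
    _, best_r, best_c = max(flat, key=lambda item: item[0])
    return [
        [v if (r, c) == (best_r, best_c) else 0.0 for c, v in enumerate(row)]
        for r, row in enumerate(pooled)
    ]


def unpool(pooled: Grid, switches: Switches, height: int, width: int) -> Grid:
    """Place each pooled value back at its recorded location; zeros elsewhere."""
    out = [[0.0] * width for _ in range(height)]
    for pooled_row, switch_row in zip(pooled, switches):
        for value, (i, j) in zip(pooled_row, switch_row):
            out[i][j] = value
    return out


feature_map = [
    [1.0, 3.0, 2.0, 0.0],
    [4.0, 2.0, 1.0, 5.0],
    [0.0, 1.0, 9.0, 2.0],
    [2.0, 6.0, 3.0, 1.0],
]
pooled, switches = max_pool_with_switches(feature_map)
reconstruction = unpool(keep_strongest(pooled), switches, 4, 4)
for row in reconstruction:
    print(row)
```

Without the switches, unpooling would have to guess where inside each 2x2 window the maximum sat. Recording them keeps the reconstruction spatially aligned with the original input.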

Early research visualized the first layer of CNNs, revealing filters sensitive to edges (diagonal or straight). Visualizing deeper layers revealed filters that detect more complex shapes, showing that deeper layers represent increasingly abstract features. The size of the cropped input region shown for each activation corresponds to the receptive field of that layer: small for early layers, larger for later layers, because each deeper unit depends on a wider patch of the input.
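The growth of the input crop with depth follows from standard receptive-field arithmetic. As a rough sketch, assuming a hypothetical stack of 3x3 convolutions and 2x2 pooling layers (the layer list below is made up, not taken from the lecture), the receptive field after each layer can be computed as:

```python
from typing import List, Tuple

# Each layer is (kernel_size, stride). Hypothetical stack:
# conv3x3, pool2x2, conv3x3, pool2x2, conv3x3.
LAYERS: List[Tuple[int, int]] = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]


def receptive_fields(layers: List[Tuple[int, int]]) -> List[int]:
    """Receptive field size (in input pixels) after each layer.

    Recurrence: r_l = r_{l-1} + (k_l - 1) * j_{l-1}, with j_l = j_{l-1} * s_l,
    where j is the spacing between neighbouring units measured in input
    pixels. Starts from r_0 = 1, j_0 = 1.
    """
    size, jump = 1, 1
    sizes = []
    for kernel, stride in layers:
        size += (kernel - 1) * jump
        jump *= stride
        sizes.append(size)
    return sizes


fields = receptive_fields(LAYERS)
print(fields)  # [3, 4, 8, 10, 18]
```

In this example a unit in the first layer sees a 3x3 patch while a unit in the fifth layer sees an 18x18 patch, which is why visualizations for deeper layers crop larger regions of the image.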

Interpreting Modern Models: Transformers & Beyond

The focus shifts from localized information in CNNs to relationships and meanings between concepts in modern models like Transformers. Attention mechanisms reveal relationships between tokens (words or word pieces) in a sequence, with each attention head learning different patterns. Visualizing these patterns is analogous to CNN saliency maps. Embeddings represent how the language model perceives words, and dimensionality reduction techniques like t-SNE can visualize these embeddings, grouping semantically similar tokens together. More advanced research introduces concepts such as “induction heads” and “transformer circuits”, which aim to explain model behavior in terms of specific interacting components; these analyses are considerably more involved.
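What an attention visualization actually plots is the matrix of attention weights for one head. Below is a minimal sketch of how that matrix is computed with scaled dot-product attention, using made-up two-dimensional query and key vectors for three tokens. The vectors and names are purely illustrative, and no causal mask is applied.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries: Matrix, keys: Matrix) -> Matrix:
    """Row i holds how strongly token i attends to every token j:
    softmax_j(q_i . k_j / sqrt(d))."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    return [
        softmax([scale * sum(q * k for q, k in zip(query, key)) for key in keys])
        for query in queries
    ]


# Hypothetical per-token query/key vectors for a single attention head.
tokens = ["the", "cat", "sat"]
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]

weights = attention_weights(queries, keys)
for token, row in zip(tokens, weights):
    # Text rendering of one heatmap row: which tokens this token attends to.
    print(f"{token:>4}: " + "  ".join(f"{w:.2f}" for w in row))
```

Each row sums to 1, so it can be read as a distribution over the tokens being attended to. Plotting one such matrix per head is the usual way to see that different heads pick up different patterns.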

Diagnosing Model Performance: Training, Scaling & Data

Troubleshooting model performance requires a multi-level analysis, assessing both the language model itself and its performance within an agentic workflow. Diagnostics fall into four categories: training and scaling (loss curves, gradients, scaling laws), representations and internals (attention heads, embeddings), output-level behaviors (benchmarking), and data and distribution (contamination, bias).

Monitoring loss curves, gradient norms, and learning rate schedules provides insight into the training process. Scaling laws (DeepMind’s Chinchilla paper, 2022) show that, for a fixed compute budget, model size and the number of training tokens should be scaled up together in roughly equal proportion, which implies that many earlier large models were undertrained for their size. These laws help determine whether to invest in compute, data, or model capacity. For Mixture of Experts (MoE) models, monitoring the router is crucial to ensure all experts are utilized effectively.
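Two of these diagnostics reduce to simple arithmetic. The sketch below relies on two widely used rules of thumb that are approximations rather than exact results from the lecture: training compute of about 6 FLOPs per parameter per token, and a compute-optimal ratio of about 20 training tokens per parameter. The router check assumes top-1 routing and an arbitrary 50% threshold. All function names and numbers are illustrative.

```python
import math
from collections import Counter
from typing import List, Tuple


def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute: C ~ 6 * N * D (rule of thumb)."""
    return 6.0 * params * tokens


def compute_optimal_allocation(
    flops: float, tokens_per_param: float = 20.0
) -> Tuple[float, float]:
    """Split a compute budget into (params, tokens), assuming C = 6*N*D
    and D = tokens_per_param * N. Both are approximations."""
    params = math.sqrt(flops / (6.0 * tokens_per_param))
    return params, tokens_per_param * params


def expert_load_fractions(assignments: List[int], num_experts: int) -> List[float]:
    """Fraction of tokens routed to each expert (top-1 routing assumed)."""
    counts = Counter(assignments)
    return [counts.get(e, 0) / len(assignments) for e in range(num_experts)]


def underused_experts(fractions: List[float], threshold: float = 0.5) -> List[int]:
    """Experts receiving less than `threshold` times the uniform share."""
    uniform_share = 1.0 / len(fractions)
    return [e for e, f in enumerate(fractions) if f < threshold * uniform_share]


# A 200-billion-parameter model at ~20 tokens per parameter.
budget = training_flops(params=2e11, tokens=20 * 2e11)
opt_params, opt_tokens = compute_optimal_allocation(budget)
print(f"budget ~ {budget:.2e} FLOPs -> N ~ {opt_params:.2e}, D ~ {opt_tokens:.2e}")

# Hypothetical router decisions for 10 tokens across 4 experts.
routing = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
fractions = expert_load_fractions(routing, num_experts=4)
underused = underused_experts(fractions)
print("load fractions:", fractions, "underused experts:", underused)
```

If a model of a given size has seen far fewer tokens than this ratio suggests, more data is likely a better investment than more parameters. A router that starves some experts points to a load-balancing problem rather than a capacity problem.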

Data diagnostics involve tracking domain proportions and token statistics, and running contamination checks (n-gram searches, hash comparisons, and embedding similarity checks) to ensure that benchmark results remain valid. Addressing data drift and the underrepresentation of specific domains is also critical.
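An n-gram contamination check combined with hash comparison can be sketched in a few lines. This example assumes whitespace tokenization, lowercasing, and 5-gram windows. These are illustrative choices, and the training and benchmark strings are made up. Real pipelines typically use the model's tokenizer and a much larger hashed index.

```python
import hashlib
from typing import List, Set


def ngram_hashes(text: str, n: int = 5) -> List[str]:
    """Hash every n-gram of lowercased, whitespace-split tokens."""
    tokens = text.lower().split()
    return [
        hashlib.sha1(" ".join(tokens[i:i + n]).encode("utf-8")).hexdigest()
        for i in range(len(tokens) - n + 1)
    ]


def build_index(training_docs: List[str], n: int = 5) -> Set[str]:
    """Set of n-gram hashes seen anywhere in the training data."""
    index: Set[str] = set()
    for doc in training_docs:
        index.update(ngram_hashes(doc, n))
    return index


def contamination_score(benchmark_item: str, index: Set[str], n: int = 5) -> float:
    """Fraction of the item's n-grams that also occur in the training data."""
    hashes = ngram_hashes(benchmark_item, n)
    if not hashes:
        return 0.0  # item shorter than n tokens: nothing to compare
    return sum(h in index for h in hashes) / len(hashes)


# Hypothetical training corpus and benchmark items.
training_docs = ["The quick brown fox jumps over the lazy dog near the river bank"]
index = build_index(training_docs)

leaked = "quick brown fox jumps over the lazy dog"
clean = "a model should generalize to unseen held out questions"

leaked_score = contamination_score(leaked, index)
clean_score = contamination_score(clean, index)
print(f"leaked item: {leaked_score:.2f}, clean item: {clean_score:.2f}")
```

Exact n-gram matching misses paraphrased leaks, which is the gap that embedding similarity checks are meant to cover.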

Conclusion

Understanding neural network interpretability is paramount for building trust, debugging complex models, and ensuring responsible AI development. While established techniques exist for CNNs, interpreting frontier models like Transformers requires new approaches focused on visualizing attention patterns and embeddings. A systematic diagnostic approach, encompassing training dynamics, scaling laws, and data integrity, is essential for identifying and resolving performance issues. The field is rapidly evolving, demanding continuous research and adaptation to unlock the full potential of these powerful models.