Why you should care about AI interpretability - Mark Bissell, Goodfire AI

By AI Engineer

Key Concepts

Mechanistic Interpretability, Reverse Engineering Neural Networks, Neuron-level Programming, Sparse Autoencoders, Model Debugging, Dynamic Prompting, Feature Attribution, Explainable Outputs, Superhuman Models, Model Diffs, Neural Programming, AI Safety, PII Detection, Red Teaming, Guardrails, Concept Painting.

Developing with AI Systems: The Need for Rigor

The speaker, Mark Bissell of Goodfire, highlights the challenges of developing with AI systems, particularly LLMs, given their non-deterministic nature and the difficulty of making precise guarantees about their behavior. He opens with the familiar anecdote of building an agent that ignores its instructions, leading to "whack-a-mole" prompt engineering, where fixing one issue breaks another.

Traditional mitigations, such as using LLMs as judges, are deemed unscalable because of cost and the burden of monitoring yet another non-deterministic system. Fine-tuning, while promising, requires domain-specific data and can produce unintended consequences such as spurious correlations, mode collapse, or reward hacking.

The core issue is a lack of rigor compared to traditional software development. Interpretability, specifically through Goodfire's Ember platform, offers a solution by enabling debugging and programming at the neuron level, providing guarantees similar to those in traditional software.

Ember Platform Demo: Neural Programming

Bissell demonstrates Ember with a live demo involving a Llama model. He shows how the model initially fails to keep an email address confidential. Using Ember's attribution feature, he identifies the features the model was considering when outputting the email, including "discussions of sensitive and protected information."
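The attribution step described above can be sketched as a toy calculation: score each candidate feature by its activation at the output token times how well its direction aligns with the direction that produced the leaked token. This is a minimal illustration of the idea, not Goodfire's Ember API; the feature names and numbers are invented.

```python
import numpy as np

# Illustrative feature labels -- not a real feature inventory.
names = [
    "email addresses",
    "discussions of sensitive and protected information",
    "formal business correspondence",
]
acts = np.array([0.6, 0.9, 0.1])       # feature activations at the output token
alignment = np.array([0.5, 0.7, 0.2])  # cosine with the output (logit) direction

# Attribution score: how much each active feature pushes toward the output.
scores = acts * alignment
top = names[int(np.argmax(scores))]
print(top)  # "discussions of sensitive and protected information" (0.9 * 0.7 = 0.63)
```

Ranking features this way surfaces which internal concepts most contributed to a given output, which is what lets the demo pinpoint the PII-related feature to steer.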

By increasing the activation of this feature, he steers the model to take PII more seriously, resulting in the model refusing to share the email. This demonstrates neural programming, where the model's behavior is directly influenced by manipulating its internal features.
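The steering mechanism can be sketched in a few lines: nudging a hidden activation along a feature direction by a chosen strength. This is a simplified stand-in for what an interpretability platform does internally, assuming features are (approximately) directions in activation space; the "PII" feature here is a random placeholder vector.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

# Hypothetical feature direction (e.g. "sensitive/protected information"),
# as might be recovered by a sparse autoencoder: a unit vector in activation space.
pii_feature = rng.normal(size=d_model)
pii_feature /= np.linalg.norm(pii_feature)

def steer(hidden_state: np.ndarray, direction: np.ndarray, strength: float) -> np.ndarray:
    """Add a scaled feature direction to a hidden activation vector."""
    return hidden_state + strength * direction

hidden = rng.normal(size=d_model)
steered = steer(hidden, pii_feature, strength=4.0)

# The projection onto the feature grows by exactly the strength added.
print(round(steered @ pii_feature - hidden @ pii_feature, 6))  # 4.0
```

Increasing the strength amplifies the concept in the model's computation (making it "take PII more seriously"); a negative strength would suppress it.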

Another example is dynamic prompting, in which a different prompt is injected depending on which features are active. In the demo, when the model starts discussing drinks to pair with pizza, a feature related to beverages and consumer brands fires, triggering a prompt that makes the model recommend Coca-Cola. The intervention is invisible to the user.
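The control flow of dynamic prompting is simple to sketch: monitor feature activations and append an extra instruction whenever a watched feature crosses a threshold. The feature name, threshold, and instruction below are illustrative, not Ember's actual interface.

```python
def dynamic_prompt(base_prompt, feature_activations, triggers, threshold=0.5):
    """Append an extra instruction for every monitored feature above threshold."""
    extra = [instr for feat, instr in triggers.items()
             if feature_activations.get(feat, 0.0) > threshold]
    return base_prompt if not extra else base_prompt + "\n" + "\n".join(extra)

# Invented trigger mirroring the demo's beverage example.
triggers = {
    "beverages and consumer brands": "When suggesting a drink, recommend Coca-Cola.",
}

active = {"beverages and consumer brands": 0.9}
print(dynamic_prompt("What goes well with pizza?", active, triggers))
```

Because the trigger is an internal feature rather than a keyword match, the injected prompt fires on the concept itself, even when the surface wording varies.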

Ember is already being used by companies such as Rakuten for multilingual PII detection and Haize Labs for red teaming. Goodfire is also exploring model diffs, which let developers compare a model before and after training and identify changes in its features, such as a model becoming sycophantic.
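One way a model diff could work, sketched under the assumption that both checkpoints share a feature basis: compare mean feature activations over a shared probe set and flag the features that shifted most. The feature labels and the "fine-tuning" shift below are fabricated for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_prompts, n_features = 200, 8

# Illustrative labels; real feature inventories come from the trained model.
names = [f"feature_{i}" for i in range(n_features)]
names[3] = "sycophantic_praise"

# Mean feature activations over the same probe prompts, before and after tuning.
base = rng.random((n_prompts, n_features))
tuned = base.copy()
tuned[:, 3] += 0.6  # pretend fine-tuning amplified this feature

shift = tuned.mean(axis=0) - base.mean(axis=0)
top = int(np.argmax(np.abs(shift)))
print(names[top])  # sycophantic_praise
```

Surfacing the largest activation shifts gives a concrete readout of what training changed, rather than relying on spot-checking outputs.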

UI/UX Implications: Paint with Ember

Bissell showcases Paint with Ember (paint.goodfire.ai), a live demo that allows users to paint with concepts learned by an image model. Instead of text prompts, users can directly paint concepts like "pyramid structure" onto a canvas, which then influences the generated image.

Users can manipulate the painted concepts, move them around, erase them, and replace them with others. They can also adjust the strength of features, like making a lion open its mouth more or less.

By clicking into concepts, users can explore sub-features and steer them. For example, by manipulating the sub-features of a lion face, users can interpolate between a lion and a tiger, revealing how the model conceptualizes these concepts.
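The lion-to-tiger blending described above can be expressed as linear interpolation between two concept vectors. This is a bare sketch of the idea; the vectors here are random stand-ins for learned image-model features.

```python
import numpy as np

rng = np.random.default_rng(2)
lion, tiger = rng.normal(size=32), rng.normal(size=32)

def interpolate(a: np.ndarray, b: np.ndarray, t: float) -> np.ndarray:
    """Blend two concept vectors: t=0 gives pure a, t=1 gives pure b."""
    return (1.0 - t) * a + t * b

half_lion_half_tiger = interpolate(lion, tiger, 0.5)
print(np.allclose(interpolate(lion, tiger, 0.0), lion))  # True
```

Sweeping `t` from 0 to 1 traces a path through the model's concept space, which is what makes the interpolation reveal how the model relates the two animals.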

Other Use Cases and Philosophical Argument

Beyond the demonstrated use cases, Goodfire is excited about explainable outputs for regulated industries, extracting scientific knowledge from superhuman systems, and improving model efficiency.

They are working with the Arc Institute to understand what biological concepts its genomics foundation model, Evo 2, has learned that humans do not yet know. They are also collaborating with a major health system to identify novel biomarkers of disease using genomics-based models.

Bissell concludes by arguing that interpretability is a fascinating and important problem because it addresses the fundamental lack of understanding of how these models work. He believes that engineers, by nature, want to understand how systems function, making interpretability a compelling field to explore.

Question and Answer

In response to a question about how to find these features, Bissell mentions sparse autoencoders as the current best practice. He notes that this is an active area of research and expects new techniques to emerge in the coming years.
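A sparse autoencoder, at its core, is an overcomplete encoder-decoder with a sparsity penalty: activations are decomposed into many more features than the model's hidden dimension, most of which are zero for any given input. The minimal numpy sketch below shows the forward pass and loss shape with randomly initialised weights standing in for a trained SAE.

```python
import numpy as np

rng = np.random.default_rng(3)
d_model, d_features = 16, 64  # features are overcomplete relative to d_model

# Random weights stand in for a trained sparse autoencoder.
W_enc = rng.normal(scale=0.1, size=(d_model, d_features))
b_enc = np.zeros(d_features)
W_dec = rng.normal(scale=0.1, size=(d_features, d_model))
b_dec = np.zeros(d_model)

def encode(x: np.ndarray) -> np.ndarray:
    """ReLU encoder: non-negative feature activations, sparse after training."""
    return np.maximum(0.0, x @ W_enc + b_enc)

def decode(f: np.ndarray) -> np.ndarray:
    """Reconstruct the original activation from feature activations."""
    return f @ W_dec + b_dec

x = rng.normal(size=d_model)      # a hidden activation from the model
f = encode(x)                     # interpretable feature activations
x_hat = decode(f)

# Training minimises reconstruction error plus an L1 sparsity penalty.
loss = np.sum((x - x_hat) ** 2) + 0.01 * np.sum(np.abs(f))
print(f.min() >= 0.0)  # True: activations are non-negative by construction
```

Each row of `W_dec` is a candidate feature direction, which is what the steering and attribution examples earlier operate on.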

Conclusion

The presentation argues that mechanistic interpretability is moving from the lab to real-world applications, offering practical value for AI engineers. It provides tools for debugging, steering, and understanding AI models at a granular level, leading to more reliable, controllable, and insightful AI systems. The demos of Ember and Paint with Ember showcase the potential of interpretability to revolutionize AI development and user interaction.
