AI & Cybersecurity: Dan Boneh Interviews Nicolas Carlini

Key Concepts

Adversarial Examples: Inputs to machine learning models that are intentionally crafted to cause misclassification or incorrect behavior.
Transferability: The property of adversarial examples to fool not only the model they were generated on but also other models, even those trained on different data or with different architectures.
White-box vs. Black-box Attacks: White-box attacks assume full knowledge of the model (weights, architecture), while black-box attacks rely on querying the model and observing its outputs.
Prompt Injection: A type of attack where malicious instructions are embedded within a prompt to manipulate an AI model into performing unintended or harmful actions.
Model Extraction: The process of reconstructing a model's functionality or even its exact weights by querying it and observing its responses.
Distillation: A technique to train a smaller, more efficient model by transferring knowledge from a larger, pre-trained model.
Data Poisoning: The act of injecting malicious data into a training dataset to compromise the integrity or behavior of the resulting model.
Differential Privacy: A mathematical framework that provides strong guarantees about the privacy of individual data points in a dataset used for training.
Agentic Models: AI models designed to take actions on behalf of a user, such as booking flights or sending emails.
Universal Adversarial Perturbations: Adversarial examples that are effective across a wide range of inputs and models.

History and Evolution of AI Security

Nicholas Carlini, a pioneer in adversarial examples, discusses his entry into AI security in 2015. He was drawn to the field because machine learning was emerging as a powerful technology, and there was a perceived lack of research on attacking these new models. This area, he notes, has since exploded, with an estimated 12,000 papers published on adversarial examples in 2024, indicating exponential growth. This field's rapid expansion is partly attributed to seminal work like Ian Goodfellow's papers around that time.

Defenses Against Adversarial Examples: Model Secrecy and Transferability

A common defense strategy is to keep model weights hidden ("model secrecy"). However, Carlini highlights that adversarial examples exhibit significant transferability. This means an attacker can generate adversarial examples on a publicly available or differently trained model and then apply them to a target model, even without direct access to its weights. While not always 100% effective, this transferability is often sufficient to compromise the target model.

Mechanism of Transferability: Transferability is stronger when models share similar architectures and are trained on similar data distributions.
Adaptive Attacks: Attackers can further improve transferability by making queries to the target model and using the input-output pairs to fine-tune their attacking model, making it more similar to the target.
Limitations of Secrecy: Keeping model weights hidden offers some barrier but does not provide complete security against sophisticated adversaries. It might increase the cost for an attacker, but it doesn't fundamentally solve the problem.

Real-World Impact of Adversarial Examples

While early examples like misclassifying a cat as guacamole might seem trivial, Carlini emphasizes that adversarial examples have become increasingly relevant as AI models are deployed in critical applications.

Abuse Classifiers: Adversarial examples can be used to bypass content moderation systems, allowing malicious content to be posted online.
Language Models (LLMs):
- "Build a Bomb" Scenario: LLMs are trained to refuse harmful requests. However, adversarial attacks can trick them into providing such information.
- Prompt Hacking vs. Algorithmic Attacks:
  - Prompt Hacking: Involves crafting prompts with contrived scenarios (e.g., "I'm writing a movie script...") to circumvent safety training. Carlini believes these are more likely to be fixed over time due to the models' stateless nature and the absurdity of such pretexts.
  - Algorithmic Attacks (e.g., FGSM): Using algorithms like Fast Gradient Sign Method (FGSM) to generate "gobbledygook" strings that, when appended to a query, confuse the model. These are considered harder to combat.
- Universal Jailbreaks: Research has shown that adversarial strings can be universal, transferring across different queries and even different LLMs (from small MNIST classifiers to trillion-parameter models). This is a significant concern.
Image Classifiers:
- Self-Driving Cars: Stickers on stop signs can be misclassified as speed limit signs, potentially leading to dangerous situations. However, Carlini questions the adversary's motivation, suggesting more direct methods of causing harm might be easier.
- Content Moderation: Adversarial examples can bypass image classifiers used to filter objectionable content on platforms like Twitter.
Positive Applications (Privacy Protection):
- Facial Recognition Defense: Adversarial examples can be used to "poison" one's own facial images uploaded online, potentially making it harder for universal facial classifiers to recognize them.
- Artist Style Protection: Artists can poison their work to prevent models from learning their unique style.
- Challenges: These positive applications face the "last mover advantage" problem, where the model trainer can adapt to adversarial techniques over time, rendering initial defenses obsolete.

Prompt Injection and Agentic Models

Prompt injection is a critical threat for agentic models that can take actions.

Threat Level: Carlini considers this one of the biggest challenges, as it directly impacts the trustworthiness of AI systems. Malicious actors could inject instructions to steal intellectual property, siphon funds, or send sensitive information.
Financial Motivation: Unlike "build a bomb" scenarios, prompt injection has clear financial incentives for adversaries.
Impact on Deployment: The security of prompt injection is seen as a primary blocker for the widespread deployment of agentic AI systems.
Current Mitigation: Many agentic systems currently rely on user confirmation for actions, but users tend to become desensitized and click "yes" without reading.
Future Outlook: Carlini is skeptical about quick solutions for prompt injection, comparing it to the control flow vs. data flow problem in traditional security. While architectures like Camel offer some protection, they have limitations. The challenge lies in separating instructions from data in natural language, which is inherently difficult.

Model Extraction and Distillation

Model extraction aims to steal the functionality or even the exact weights of a proprietary model.

Methodology: Attackers query a remote model, often around its decision boundaries, to infer its internal workings.
Adaptive Querying: Attackers adapt their queries based on the model's responses to reconstruct its behavior.
Exact Model Stealing: Carlini's research has shown that it's possible to steal exact model weights (up to permutations of neurons) under certain assumptions (e.g., specific architectures, sufficient queries, high precision inference).
Distillation for General Utility: A more common approach is to use distillation to train a smaller model that mimics the performance of a larger, proprietary model on the general data distribution. This is a significant concern for companies investing heavily in model training.
Defenses:
- API Restrictions: Limiting the information returned by an API (e.g., not providing full probability vectors) can hinder extraction.
- Output Noise: Adding noise to outputs can make extraction harder but risks degrading model utility.
- Bit-rate Limiting: Restricting the number of bits leaving the model per query.
- Defense in Depth: Employing multiple layers of security measures.

Protecting Training Data

Proprietary training data, especially sensitive information like patient records, needs protection.

Data Leakage: Models, particularly large ones, can inadvertently memorize and regurgitate training data. This is because the training objective often involves reconstructing or predicting parts of the training data.
Membership Inference Attacks: These attacks determine if a specific data point was part of the training set. By combining output generation with membership inference, attackers can extract training data.
Model Size and Memorization: Larger models, with more capacity, are more prone to memorizing training data.
Current State: Models trained on sensitive data should be treated with the same controls as the original data.
Differential Privacy: Offers provable guarantees against memorization but often leads to reduced model utility and slower training.
Practical Solutions: A common approach is to pre-train a general model on public data and then fine-tune it on private data. This reduces the amount of sensitive information the model needs to learn from scratch.

AI for Code Generation

AI models are becoming powerful tools for code generation, but security and quality remain concerns.

Capabilities: Models can generate functional code from natural language prompts, significantly accelerating development. They excel at repetitive tasks like API updates across a codebase.
Limitations: Current models are primarily next-token predictors and don't deeply "understand" code. This can lead to functional but insecure code.
Security Concerns:
- Insecure Code Generation: Most programmers write code that works but is insecure. AI models, by mimicking this, can generate insecure code at scale.
- Lack of Security Understanding: Users who are not computer science experts may not be able to identify security vulnerabilities in AI-generated code.
- Potential for Decreased Software Quality: Over-reliance on AI for code generation could lead to a decline in overall software security and quality.
Future of Software Development:
- Auditing vs. Writing: The role of developers may shift from writing code to auditing AI-generated code for bugs and security flaws.
- Automated Testing: AI can generate extensive test cases, which is a promising area for improving software quality.
- Human Oversight: Humans will likely remain crucial for understanding system behavior, architecting solutions, and ensuring security.

Computer Science Education in the Age of AI

The role of traditional computer science education is evolving.

Fundamental Understanding: Learning programming fundamentals (like C, assembly, algorithms) is still valuable for understanding how computers work and for developing critical thinking skills.
Performance Optimization: Knowledge of low-level details can be crucial for optimizing performance in specific scenarios (e.g., GPU programming).
Prompt Engineering: Understanding programming principles will help users craft more effective prompts for AI code generators.
Adaptability: The curriculum needs to adapt to incorporate AI tools and focus on higher-level problem-solving and auditing skills.

Career Advice in AI Security

Stay Curious and Adaptable: Be interested in new developments, even if they don't align with your initial preferences.
Focus on What's Being Used: Security research is most impactful when applied to technologies that are widely adopted, regardless of whether you personally believe they are the "optimal" approach.
Embrace the "Last Mover" Advantage: Understand that in security, the defender often has the advantage of adapting to new threats.
Security is an Evergreen Field: AI security is a rapidly growing and essential area with significant demand for expertise.
Problem-Solving is Key: The ultimate goal of a software engineer is to solve problems, not just write code. AI tools are changing the methods, but the core objective remains.

Conclusion

Nicholas Carlini's insights highlight the dynamic and challenging landscape of AI security. From the fundamental nature of adversarial examples and their transferability to the emerging threats of prompt injection and model extraction, the field is constantly evolving. While AI offers immense potential, ensuring its security and trustworthiness is paramount. The discussion underscores the critical need for continued research, robust defenses, and a thoughtful approach to AI education and development to navigate the complexities of this transformative technology.