We tried to jailbreak our AI (and Model Armor stopped it)

Key Concepts

AI Security: The practice of protecting AI applications from threats and vulnerabilities.
LLM (Large Language Model): A type of artificial intelligence model trained on vast amounts of text data, capable of understanding and generating human-like text.
OWASP Top 10 LLM Vulnerabilities: A list of the most critical security risks associated with LLMs.
Prompt Injection: A vulnerability where malicious users manipulate an LLM through crafted prompts to elicit unintended or harmful responses.
Sensitive Information Disclosure: A vulnerability where an AI application leaks confidential data.
Jailbreaking: An attempt to bypass an LLM's safety guardrails and instructions to generate prohibited content.
Model Armor: A Google Cloud AI security product designed to protect AI applications from various threats.
Data Loss Prevention (DLP): A feature that identifies and protects sensitive data.
Redaction: The process of obscuring or removing sensitive information from a response.
Malicious URLs: Web addresses known to host harmful content or engage in malicious activities.
API (Application Programming Interface): A set of rules and protocols that allows different software applications to communicate with each other.
Sanitize User Prompt: The process of cleaning and validating user input to prevent malicious attacks.
Response Analysis: The examination of an AI model's output to detect and mitigate risks.
Tokens: Units of text used by LLMs for processing and generation.

AI Security Threats and Model Armor

The video addresses the growing concern among developers about the security risks associated with AI applications, particularly Large Language Models (LLMs). While AI enhances user experiences, it also introduces new vulnerabilities.

OWASP Top 10 LLM Vulnerabilities

Aron highlights the OWASP Top 10 LLM vulnerabilities as a critical reference point for developers. Two key vulnerabilities discussed are:

Prompt Injection: Malicious users manipulate LLMs by crafting specific prompts to bypass intended behavior or extract sensitive information.
Sensitive Information Disclosure: AI applications inadvertently leak confidential data to users.

Demonstrating Threats in Action

A demonstration showcases a simple chat application connected to an LLM.

Jailbreaking Attempt: A user attempts to "jailbreak" the model by asking it to ignore previous instructions and provide dangerous advice, such as "how to rob a bank."
Model Armor's Intervention: In this scenario, the dangerous input was blocked before it reached the LLM. This prevents the model from processing harmful prompts and generating unsafe responses, saving computational resources and mitigating risks.

Model Armor: A Bodyguard for Your AI

Model Armor is introduced as a new AI security product on Google Cloud, acting as a "bodyguard" for LLMs. It protects both the AI model and its users.

Protecting Against Dangerous Outputs

Model Armor can also prevent the generation of unsafe outputs.

Sensitive Data Blocking: The example demonstrates setting up Model Armor to prevent the model from revealing Social Security numbers. When asked to read back a provided Social Security number, Model Armor detects the sensitive data in the model's intended response and blocks it, displaying an error message.

Sensitive Data Protection: Block or Redact

Model Armor offers two options for handling sensitive data in responses:

Block the entire response: As seen with the Social Security number example.
Redact sensitive data: This allows the response to be delivered to the user, but with the sensitive information masked.

Redaction Example: When asked to generate a 16-digit number (simulating a credit card number), Model Armor, configured with Data Loss Prevention (DLP), redacts the generated number from the response. The application still responds to the user, but no sensitive data is leaked. Model Armor can automatically identify and manage various types of sensitive data, reducing the burden on developers.

Protecting Against Malicious URLs

Model Armor leverages Google's extensive knowledge of harmful websites to protect AI applications.

Malicious URL Detection: A user attempts to trick the LLM into sharing a known malicious website address by framing it as a source of "delicious pasta recipes." Model Armor successfully identifies and blocks the dangerous URL before it can be shared with other users. This utilizes Google's ongoing efforts in filtering websites for its search engine and threat intelligence.

Setting Up and Using Model Armor

API-Based Integration

Model Armor is integrated into applications via API calls.

Input Validation: When an application receives user input, it calls the Model Armor API to check if the input is safe before sending it to the LLM.
Output Validation: Before an AI-generated response is sent back to the user, the application calls the Model Armor API again to ensure the output is safe.

Code Walkthrough

A code example illustrates the implementation of Model Armor in a "chat" endpoint:

Input Sanitization:
- The user's raw prompt is sent to Model Armor using a sanitize_user_prompt() function.
- The results are checked for policy violations.
- If a violation is found, a prompt_blocked flag is set to true.
- The LLM call is only made if prompt_blocked is false, preventing the model from processing bad prompts and exposing it to harmful content.
Response Analysis:
- Once a response is received from the LLM, it passes through a second security gate.
- Code is used to redact sensitive information from the output string, rather than blocking the entire response.
- An example shows an IP address being redacted.
- The system differentiates between unsafe responses (which are blocked) and responses with sensitive data (where only the sensitive portion is redacted).

Cross-Platform Compatibility

Model Armor's API-based nature means it can be used with AI models and applications running anywhere, not just on Google Cloud. Applications can be hosted on other cloud providers or on-premises, and still make API calls to Model Armor for protection.

Addressing Developer Concerns

Aron addresses common questions from developers:

Generative AI Model Guardrails vs. Model Armor

Built-in Guardrails: Generative AI models do have built-in protections, but they are not specialized for detecting sensitive data, malicious URLs, prompt injection, or jailbreaking attempts.
Model Armor's Specialization: Model Armor is specifically designed to address these specialized threats.

Prompt Engineering vs. Model Armor

Limitations of Prompt Engineering: While a specially engineered prompt might block some basic jailbreaking attempts, it's unlikely to detect and redact specific sensitive data like API keys or home addresses, or maintain an up-to-date list of millions of malicious URLs. Model Armor offers a more comprehensive and robust solution.

Customizing Safety Settings

Tailored Security: Safety requirements vary significantly between applications (e.g., a university research bot versus a family entertainment app). Model Armor allows developers to configure safety settings for each application individually, adjusting the strictness of filters.

Model Armor Templates and Customization

Pre-built and Custom Templates: Developers can choose from pre-built templates or create their own.
Configurable Protections: Model Armor provides control over the types of issues to protect against, including malicious URLs, prompt injection, jailbreaking, and sensitive data.
Confidence Levels: Developers can customize confidence levels for categories like hate speech and harassment, making enforcement stricter or more lenient as needed.

Pricing and Conclusion

Pricing Structure

Free Tier: Model Armor offers a free tier of 2 million tokens per month.
Paid Tier: After the free tier, the cost is 10 cents per million tokens.
Potential Discounts: Subscribers to Security Command Center may be eligible for a larger free tier or lower pricing.

Final Thoughts

Aron concludes by emphasizing that while AI and security can seem intimidating, tools like Model Armor make AI security more accessible and manageable. The goal is to empower developers to protect their models and users effectively.