How to defend your sites from AI bots — David Mytton, Arcjet
By AI Engineer
Key Concepts
- Automated Clients (Bots): Malicious and benign bots that access websites.
- AI Crawlers: Bots used for training AI models or providing real-time data to LLMs.
- User Agent: HTTP header identifying the client making the request.
- IP Reputation: Assessing the trustworthiness of an IP address based on its history and network association.
- Proof of Work: Requiring clients to solve computational puzzles before accessing resources.
- HTTP Message Signatures: Cryptographic signatures included in requests to verify client identity.
- Fingerprinting: Identifying clients based on unique characteristics of their network requests.
- Rate Limiting: Restricting the number of requests a client can make within a given timeframe.
The Problem of Automated Traffic
- High Percentage of Bot Traffic: Almost 50% of web traffic comes from automated clients, rising to around 60% in gaming.
- Expensive Requests: Bots generate numerous requests, increasing costs, especially on serverless platforms.
- Bandwidth Consumption: Bots download large files, consuming bandwidth and impacting resources for legitimate users.
- Denial of Service (DoS): Excessive bot traffic can overwhelm servers, causing service unavailability.
- AI Exacerbates the Issue: AI crawlers contribute significantly to automated traffic, with examples like GPTBot consuming 24% of traffic on Diaspora.
- Real-World Impact: Read the Docs reduced bandwidth usage from 800GB to 200GB per day by blocking AI crawlers. Wikipedia attributes up to 35% of its traffic to automated clients.
Good Bots vs. Bad Bots vs. AI Crawlers
- Good Bots: Essential for search engine indexing (e.g., Googlebot), providing benefits like search visibility and traffic.
- Bad Bots: Malicious scrapers that download content without permission, causing resource drain.
- AI Crawlers: A gray area, with varying purposes and impacts.
- Training Bots (GPTBot): Used to build AI models without direct benefit to the site owner.
- Search Indexing Bots (OAI-SearchBot): Similar to Googlebot, indexing content for search results within AI platforms, potentially driving traffic and citations.
- Chat Completion Bots (ChatGPT-User): Accessing content in real time to answer user queries, potentially beneficial if users are legitimately using LLMs.
- Computer Use / Operator Bots: Autonomous agents acting on behalf of users, posing challenges in determining legitimate vs. malicious use (e.g., buying concert tickets for resale).
Defenses Against Malicious Bots
1. Robots.txt
- Description: A voluntary standard for instructing crawlers on which parts of a website to access.
- Functionality: Allows site owners to specify allowed and disallowed paths for different crawlers.
- Limitations: Not enforced, often ignored by malicious bots, and sometimes used to identify restricted content.
- Value: A good starting point for understanding site structure and defining crawler behavior.
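As a sketch, a robots.txt that welcomes search crawlers but opts out of AI training (the user agent tokens shown are the ones Google and OpenAI publish for their crawlers) might look like:

```text
# Allow search indexing
User-agent: Googlebot
Allow: /

# Opt out of AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Everyone else: stay out of private paths
User-agent: *
Disallow: /admin/
```

Remember this is purely advisory: well-behaved crawlers honor it, but nothing stops a bot from ignoring it entirely.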
2. User Agent Analysis
- Description: Examining the User-Agent HTTP header to identify the client making the request.
- Functionality: Creating rules based on User-Agent strings to allow or block specific bots.
- Limitations: Easily spoofed, as clients can set any arbitrary string for the User-Agent.
- Resources: Arcjet provides an open-source project with thousands of User-Agent strings for rule creation.
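A minimal sketch of user-agent rules in Python, assuming a small hand-picked denylist (real lists, such as Arcjet's open-source one, contain thousands of entries):

```python
import re

# Illustrative denylist; production lists contain thousands of patterns.
BLOCKED_UA_PATTERNS = [
    re.compile(r"GPTBot", re.IGNORECASE),
    re.compile(r"CCBot", re.IGNORECASE),
]

def is_blocked_user_agent(user_agent: str) -> bool:
    """Return True if the User-Agent matches a blocked pattern.

    The header is entirely client-controlled, so this only stops
    bots that identify themselves honestly.
    """
    return any(p.search(user_agent) for p in BLOCKED_UA_PATTERNS)
```

Because the header is trivially spoofed, this check is best treated as a first filter, not a verdict.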
3. IP Verification
- Description: Verifying the authenticity of a request by performing a reverse DNS lookup on the source IP address.
- Functionality: Confirming that the IP address belongs to the service claimed in the User-Agent (e.g., Google, Apple, OpenAI).
- Value: Effective for identifying legitimate crawlers from known services.
4. IP Reputation Analysis
- Description: Assessing the trustworthiness of an IP address based on its history, network association, and geolocation.
- Functionality: Identifying suspicious traffic originating from data centers, VPNs, proxies, or specific countries.
- Data Sources: MaxMind and IPinfo are popular providers of IP geolocation and reputation data.
- Challenges: Geolocation data can be inaccurate due to satellite and cellular connectivity. Proxy services can mask the origin of traffic.
- Example: 12% of bot traffic on Cloudflare's network originated from AWS.
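One simple reputation signal is whether an IP falls inside known data-center address space, since ordinary users rarely browse from AWS. A sketch, assuming illustrative CIDR ranges (real deployments pull these from providers like MaxMind or IPinfo, or from cloud providers' published range feeds):

```python
import ipaddress

# Example ranges only; source real data from MaxMind, IPinfo,
# or cloud providers' published IP range feeds.
DATACENTER_RANGES = [
    ipaddress.ip_network("3.0.0.0/9"),     # example AWS-attributed range
    ipaddress.ip_network("34.64.0.0/10"),  # example GCP-attributed range
]

def is_datacenter_ip(ip: str) -> bool:
    """Flag traffic originating from known data-center address space."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in DATACENTER_RANGES)
```

A hit doesn't prove malice (legitimate crawlers also run in data centers), so this signal is typically combined with the verification checks above.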
5. CAPTCHAs and Proof of Work
- CAPTCHAs:
- Description: Challenges designed to distinguish between humans and bots.
- Limitations: Increasingly easy for AI to solve, rendering them less effective.
- Proof of Work:
- Description: Requiring clients to perform computational tasks before accessing resources.
- Functionality: Making it expensive for bots to crawl large numbers of websites.
- Incentive Considerations: May not deter attacks if the potential profit outweighs the cost of solving the puzzle.
- Accessibility Concerns: Difficult CAPTCHAs can create accessibility issues for legitimate users.
- Open Source Projects: Anubis and go-away are proxies that implement proof-of-work challenges; Nepenthes takes a related tarpit approach, trapping crawlers in endless generated content.
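The proof-of-work idea can be sketched with a hash-based puzzle: the server issues a random challenge, the client searches for a nonce whose hash has a required number of leading zero bits, and the server verifies with a single cheap hash. Doubling the difficulty doubles the client's cost, which is how the scheme penalizes bots crawling at scale:

```python
import hashlib
import os

def make_challenge() -> bytes:
    """Server side: issue a random challenge."""
    return os.urandom(16)

def check(challenge: bytes, nonce: int, difficulty: int = 12) -> bool:
    """Server side: verification is one cheap hash."""
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") >> (256 - difficulty) == 0

def solve(challenge: bytes, difficulty: int = 12) -> int:
    """Client side: brute-force a nonce with `difficulty` leading zero bits.

    Expected cost is 2**difficulty hash attempts, so each extra bit
    doubles the work a crawler must do per page.
    """
    nonce = 0
    while not check(challenge, nonce, difficulty):
        nonce += 1
    return nonce
```

The asymmetry is the point: verification costs the server one hash, while solving costs the client exponentially more as difficulty rises. It is negligible for one human visit but expensive across millions of crawled pages.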
6. HTTP Message Signatures
- Description: A proposed standard where each request includes a cryptographic signature for client verification.
- Functionality: Allowing website owners to quickly verify the identity of automated clients.
- Status: Still under development, with questions about its advantages over IP verification.
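A heavily simplified sketch of the signing flow, loosely modeled on the HTTP Message Signatures standard (RFC 9421). The real proposal canonicalizes a configurable set of "covered components" and favors asymmetric keys so crawler operators can publish public keys; HMAC with a shared key stands in here for brevity:

```python
import base64
import hashlib
import hmac

def signature_base(method: str, path: str, created: int) -> bytes:
    """Simplified signature base covering just method, path, and time.

    The real standard canonicalizes an agreed list of covered
    components (headers, derived components, creation time, etc.).
    """
    return (
        f'"@method": {method}\n'
        f'"@path": {path}\n'
        f'"created": {created}'
    ).encode()

def sign(key: bytes, method: str, path: str, created: int) -> str:
    mac = hmac.new(key, signature_base(method, path, created), hashlib.sha256)
    return base64.b64encode(mac.digest()).decode()

def verify(key: bytes, method: str, path: str, created: int, sig: str) -> bool:
    return hmac.compare_digest(sign(key, method, path, created), sig)
```

Including the creation time in the signed material limits replay: a captured signature only validates for that exact timestamp, which the server can bound-check.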
7. Private Access Tokens (Privacy Pass)
- Description: A system developed by Apple that allows website owners to verify that a request is coming from a browser owned by an iCloud subscriber.
- Functionality: Reducing the number of CAPTCHAs presented to legitimate users.
- Limitations: Limited adoption outside the Apple ecosystem.
8. Fingerprinting
- Description: Identifying clients based on unique characteristics of their network requests.
- Functionality: Creating a hash (fingerprint) based on TLS or HTTP headers to track clients across multiple IP addresses.
- Techniques:
- JA4 Hash: Open-source TLS fingerprinting based on network-level characteristics.
- HTTP Fingerprinting: Proprietary method analyzing HTTP headers and request characteristics.
- Value: Effective for blocking clients regardless of IP address changes.
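A toy illustration of the HTTP side of fingerprinting: hashing the header *names and their order* (values vary per request, but clients built on the same HTTP stack tend to send the same headers in the same order). This is only a sketch of the idea; JA4 works at the TLS layer on handshake parameters rather than HTTP headers:

```python
import hashlib

def http_fingerprint(headers: list[tuple[str, str]]) -> str:
    """Hash header names and their order, ignoring values.

    The same client implementation produces the same hash across
    requests, so the fingerprint survives IP address rotation.
    """
    material = ",".join(name.lower() for name, _ in headers)
    return hashlib.sha256(material.encode()).hexdigest()[:16]
```

Two requests from the same client stack hash identically even with different header values, while a different client (or a bot reordering headers) produces a different fingerprint that can be blocked wholesale.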
9. Rate Limiting
- Description: Restricting the number of requests a client can make within a given timeframe.
- Implementation: Applying quotas based on user session ID or fingerprint (e.g., JA4 hash).
- Importance of Keying: Rate limiting based solely on IP addresses is ineffective due to IP address changes and botnets.
Conclusion
Combating malicious bots requires a layered approach, starting with robots.txt for good bots and progressing to more sophisticated techniques like user agent analysis, IP verification, IP reputation analysis, proof of work, HTTP signatures, and fingerprinting combined with rate limiting. While no single defense is foolproof, a combination of these methods can significantly reduce the impact of automated traffic and protect website resources. The choice of defenses depends on the specific needs and resources of the website owner.