How to detect and secure your applications from AI bots using F5 Distributed Cloud WAF
By F5 DevCentral Community
Key Concepts:
- AI Scraping Bots
- Robots.txt directives (User-agent, Disallow, Disallow AI-Training, Allow AI-Training)
- F5 Distributed Cloud Web Application Firewall (F5XC WAF)
- Scrapy (Python scraping library)
- Direct Response Route
- Client Blocking Rule
- Suspicious Bot Blocking
- Service Policy (TLS Fingerprint, User Agent)
1. Introduction to AI Scraping Concerns
- Web scraping is not new, but the use of scraped data to train AI models has increased concerns.
- Generative AI models can now retrieve dynamic content from websites using advancements like RAG (Retrieval-Augmented Generation), AI Agents, and MCP (Model Context Protocol).
2. Robots.txt and AI-Specific Directives
- IETF RFC 9309 (the Robots Exclusion Protocol) defines standard ways to control data access for automated bots using directives in robots.txt.
- A new RFC proposal suggests adding Disallow AI-Training (prohibits data access for AI training) and Allow AI-Training (permits data fetching for AI training) directives.
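Combining the standard and proposed directives named above, a robots.txt served to crawlers might look like the following sketch (the paths are illustrative, and the exact AI-Training syntax is still subject to the proposal):

```text
# robots.txt — sketch mixing RFC 9309 directives with the proposed AI-training extension
User-agent: *
Disallow: /private/

# Proposed extension (not part of RFC 9309):
Disallow AI-Training: /
Allow AI-Training: /public-data/
```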
3. F5XC WAF Mitigation Solutions
- The demo focuses on five mitigation solutions within F5 Distributed Cloud WAF (F5XC WAF).
4. Solution 1: Preventing Bots Obeying AI Directives
- Tool: The Python Scrapy library is used, which allows customization of user agents and robots.txt handling.
- Configuration: Scrapy's settings file is modified to include the new AI directives. Protego, Scrapy's default robots.txt parser, is used.
- Direct Response Route: F5XC is configured to serve a custom robots.txt file using a direct response route. This route specifies the content F5XC sends back to the client.
- Example: A direct response route is created to serve a robots.txt file with Disallow AI-Training.
- Demonstration:
- The crawler initially accesses the website content when not obeying robots.txt.
- When Scrapy is instructed to obey robots.txt, access is restricted.
- The robots.txt is modified to Allow AI-Training, and the crawler is able to access the content.
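The Scrapy side of this setup lives in the project's settings file. A minimal sketch (the setting names are from Scrapy's documentation; the user agent string is a hypothetical example):

```python
# settings.py — minimal Scrapy settings sketch for the crawler in the demo.
# Whether the crawler obeys robots.txt at all:
ROBOTSTXT_OBEY = True

# Scrapy's default robots.txt parser (Protego), stated explicitly here:
ROBOTSTXT_PARSER = "scrapy.robotstxt.ProtegoRobotParser"

# Hypothetical user agent string; Solution 3 later blocks on this header value.
USER_AGENT = "demo-ai-crawler/1.0 (+https://example.com/bot)"
```

Toggling ROBOTSTXT_OBEY is what switches the crawler between the "obeying" and "not obeying" runs shown in the demonstration.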
5. Solution 2: Mitigating Bots Obeying RFC 9309 (But Not AI Directives)
- Bots that obey RFC 9309 but do not support the new AI directives can be mitigated using the standard Disallow directive.
- Process: Modify the load balancer configuration to serve a robots.txt with the standard Disallow directive. Rerun the script; the request is blocked because the crawler obeys robots.txt (Scrapy's ROBOTSTXT_OBEY setting).
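The effect of a standard Disallow on an RFC 9309-compliant crawler can be checked locally with Python's standard-library robots.txt parser (the user agent and URL are illustrative):

```python
from urllib import robotparser

# The robots.txt content served by the direct response route in this solution:
# disallow everything for all user agents.
lines = [
    "User-agent: *",
    "Disallow: /",
]

rp = robotparser.RobotFileParser()
rp.parse(lines)

# A crawler that obeys RFC 9309 must not fetch any path.
print(rp.can_fetch("demo-ai-crawler/1.0", "https://example.com/page"))  # False
```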
6. Solution 3: Blocking Specific Scraping Tools (User Agent Blocking)
- Scenario: Bots don't obey robots.txt directives, and the user agent header is known.
- Process:
- Update the user agent field in Scrapy's settings.
- Configure an F5XC client blocking rule for the specific user agent header value.
- Rerun the script.
- Result: The request is blocked, and F5XC security logs confirm the block due to the client blocking rule.
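The logic of a client blocking rule amounts to an exact match on the User-Agent header. A minimal sketch of that check (this is an analogy, not the F5XC API; the blocked UA value is a hypothetical example):

```python
# Sketch of user-agent matching, analogous to an F5XC client blocking rule.
BLOCKED_USER_AGENTS = {"demo-ai-crawler/1.0"}  # hypothetical UA from the Scrapy settings

def is_blocked(headers: dict) -> bool:
    """Return True when the request's User-Agent is on the block list."""
    return headers.get("User-Agent", "") in BLOCKED_USER_AGENTS

print(is_blocked({"User-Agent": "demo-ai-crawler/1.0"}))  # True
print(is_blocked({"User-Agent": "Mozilla/5.0"}))          # False
```

This is also why the technique only works when the bot advertises a known user agent; Solution 5 covers bots that spoof the header.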
7. Solution 4: Blocking Entire Bot Categories (Suspicious Bot Blocking)
- Process: Configure F5XC WAF to block all suspicious bots.
- Demonstration: Rerunning the script results in the request being rejected by F5XC WAF. Security logs indicate the request was blocked because it was identified as coming from a suspicious bot.
8. Solution 5: Blocking Bots Masquerading as Legitimate Clients (TLS Fingerprint and User Agent)
- Process:
- Examine security logs, expand the JSON, and copy the TLS fingerprint of the bot to be blocked.
- Configure a service policy on the load balancer with two rules:
- One rule blocks requests matching the specific TLS fingerprint.
- A second rule allows all other requests.
- Optionally, add combinations of user agents, paths, etc., to the blocking rule.
- Save the changes and rerun the script.
- Result: The request is blocked. Event logs show the request was rejected due to the service policy.
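The two-rule service policy is evaluated first-match: block on the TLS fingerprint (optionally combined with user agent or path), otherwise allow. A sketch of that evaluation order (the fingerprint value is made up, and this models the policy's logic rather than F5XC's configuration schema):

```python
# First-match rule evaluation, mirroring the two-rule service policy:
# rule 1 denies a specific TLS fingerprint, rule 2 allows everything else.
RULES = [
    {"action": "deny",
     "tls_fingerprint": "771,4865-4866,23-65281,29-23-24,0",  # made-up JA3-style value
     "user_agent": None},  # None = match any user agent
    {"action": "allow"},   # catch-all
]

def evaluate(fingerprint: str, user_agent: str) -> str:
    """Return the action of the first rule whose conditions all match."""
    for rule in RULES:
        fp = rule.get("tls_fingerprint")
        ua = rule.get("user_agent")
        if fp is not None and fp != fingerprint:
            continue
        if ua is not None and ua != user_agent:
            continue
        return rule["action"]
    return "allow"

print(evaluate("771,4865-4866,23-65281,29-23-24,0", "curl/8.0"))  # deny
print(evaluate("some-other-fingerprint", "curl/8.0"))             # allow
```

Because the TLS fingerprint is derived from the client's TLS handshake rather than an HTTP header, it remains useful even when the bot spoofs a browser-like User-Agent.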
9. Conclusion
- The demo showcased different methods to block AI scraping bots using F5 Distributed Cloud WAF.
Technical Terms:
- RFC (Request for Comments): A formal document from the Internet Engineering Task Force (IETF) that defines standards and protocols for the internet.
- User Agent: A string of text that identifies the browser and operating system to the web server.
- TLS Fingerprint: A unique identifier for a TLS (Transport Layer Security) connection, used to identify specific clients or bots.
- Load Balancer: A device or software that distributes network traffic across multiple servers.
- Service Policy: A set of rules that define how network traffic is handled.
- Direct Response Route: A configuration that allows F5XC to directly respond to a request with a specified response body.