How to detect and secure your applications from AI bots using F5 Distributed Cloud WAF
By F5 DevCentral Community
Key Concepts:
- AI Scraping Bots
- Robots.txt directives (User-agent, Disallow, Disallow AI-Training, Allow AI-Training)
- F5 Distributed Cloud Web Application Firewall (F5XC WAF)
- Scrapy (Python scraping library)
- Direct Response Route
- Client Blocking Rule
- Suspicious Bot Blocking
- Service Policy (TLS Fingerprint, User Agent)
1. Introduction to AI Scraping Concerns
- Web scraping is not new, but the use of scraped data to train AI models has increased concerns.
- Generative AI models can now retrieve dynamic content from websites using advancements like RAG (Retrieval-Augmented Generation), AI Agents, and MCP (Model Context Protocol).
2. Robots.txt and AI-Specific Directives
- IETF RFC 9309 (the Robots Exclusion Protocol) defines standard ways to control data access for automated bots using directives in robots.txt.
- A new RFC proposal suggests adding Disallow AI-Training (prohibits data access for AI training) and Allow AI-Training (permits data fetching for AI training) directives.
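Combining the standard and proposed directives named above, a robots.txt served to crawlers might look like the following sketch (the paths are illustrative, and the exact AI-Training syntax is still subject to the proposal):

```text
# robots.txt — sketch mixing RFC 9309 directives with the proposed AI-training extension
User-agent: *
Disallow: /private/

# Proposed extension (not part of RFC 9309):
Disallow AI-Training: /
Allow AI-Training: /public-data/
```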
3. F5XC WAF Mitigation Solutions
- The demo focuses on five mitigation solutions within F5 Distributed Cloud WAF (F5XC WAF).
4. Solution 1: Preventing Bots Obeying AI Directives
- Tool: The Python Scrapy library is used, which allows customization of user agents and robots.txt handling.
- Configuration: Scrapy's settings file is modified to include the new AI directives. Protego, Scrapy's default robots.txt parser, is used.
- Direct Response Route: F5XC is configured to serve a custom robots.txt file using a direct response route. This route specifies the content F5XC sends back to the client.
- Example: A direct response route is created to serve a robots.txt file with Disallow AI-Training.
- Demonstration:
- The crawler initially accesses the website content when not obeying robots.txt.
- When Scrapy is instructed to obey robots.txt, access is restricted.
- The robots.txt is modified to Allow AI-Training, and the crawler is able to access the content.
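The Scrapy side of this setup lives in the project's settings file. A minimal sketch (the setting names are from Scrapy's documentation; the user agent string is a hypothetical example):

```python
# settings.py — minimal Scrapy settings sketch for the crawler in the demo.
# Whether the crawler obeys robots.txt at all:
ROBOTSTXT_OBEY = True

# Scrapy's default robots.txt parser (Protego), stated explicitly here:
ROBOTSTXT_PARSER = "scrapy.robotstxt.ProtegoRobotParser"

# Hypothetical user agent string; Solution 3 later blocks on this header value.
USER_AGENT = "demo-ai-crawler/1.0 (+https://example.com/bot)"
```

Toggling ROBOTSTXT_OBEY is what switches the crawler between the "obeying" and "not obeying" runs shown in the demonstration.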
5. Solution 2: Mitigating Bots Obeying RFC 9309 (But Not AI Directives)
- Bots that obey RFC 9309 but do not support the new AI directives can be mitigated using the standard Disallow directive.
- Process: Modify the load balancer configuration to serve a robots.txt with the standard Disallow directive. Rerun the script; the request is blocked because the crawler obeys robots.txt (Scrapy's ROBOTSTXT_OBEY setting).
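The effect of a standard Disallow on an RFC 9309-compliant crawler can be checked locally with Python's standard-library robots.txt parser (the user agent and URL are illustrative):

```python
from urllib import robotparser

# The robots.txt content served by the direct response route in this solution:
# disallow everything for all user agents.
lines = [
    "User-agent: *",
    "Disallow: /",
]

rp = robotparser.RobotFileParser()
rp.parse(lines)

# A crawler that obeys RFC 9309 must not fetch any path.
print(rp.can_fetch("demo-ai-crawler/1.0", "https://example.com/page"))  # False
```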
6. Solution 3: Blocking Specific Scraping Tools (User Agent Blocking)
- Scenario: Bots don't obey robots.txt directives, and the user agent header is known.
- Process:
- Update the user agent field in Scrapy's settings.
- Configure an F5XC client blocking rule for the specific user agent header value.
- Rerun the script.
- Result: The request is blocked, and F5XC security logs confirm the block due to the client blocking rule.
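The logic of a client blocking rule amounts to an exact match on the User-Agent header. A minimal sketch of that check (this is an analogy, not the F5XC API; the blocked UA value is a hypothetical example):

```python
# Sketch of user-agent matching, analogous to an F5XC client blocking rule.
BLOCKED_USER_AGENTS = {"demo-ai-crawler/1.0"}  # hypothetical UA from the Scrapy settings

def is_blocked(headers: dict) -> bool:
    """Return True when the request's User-Agent is on the block list."""
    return headers.get("User-Agent", "") in BLOCKED_USER_AGENTS

print(is_blocked({"User-Agent": "demo-ai-crawler/1.0"}))  # True
print(is_blocked({"User-Agent": "Mozilla/5.0"}))          # False
```

This is also why the technique only works when the bot advertises a known user agent; Solution 5 covers bots that spoof the header.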
7. Solution 4: Blocking Entire Bot Categories (Suspicious Bot Blocking)
- Process: Configure F5XC WAF to block all suspicious bots.
- Demonstration: Rerunning the script results in the request being rejected by F5XC WAF. Security logs indicate the request was blocked because it was identified as coming from a suspicious bot.
8. Solution 5: Blocking Bots Masquerading as Legitimate Clients (TLS Fingerprint and User Agent)
- Process:
- Examine security logs, expand the JSON, and copy the TLS fingerprint of the bot to be blocked.
- Configure a service policy on the load balancer with two rules:
- One rule blocks requests matching the specific TLS fingerprint.
- A second rule allows all other requests.
- Optionally, add combinations of user agents, paths, etc., to the blocking rule.
- Save the changes and rerun the script.
- Result: The request is blocked. Event logs show the request was rejected due to the service policy.
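The two-rule service policy is evaluated first-match: block on the TLS fingerprint (optionally combined with user agent or path), otherwise allow. A sketch of that evaluation order (the fingerprint value is made up, and this models the policy's logic rather than F5XC's configuration schema):

```python
# First-match rule evaluation, mirroring the two-rule service policy:
# rule 1 denies a specific TLS fingerprint, rule 2 allows everything else.
RULES = [
    {"action": "deny",
     "tls_fingerprint": "771,4865-4866,23-65281,29-23-24,0",  # made-up JA3-style value
     "user_agent": None},  # None = match any user agent
    {"action": "allow"},   # catch-all
]

def evaluate(fingerprint: str, user_agent: str) -> str:
    """Return the action of the first rule whose conditions all match."""
    for rule in RULES:
        fp = rule.get("tls_fingerprint")
        ua = rule.get("user_agent")
        if fp is not None and fp != fingerprint:
            continue
        if ua is not None and ua != user_agent:
            continue
        return rule["action"]
    return "allow"

print(evaluate("771,4865-4866,23-65281,29-23-24,0", "curl/8.0"))  # deny
print(evaluate("some-other-fingerprint", "curl/8.0"))             # allow
```

Because the TLS fingerprint is derived from the client's TLS handshake rather than an HTTP header, it remains useful even when the bot spoofs a browser-like User-Agent.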
9. Conclusion
- The demo showcased different methods to block AI scraping bots using F5 Distributed Cloud WAF.
Technical Terms:
- RFC (Request for Comments): A formal document from the Internet Engineering Task Force (IETF) that defines standards and protocols for the internet.
- User Agent: A string of text that identifies the browser and operating system to the web server.
- TLS Fingerprint: A unique identifier for a TLS (Transport Layer Security) connection, used to identify specific clients or bots.
- Load Balancer: A device or software that distributes network traffic across multiple servers.
- Service Policy: A set of rules that define how network traffic is handled.
- Direct Response Route: A configuration that allows F5XC to directly respond to a request with a specified response body.