Cloudflare, a publicly traded cloud service provider, has introduced a new free tool designed to stop bots from scraping websites hosted on its platform for AI model training data.
Some AI companies, including Google, OpenAI, and Apple, allow website owners to block their data-scraping bots by updating their site’s robots.txt file, which dictates which pages bots can access. However, as Cloudflare points out, not all AI scrapers respect these directives.
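For context, opting out via robots.txt looks something like the snippet below. The user-agent tokens shown (GPTBot, Google-Extended, Applebot-Extended) are the ones OpenAI, Google, and Apple publish for AI training opt-outs; which tokens a site actually needs depends on which crawlers it wants to block.

```
# Opt out of AI training crawlers site-wide
User-agent: GPTBot             # OpenAI
Disallow: /

User-agent: Google-Extended    # Google AI training
Disallow: /

User-agent: Applebot-Extended  # Apple AI training
Disallow: /
```

The catch, as Cloudflare notes, is that these directives are purely advisory; nothing technically stops a crawler from fetching pages anyway.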
“Customers don’t want AI bots visiting their websites, especially those that do so dishonestly,” Cloudflare writes on its official blog. “We worry that some AI companies, determined to bypass rules, will continually adapt to evade bot detection.”
To tackle this issue, Cloudflare has analyzed AI bot and crawler traffic to fine-tune its automatic bot detection models. These models take into account, among other signals, whether an AI bot is trying to evade detection by mimicking the appearance and behavior of a regular web user.
“When bad actors attempt to crawl websites on a large scale, they generally use tools and frameworks that we can identify,” Cloudflare explains. “Based on these signals, our models can flag traffic from evasive AI bots as bots.”
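Cloudflare doesn't publish its model internals, but the general idea of signal-based scoring can be illustrated with a deliberately simplified sketch. Everything below is hypothetical: the signal names, weights, and threshold are invented for the example, and real detection systems combine far more fingerprinting signals.

```python
# Hypothetical illustration of signal-based bot scoring.
# This does NOT reflect Cloudflare's actual model; all signal
# names, weights, and the crawler list are invented.

KNOWN_AI_CRAWLERS = {"GPTBot", "CCBot", "ClaudeBot", "Bytespider"}

def bot_score(request: dict) -> float:
    """Return a score in [0, 1]; higher means more bot-like."""
    ua = request.get("user_agent", "")

    # Declared AI crawlers are the easy case: they identify themselves.
    if any(token in ua for token in KNOWN_AI_CRAWLERS):
        return 1.0

    score = 0.0
    # Evasive bots often run automation frameworks that leave
    # fingerprints, e.g. pages fetched without executing JavaScript.
    if not request.get("executed_javascript", False):
        score += 0.4
    # HTTP client libraries have distinctive TLS fingerprints.
    if request.get("tls_fingerprint") in {"python-requests", "curl"}:
        score += 0.4
    # Human browsing is bursty; large-scale crawls are metronomic.
    if request.get("requests_per_minute", 0) > 120:
        score += 0.2

    return min(score, 1.0)

if __name__ == "__main__":
    crawl = {
        "user_agent": "Mozilla/5.0",
        "executed_javascript": False,
        "tls_fingerprint": "python-requests",
        "requests_per_minute": 300,
    }
    print(bot_score(crawl))  # 1.0 -> flagged as a likely evasive bot
```

The point of the sketch is the approach Cloudflare describes: no single signal is conclusive, but a crawler that mimics a browser's user-agent string while betraying automation elsewhere can still be flagged.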
Cloudflare has also set up a form for hosts to report suspected AI bots and crawlers, and it says it will continue to manually blacklist such bots over time.
The rise of generative AI has increased the demand for training data, bringing the issue of AI bots into sharper focus. Many sites, wary of AI companies using their content without permission or compensation, have started blocking AI scrapers and crawlers. One study found that around 26% of the top 1,000 sites on the web have blocked OpenAI’s bot; another found that more than 600 news publishers had done the same.
Blocking is not foolproof, however. Some vendors seem to be ignoring standard bot exclusion rules to gain an edge in the AI race. AI search engine Perplexity was recently accused of impersonating legitimate visitors to scrape content from websites, and both OpenAI and Anthropic have reportedly ignored robots.txt rules at times.
In a letter to publishers last month, content licensing startup TollBit stated that it sees “many AI agents” ignoring the robots.txt standard.
Tools like Cloudflare’s could help, but only if they accurately detect covert AI bots. They also won’t resolve the bigger issue of publishers potentially losing referral traffic from AI tools like Google’s AI Overviews, which exclude sites that block specific AI crawlers.