I set up a robots.txt with a single entry: a disallow rule for one specific HTML file. That HTML file is not linked anywhere else, so it can only be discovered via the robots.txt. The file contains filler content and an <a> element with the nofollow attribute and a link target set. The link has no text content and no size, so it doesn't occupy any pixels a user could click on. Any request to that link gets reported to me.
At 2025-02-22 02:14:52, less than a day after I set up this honeypot, 66.249.68.37 (AS15169 GOOGLE) made a request to that invisible link, using the user agent string "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.6943.53 Mobile Safari/537.36 (compatible; GoogleOther)".
So now I have hard evidence that Google runs bots that ignore the rules in robots.txt but still parse URLs from it, request those resources anyway, and follow links marked as "nofollow".
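For anyone who wants to replicate the setup, it boils down to two small pieces; the file name and trap URL below are made-up placeholders, not the ones actually used:

    # robots.txt -- the only entry points at a page that is linked nowhere else
    User-agent: *
    Disallow: /do-not-crawl-this.html

    <!-- do-not-crawl-this.html: a zero-size nofollow link with no text, which
         only a bot that parses robots.txt and ignores nofollow would request -->
    <a href="/trap-endpoint" rel="nofollow"></a>

The trap endpoint then only has to log the request (timestamp, IP, ASN, user agent) somewhere you will actually see it.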
It's already well known that none of the search engines, crawlers, or companies honors robots.txt (anymore).
It's just there for us oldies who prefer the ethical web, where people respected the RFCs they had collectively written.
It's one thing to know they don't honor it; it's another to know they intentionally abuse it.
Did any other crawlers hit it? Would be nice to have someone keeping track of this over time and publishing the data online somewhere.
Probably need to collect logs across multiple sites though, as some crawlers would exclude your domain once they knew.
Only one other hit:
2025-03-05 13:32:09 45.89.148.57 (AS46844 SHARKTECH) "Mozilla/5.0 (Windows; U; Windows NT 6.1; pt-BR; rv:1.9.2.18) Gecko/20110614 Firefox/3.6.18 (.NET CLR 3.5.30729)"
Could you share the code for that? Thinking I could set up a fail2ban rule for these IPs.
It's a pathetic piece of PHP code:
Pathetic? Looks nice and straightforward to me!
I manage a few thousand client sites spread across a bunch of different infrastructures, but one thing they all have in common is a specific set of robots rules to block AI crawlers.
Not a single one of them respects that request anymore. Anthropic is the worst at the moment. It starts hitting the site with an Anthropic user agent. You can literally see it hit the robots file, then it changes its user agent to a generic browser one and carries on.
I say 'at the moment' because they're all doing this kind of crap; they seem to take it in turns ramping up and hammering servers.
> Not a single one of them respects that request anymore
AI companies have seen the prisoner's dilemma of good internet citizenship and slammed hard on the "defect" button. They plan to steal your content, put it in an attribution-removing blender, then sell it to other people. With the explicit purpose of replacing human contribution on the internet and in the workplace. After all, they represent a reality-distorting amount of capital, so why should any rules apply to them?
> I say 'at the moment' because they're all doing this kind of crap...
Maybe they used AI to code their agent, and it's just not that good.
This headline sucks. It suggests that AI crawlers' bad behavior isn't intentional. Robots.txt has existed for ages, and crawler operators know exactly how to play nice. They simply choose not to. Media outlets need to stop attributing ignorance or incompetence to what is clearly willful action. These megacorps need to be held accountable, even on the little things.
I’m not sure whether you’re familiar with the Register or not, but they definitely get it and aren’t being nice. They’re being sarcastic.
> It suggests that AI crawlers' bad behavior isn't intentional.
The robots.txt is only part of the problem; simply overloading sites is another. Maybe you don't really mind that your site is getting scraped, but you do mind that it crashes due to poorly written scrapers.
It was always understood that your scrapers should limit their load to something a site can handle. Google has always excelled at this; I've never experienced the GoogleBot being an issue, nor the BingBot. The developers who work at the AI companies, however, are less talented and less caring. If their bot crashes a site, meeeh, try again later.
I don't think they DDoS sites deliberately, they just don't have the skills to fix their shitty scraper.
If we want the problem to go away, ask AWS, Azure, GCP, Alibaba Cloud and others for a way to report bad behaviour. Make their hosting providers take action.
I will go one step further:
It doesn't matter whether they have the skills or not, because they don't care.
As long as they can siphon up enough data to increase their revenue or stock price by 0.001¢, they're perfectly content to leave small independent websites shattered and unusable under the weight of their aggressive, unrelenting scraper bots.
I'm firmly of the opinion that they should all be considered hostile, predatory companies, regardless of what supposed "value" they bring.
Train your river-guzzling models with data that's paid for or given voluntarily and then we can talk about your "value".
Technically it's true: they have not learned, but only because nobody taught them to. Because nobody ever slapped any over-reaching wrist, and if somebody even thinks about it, then it's all "damn EU bureaucrats".
This is going to drive websites to put up even more Cloudflare captcha-walls before you can even see the content. The web will become less accessible.
I have a domain parked with just a blank page that's getting 172,000 requests per month. Since it's on a free-tier static hosting, that costs me nothing. However, if that trend continues, I don't know how long we will have those free static hosting options available for everyone.
Send a zip bomb
It didn't seem to matter what response I sent to PerplexityBot, the requests kept coming. I guess it's not surprising that a badly written bot isn't monitored either.
There was the tarpit that sends out Markov-chain-generated random words, loading very slowly. If the client has a limited number of concurrent requests, that could theoretically choke the client.
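Not the actual tarpit being referenced, just the general idea, which is simple enough to sketch. This toy version drips random filler words (a real one would generate Markov-chain text) on a hand-picked port with arbitrary delays:

    # Toy tarpit: serve an endless stream of filler words, a few bytes at a
    # time, to occupy a crawler's connection slots. Everything here is arbitrary.
    import random
    import time
    from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

    WORDS = ["lorem", "ipsum", "dolor", "sit", "amet", "tarpit", "crawler"]

    class TarpitHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            try:
                while True:
                    chunk = " ".join(random.choices(WORDS, k=5)) + " "
                    self.wfile.write(chunk.encode())
                    self.wfile.flush()
                    time.sleep(10)  # drip-feed so the connection stays open for ages
            except (BrokenPipeError, ConnectionResetError):
                pass  # the client finally gave up

    if __name__ == "__main__":
        ThreadingHTTPServer(("", 8080), TarpitHandler).serve_forever()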
> a badly written bot
I think that it's fair to say that these bots aren't badly written. Their behavior is entirely intentional.
Can you use hashcash or some proof of work system somehow? I don't know anything about this space.
They were only requesting the HTML, not using a headless browser.
Is there any indication that the owners of those crawlers intend for them to play nice?
I think this, from the article, is telling:
> The Register asked Schubert about this in early January. "Funnily enough, a few days after the post went viral, all crawling stopped," he responded at the time. "Not just on the Diaspora wiki, but on my entire infrastructure. I'm not entirely sure why, but here we are."
So the owners of those crawlers certainly have the ability to stop the traffic; they're just choosing not to... until publicity makes their actions risky.
If search engines were invented today they wouldn't stand a chance. People would complain that they can return immoral results. Different sets of people would complain that they can return links to various isms that offend them. The owners would be worried that they return results critical of themselves or their friends. Copyright owners would sue them for linking to piracy. Website owners would block them for reading their precious content.
As someone who has done a tiny bit of scraping, I've been surprised how difficult it is to be a good citizen.
Are there packages you know of, in the major language ecosystems, that you can easily point at both a robots.txt and an index.html (you need both to be compliant; most packages I've found only look at robots.txt) to get an answer on what you can and can't do, and at what rate?
The only one I'm aware of is a C# package which still requires you to do a lot of the heavy lifting[0].
[0] https://github.com/TurnerSoftware/RobotsExclusionTools
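For lack of a ready-made package, here is a rough standard-library sketch of the two checks being asked about: robots.txt plus the page's robots meta tag. "ExampleBot" is a placeholder token, and X-Robots-Tag response headers and per-link rel=nofollow are left out:

    # Sketch: combine robots.txt rules with the page's own robots meta tag.
    # Standard library only; no rate limiting, no X-Robots-Tag header handling.
    import urllib.parse
    import urllib.request
    import urllib.robotparser
    from html.parser import HTMLParser

    UA = "ExampleBot"  # placeholder user agent token

    class RobotsMetaParser(HTMLParser):
        # Collects directives from <meta name="robots"> (or bot-specific) tags.
        def __init__(self):
            super().__init__()
            self.directives = set()

        def handle_starttag(self, tag, attrs):
            a = dict(attrs)
            if tag == "meta" and (a.get("name") or "").lower() in ("robots", UA.lower()):
                self.directives |= {d.strip().lower()
                                    for d in (a.get("content") or "").split(",")}

    def may_fetch_and_index(url):
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(urllib.parse.urljoin(url, "/robots.txt"))
        rp.read()
        if not rp.can_fetch(UA, url):
            return False, "disallowed by robots.txt"
        meta = RobotsMetaParser()
        meta.feed(urllib.request.urlopen(url).read().decode("utf-8", "replace"))
        if "noindex" in meta.directives or "none" in meta.directives:
            return False, "noindex robots meta tag"
        return True, "ok (crawl-delay: %s)" % rp.crawl_delay(UA)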
You obviously should read the robots.txt and follow it, but it's just as much about not absolutely hammering a site over and over again, 24/7. Your scraper needs to learn about update patterns and adjust its scraping interval. It needs to back off if the response times increase.
People are bringing up robots.txt because they want to ban AI scrapers, but if those same scrapers didn't constantly pound sites into the ground, it would be less of an issue.
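A sketch of that kind of adaptive pacing, with made-up thresholds; the point is just that the interval should grow quickly when responses slow down or fail, and shrink only slowly once the site looks healthy again:

    # Sketch: widen the crawl interval when the server slows down, tighten it
    # gradually when it recovers. All numbers here are arbitrary.
    import time
    import urllib.request

    def polite_crawl(urls, base_delay=5.0, max_delay=600.0):
        # Yields (url, body) pairs while adapting the delay between requests.
        delay = base_delay
        baseline = None  # rolling estimate of a "healthy" response time
        for url in urls:
            start = time.monotonic()
            try:
                with urllib.request.urlopen(url, timeout=30) as resp:
                    body = resp.read()
            except OSError:
                delay = min(delay * 2, max_delay)  # errors or timeouts: back off hard
                time.sleep(delay)
                continue
            elapsed = time.monotonic() - start
            baseline = elapsed if baseline is None else 0.9 * baseline + 0.1 * elapsed
            if elapsed > 2 * baseline:
                delay = min(delay * 2, max_delay)     # responses slowing down: widen
            else:
                delay = max(base_delay, delay * 0.9)  # healthy: recover slowly
            yield url, body
            time.sleep(delay)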
https://github.com/google/robotstxt
You're proving my point. That's a reference C++ parser that only parses robots.txt files. It doesn't parse meta tags and doesn't help with writing crawlers running in any of the major language ecosystems used by crawler authors.
It's much more difficult than it should be for scrapers who want to be compliant to actually be compliant.
The crawler operators don't even care about copyright or its deep-pocketed owners. They will certainly not care about playing nice with random websites, or about any conventions that may exist in any domain.
That's intentional.
They take the data and deal with the consequences later, because so far there haven't been any. I hope data poisoning sees more traction as a possible countermeasure.
I wonder what their data cleaning process looks like. If I put a lot of furry porn on my website, would it get excluded from the data sets?
Probably only from those data sets which do not include furry porn, but it's also possible that it could be used in a negative data set.
I've personally noticed loads of crawls too, but what has been interesting is they have been crawling effectively the same pages, over and over, for months. Also, they don't seem to be caching DNS at all.
I guess I'm in the minority that I don't care too much about crawlers, even the use of my work for AI, but I think the current hit rates just can't be justified.
> SourceHut says it's getting DDoSed by LLM bots
How do they know it's LLM "bots"? Did they provide any proof? Is it crawlers or automated user agents? The guy behind SH hates all "AI" and is vocal about it; it would be reassuring to see some evidence. All I've seen is them saying "it's those pesky AI bots!"
Compared to others in the comments (with "a few thousand sites"!!) we're obviously small beer, with maybe 50 or so - but we are being botted to death and it's driving us fucking nuts :-/
If you rip everyone off anyway, why would you bother doing it “the nice way”?
I just shut my entire web site down after 27 years this year. All traffic was AI crawlers. Fuck it!
Why not install Nepenthes and let them bathe in the tar?
Spend more time and money? Nope. It's on CloudFront + S3 at the moment.
Why don’t they use an anti-bot solution like Cloudflare? It won’t block everything entirely, but it will at least stop low-quality bots
I just updated my robots.txt to block AI crawlers, just so I can join the inevitable future class action lawsuit.
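For reference, such a block usually looks something like the following. The user agent tokens listed are the commonly published ones and change over time, so check each vendor's documentation; and, as this whole thread shows, compliance is entirely voluntary:

    User-agent: GPTBot
    User-agent: ClaudeBot
    User-agent: anthropic-ai
    User-agent: Google-Extended
    User-agent: CCBot
    User-agent: PerplexityBot
    User-agent: Bytespider
    Disallow: /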