Blackhole all their ASNs and be done with it. It sounds like the crawlers are not adding anything back to SourceHut. This exercise takes a while the first time around: take one IP from each offending CIDR block, look up all of the IPs [1][2] that network owns, aggregate them [3] with a script that summarizes them into larger CIDR blocks, and then have a cron "@reboot" job or systemd unit file run a script that performs:
ip route add blackhole "${ClownNet}" # in a loop through a text file
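Roughly what such a script can look like, as a minimal sketch rather than the author's actual setup. The file path, the placeholder ASN (AS64500), and the use of the RADb whois mirror plus aggregate(1) for the one-time prefix lookup are all assumptions:

  #!/bin/sh
  # Hypothetical /usr/local/sbin/blackhole.sh. Run "blackhole.sh add" from a
  # cron "@reboot" entry or a systemd oneshot unit; keep the CIDR list in git.
  BLOCKFILE="/etc/blackhole/networks.txt"   # illustrative path

  # First-time exercise: dump the IPv4 prefixes an ASN originates (RADb
  # mirrors the data behind [1]/[2]) and summarize them [3]. AS64500 is a
  # documentation ASN used here purely as a placeholder.
  build() {
      whois -h whois.radb.net -- '-i origin AS64500' \
          | awk '/^route:/ {print $2}' \
          | aggregate >> "$BLOCKFILE"       # aggregate(1) or any summarizer
  }

  # Boot time: loop through the text file and add one blackhole route per CIDR.
  add() {
      while read -r ClownNet; do
          case "$ClownNet" in ""|\#*) continue ;; esac   # skip blanks/comments
          ip route add blackhole "$ClownNet" 2>/dev/null || true
      done < "$BLOCKFILE"
  }

  # The delete-everything function mentioned below, for quick rollbacks.
  flush() {
      ip -4 route show type blackhole | awk '{print $2}' \
          | xargs -r -n 1 ip route del blackhole
  }

  case "$1" in
      build) build ;;
      add)   add ;;
      flush) flush ;;
      *)     echo "usage: $0 {build|add|flush}" >&2; exit 1 ;;
  esac

Adding a route that already exists just fails quietly here; keeping networks.txt under version control, as described below, is what makes rollbacks painless.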
This is how I ended up blocking archive.is, along with a couple of MSS rules. *waves to their admin here on HN*
Obviously, don't block networks you need to reach out to, or consumer ISP networks in countries you want to do business with, unless you are a bastard operator from hell like me. Blackhole routes cost far less CPU than firewall rules and take a minuscule amount of kernel memory. Be sure to also have a URL your visitors can go to in order to grab their IP address should they claim they are blocked. Link to this comment in a Jira and test it in a staging environment; be sure to exercise all the regular outbound connections that production makes from that staging environment.
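For the "what is my IP" URL, something as small as a CGI script will do. This is a generic sketch (the path and CGI setup are illustrative, not anything specific to the setup above):

  #!/bin/sh
  # Hypothetical /cgi-bin/myip: lets visitors who claim they are blocked tell
  # you the address they are actually coming from. REMOTE_ADDR is the standard
  # CGI variable the web server sets to the client address.
  printf 'Content-Type: text/plain\r\n\r\n'
  printf '%s\n' "${REMOTE_ADDR:-unknown}"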
If you do want to block entire countries, without having to look something up on every web hit, use this [4] to feed the blackhole script, with a little data massaging. Everything in the blackhole text file must be versioned and committed to your internal repository so you can tell when a newly added network is breaking things and can roll back quickly. Have a function in the script that deletes all the blackholes.
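The "data massaging" for [4] is mostly stripping the comment headers from a per-country netset file. A sketch, assuming the file naming in that repo and a git-tracked block list (both assumptions):

  #!/bin/sh
  # Hypothetical helper: fold one per-country list from [4] into the versioned
  # blackhole file. The netset file name is an assumption; substitute the
  # country you actually want to drop.
  SET="ipip_country_xx.netset"
  BLOCKFILE="/etc/blackhole/networks.txt"

  # Netset files are one CIDR per line plus '#' comment headers; keep the CIDRs.
  grep -Ev '^[[:space:]]*(#|$)' "$SET" >> "$BLOCKFILE"

  # Commit every change so a breaking network can be pinpointed and reverted.
  git -C "$(dirname "$BLOCKFILE")" add "$(basename "$BLOCKFILE")"
  git -C "$(dirname "$BLOCKFILE")" commit -m "blackhole: add ${SET}"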
I blackhole about 27,000 CIDR blocks on a tiny VM web server and the script takes 20 seconds to complete.
My peers and I in the professional community I'm part of have been finding this traffic to be very distributed: it has evolved to often be thousands of IPs making only a few requests each, spread well across the planet, and not infrequently coming from ASNs that include consumer traffic. We suspect it's malware-controlled botnets. Trying to block by range or ASN frequently just results in the traffic hopping to new ranges; blocking ASNs hasn't been working for us.
"LLM crawler" is just a guess, although it's my best guess too. What I can say is that they appear to be scrapers/crawlers that are really stupid and unconcerned that they're spending lots of time on useless or duplicate content; apparently whoever is running them doesn't care about the cost. (Another thing that would be congruent with them running on botnets of stolen hardware.)
It has been absurd lately. I have a tiny static blog that I haven't updated in over a year, cgit, and a few other things that I host publicly. Sam Altman has been so excited to browse my blog that it regularly spins up the fans on my home R710. I had to block him for the sake of my sanity; hopefully he's not too disappointed that he'll be missing my blog post this year :'(
This has become a very widespread problem. I still haven't seen much coverage of it in a comprehensive way.
Many, many of us are seeing it.
[1] - https://bgp.tools/
[2] - https://bgp.he.net/
[3] - https://adrianpopagh.blogspot.com/2008/03/route-summarizatio...
[4] - https://github.com/firehol/blocklist-ipsets/tree/master/ipip...
There's more discussion on Reddit, with other people chiming in with similar experiences: https://old.reddit.com/r/programming/comments/1jdbnq2/llm_cr...
(As I write this, there are only 2 other comments in this HN thread.)
If the crawler respects 301 status codes, you can literally make a party like OpenAI DDoS themselves :>
Disclaimer: This should be considered satire, not advice.
I cannot wait for sites to provide poisoned results to known LLMs.