Blackhole all their ASNs and be done with it. It sounds like the crawlers are not adding anything back to SourceHut. This exercise takes a while the first time around: take one IP from each offending CIDR block, look up all of the IPs [1][2] that network owns, aggregate them [3] with a script that summarizes them into larger CIDR blocks, and then have a cron "@reboot" job or systemd unit file run a script that performs:
ip route add blackhole "${ClownNet}" # in a loop through a text file
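Roughly what such a script can look like, as a minimal sketch rather than the author's actual setup. The file path, the placeholder ASN (AS64500), and the use of the RADb whois mirror plus aggregate(1) for the one-time prefix lookup are all assumptions:

  #!/bin/sh
  # Hypothetical /usr/local/sbin/blackhole.sh. Run "blackhole.sh add" from a
  # cron "@reboot" entry or a systemd oneshot unit; keep the CIDR list in git.
  BLOCKFILE="/etc/blackhole/networks.txt"   # illustrative path

  # First-time exercise: dump the IPv4 prefixes an ASN originates (RADb
  # mirrors the data behind [1]/[2]) and summarize them [3]. AS64500 is a
  # documentation ASN used here purely as a placeholder.
  build() {
      whois -h whois.radb.net -- '-i origin AS64500' \
          | awk '/^route:/ {print $2}' \
          | aggregate >> "$BLOCKFILE"       # aggregate(1) or any summarizer
  }

  # Boot time: loop through the text file and add one blackhole route per CIDR.
  add() {
      while read -r ClownNet; do
          case "$ClownNet" in ""|\#*) continue ;; esac   # skip blanks/comments
          ip route add blackhole "$ClownNet" 2>/dev/null || true
      done < "$BLOCKFILE"
  }

  # The delete-everything function mentioned below, for quick rollbacks.
  flush() {
      ip -4 route show type blackhole | awk '{print $2}' \
          | xargs -r -n 1 ip route del blackhole
  }

  case "$1" in
      build) build ;;
      add)   add ;;
      flush) flush ;;
      *)     echo "usage: $0 {build|add|flush}" >&2; exit 1 ;;
  esac

Adding a route that already exists just fails quietly here; keeping networks.txt under version control, as described below, is what makes rollbacks painless.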
This is how I ended up blocking archive.is, along with a couple of MSS rules. *waves to their admin here on HN*
Obviously, don't block networks you need to reach out to, or consumer ISP networks in countries you want to do business with, unless you are a bastard operator from hell like me. Blackhole routes cost far less CPU than firewall rules and take a minuscule amount of kernel memory. Be sure to also have a URL your visitors can go to in order to grab their IP address should they claim they are blocked. Link to this comment in a Jira and test it in a staging environment; be sure to exercise all the regular outbound connections that production makes from that staging environment.
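For the "what is my IP" URL, something as small as a CGI script will do. This is a generic sketch (the path and CGI setup are illustrative, not anything specific to the setup above):

  #!/bin/sh
  # Hypothetical /cgi-bin/myip: lets visitors who claim they are blocked tell
  # you the address they are actually coming from. REMOTE_ADDR is the standard
  # CGI variable the web server sets to the client address.
  printf 'Content-Type: text/plain\r\n\r\n'
  printf '%s\n' "${REMOTE_ADDR:-unknown}"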
If you do want to block entire countries, without having to look something up on every web hit, use this [4] to feed the blackhole script, with a little data massaging. Everything in the blackhole text file must be versioned and committed to your internal repository so you can tell when a newly added network is breaking things and can roll back quickly. Have a function in the script that deletes all the blackholes.
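The "data massaging" for [4] is mostly stripping the comment headers from a per-country netset file. A sketch, assuming the file naming in that repo and a git-tracked block list (both assumptions):

  #!/bin/sh
  # Hypothetical helper: fold one per-country list from [4] into the versioned
  # blackhole file. The netset file name is an assumption; substitute the
  # country you actually want to drop.
  SET="ipip_country_xx.netset"
  BLOCKFILE="/etc/blackhole/networks.txt"

  # Netset files are one CIDR per line plus '#' comment headers; keep the CIDRs.
  grep -Ev '^[[:space:]]*(#|$)' "$SET" >> "$BLOCKFILE"

  # Commit every change so a breaking network can be pinpointed and reverted.
  git -C "$(dirname "$BLOCKFILE")" add "$(basename "$BLOCKFILE")"
  git -C "$(dirname "$BLOCKFILE")" commit -m "blackhole: add ${SET}"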
I blackhole about 27,000 CIDR blocks on a tiny VM web server and the script takes 20 seconds to complete.
My peers and I in the professional community I'm part of have been finding this traffic to be very distributed: it has evolved to often be thousands of IPs making only a few requests each, spread well across the planet, and not infrequently coming from ASNs that include consumer traffic. We suspect it's malware-controlled botnets. Trying to block by range or ASN frequently just results in the traffic hopping to new ranges; blocking ASNs hasn't been working for us.
"LLM crawler" is just a guess, although it's my best guess too. What I can say is that they appear to be scrapers/crawlers that are really stupid and unconcerned that they're spending lots of time on useless or duplicate content; apparently whoever is running them doesn't care about the cost. (Another thing that would be congruent with them running on botnets of stolen hardware.)
It has been absurd lately. I have a tiny static blog that I haven't updated in over a year, cgit, and a few other things that I host publicly. Sam Altman has been so excited to browse my blog that it regularly spins up the fans on my home R710. I had to block him for the sake of my sanity; hopefully he's not too disappointed that he'll be missing my blog post this year :'(
This has become a very widespread problem. I still haven't seen much coverage of it in a comprehensive way.
Many, many of us are seeing it.
[1] - https://bgp.tools/
[2] - https://bgp.he.net/
[3] - https://adrianpopagh.blogspot.com/2008/03/route-summarizatio...
[4] - https://github.com/firehol/blocklist-ipsets/tree/master/ipip...
There's more discussion on Reddit, with other people chiming in with similar experiences: https://old.reddit.com/r/programming/comments/1jdbnq2/llm_cr...
(As I write this, there are only 2 other comments in this HN thread.)
If the crawler respects 301 status codes, you can literally make a party like OpenAI DDoS themselves :>
Disclaimer: This should be considered satire, not advice.
I cannot wait for sites to provide poisoned results to known LLMs.