What I like about this approach is that it quietly reframes the problem from “detect AI” to “make abusive access patterns uneconomical”. A simple JS+cookie gate is basically saying: if you want to hammer my instance, you now have to spin up a headless browser and execute JS at scale. That’s cheap for humans, expensive for generic crawlers that are tuned for raw HTTP throughput.
The deeper issue is that git forges are pathological for naive crawlers: every commit/file combo is a unique URL, so one medium repo explodes into Wikipedia-scale surface area if you just follow links blindly. A more robust pattern for small instances is to explicitly rate limit the expensive paths (/raw, per-commit views, “download as zip”), and treat “AI” as an implementation detail. Good bots that behave like polite users will still work; the ones that try to BFS your entire history at line rate hit a wall long before they can take your box down.
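For the curious, a minimal sketch of that JS+cookie gate pattern (not the article's actual code; the cookie name and value here are made up, and a real setup would want something less trivially replayable):

```python
# Minimal sketch of a JS+cookie gate (hypothetical names, not the article's exact setup).
# Requests without the cookie get a tiny page whose inline JS sets it and reloads;
# anything that can't execute JS never gets past this page.
from http.cookies import SimpleCookie
from wsgiref.simple_server import make_server

GATE_COOKIE = "gate_pass"   # assumed cookie name
GATE_VALUE = "1"            # static value for the sketch; a real setup might sign or rotate it

CHALLENGE_PAGE = b"""<!doctype html>
<script>
  document.cookie = "gate_pass=1; path=/; max-age=86400";
  location.reload();
</script>
<noscript>Please enable JavaScript to view this site.</noscript>"""

def gate(app):
    def middleware(environ, start_response):
        cookies = SimpleCookie(environ.get("HTTP_COOKIE", ""))
        if cookies.get(GATE_COOKIE) and cookies[GATE_COOKIE].value == GATE_VALUE:
            return app(environ, start_response)       # cookie present: pass through
        start_response("200 OK", [("Content-Type", "text/html")])
        return [CHALLENGE_PAGE]                       # otherwise serve the JS challenge
    return middleware

def site(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello, human (or headless browser)\n"]

if __name__ == "__main__":
    make_server("", 8080, gate(site)).serve_forever()
```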
Yeah, this is where I landed a while ago. What problem am I _really_ trying to solve?
For some people it's an ideological one--we don't want AI vacuuming up all of our content. For those, "is this an AI user?" is a useful question to answer. However it's a hard one.
For many the problem is simply "there's a class of users putting way too much load on the system and it's causing problems". Initially I was playing whack-a-mole with this and dealing with alerts firing on a regular basis because of Meta crawling our site very aggressively, not backing off when errors were returned, etc.
I looked at rate limiting but the work involved in distributed rate limiting versus the number of offenders involved made the effort look a little silly, so I moved towards a "nuke it from orbit" strategy:
Requests are bucketed by class C subnet (31.13.80.36 -> 31.13.80.x) and request rate is tracked over 30-minute windows. If the request rate over that window exceeds a very generous threshold (I've only seen a few very obvious and poorly behaved crawlers exceed it), it fires an alert.
The alert kicks off a flow where we look up the ASN covering every IP in that range, look up every range associated with those ASNs, and throw an alert in Slack with a big red "Block" button attached. When approved, the entire ASN is blocked at the edge.
It's never triggered on anything we weren't willing to block (e.g., a local consumer ISP). We've dropped a handful of foreign providers, some "budget" VPS providers, some more reputable cloud providers, and Facebook. It didn't take long before the alerts stopped--both for high request rates and our application monitoring seeing excessive loads.
If anyone's interested in trying to implement something similar, there's a regularly updated database of ASN <-> IP ranges announced here: https://github.com/ipverse/asn-ip
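If anyone wants a starting point, here's a rough sketch of the /24 bucketing and 30-minute window described above (the threshold is made up, and the ASN lookup and Slack button are left out):

```python
# Sketch of /24 bucketing with a sliding 30-minute window (numbers are invented).
import ipaddress
import time
from collections import defaultdict, deque

WINDOW = 30 * 60        # seconds
THRESHOLD = 50_000      # requests per /24 per window; pick something "very generous"

hits = defaultdict(deque)

def bucket(ip):
    """31.13.80.36 -> 31.13.80.0/24"""
    return str(ipaddress.ip_network(f"{ip}/24", strict=False))

def record(ip, now=None):
    """Record one request; return True when this /24 first crosses the threshold."""
    now = now or time.time()
    q = hits[bucket(ip)]
    q.append(now)
    while q and q[0] < now - WINDOW:   # drop hits older than the window
        q.popleft()
    return len(q) == THRESHOLD

# On alert, the next step in the flow above would be: IP -> ASN -> all ranges for
# that ASN (e.g. from the ipverse/asn-ip dataset) -> a "Block" button in Slack.
```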
> If anyone's interested in trying to implement something similar, there's a regularly updated database of ASN <-> IP ranges announced here: https://github.com/ipverse/asn-ip
What exactly is the source of these mappings? Never heard of ipverse before; it seems to be a semi-anonymous GitHub organization, and their website has had a failing certificate for more than a year now.
Whois (delegation files), according to the embedded blog post, e.g. https://ftp.arin.net/pub/stats/arin/delegated-arin-extended-...
You ban the ASN permanently in this scenario?
I don't know. Use PAT. The long-term solution is Web Environment Integrity by another name.
It depends what your goal is.
Having to use a browser to crawl your site will slow down naive crawlers at scale.
But it wouldn't do much against individuals typing "what is a kumquat" into their local LLM tool that issues 20 requests to answer the question. They're not really going to care nor notice if the tool had to use a playwright instance instead of curl.
Yet it's that use case that is responsible for ~all of my AI bot traffic according to Cloudflare, and it's 30x the traffic from direct human users. In my case, being a forum, it made more sense to just block the traffic.
Maybe a stupid question but how can Cloudflare detect what portion of traffic is coming from LLM agents? Do agents identify themselves when they make requests? Are you just assuming that all playwright traffic originated from an agent?
I'm curious about whether there are well-coded AI scrapers that have logic for "aha, this is a git forge, git clone it instead of scraping, and git fetch on a rescrape". Why are there apparently so many naive crawlers out there (naive in crawl strategy, even though they're coded to be massively parallel and botnet-like, which is not naive in that respect)?
I'm not an industry insider and not the source of this fact, but it's been previously stated that the traffic cost of fetching the current data for each training run is cheaper than caching it locally in any way, whether it's a git repo, static sites, or any other content available over HTTP.
This seems nuts and suggests maybe the people selling AI scrapers their bandwidth could get away with charging rather more than they do :)
If they're handling it as “website, don't care” (because they're training on everything online) they won't know.
If they're treating it specifically as a code forge (because they're after coding use cases), there's lots of interesting information that you won't get by just cloning the repo.
It's not just the current state of the repo, or all the commits (and their messages). It's the initial issue (and discussion) that led to a pull request (and review comments) that eventually gets squashed into a single commit.
The way you code with an agent is a lot more similar to that issue, comments, change, review, refinement sequence, which you only get by slurping the website.
I'd see this as coming down to incentives. If you can scrape naively and it's cheap, what's the benefit to you in doing something more efficient for git forges? How many other edge cases are there where you could potentially save a little compute/bandwidth but would need to implement a whole other set of logic?
Unfortunately, this kind of scraping seems to inconvenience the host way more than the scraper.
Another tangent: there probably are better behaved scrapers, we just don't notice them as much.
True, and it doesn't get mentioned enough. These supposedly world-changing advanced tech companies sure look sloppy as hell from here. There is no need for any of this scraping.
I guess they're vibe coded :D
what's next: you can only read my content after mining btc and wiring it to $wallet->address
I really don't know how effective my little system would be against these scrapers, but I've set up a system that blocks IP addresses if they've attempted to connect to ports on my system(s) behind which there are no services, so their connections must be 'uninvited', which I classify as malicious.
Since I do actually host a couple of websites / services behind port 443, I can't just block everything that tries to scan my IP address on port 443. However, I've set up Cloudflare in front of those websites, so I log and block any non-Cloudflare traffic (anything outside Cloudflare's ASN, 13335) coming into port 443.
I also log and block IP addresses attempting to connect on port 80, since that's essentially deprecated.
This, of course, does not block traffic coming via the DNS names of the sites, since that will be routed through Cloudflare - but as someone mentioned, Cloudflare has its own anti-scraping tools. And then as another person mentioned, this does require the use of Cloudflare, which is a vast centralising force on the Internet and therefore part of a different problem...
I don't currently split out a separate list for IP addresses that have connected to HTTP(S) ports, but maybe I'll do that over Christmas.
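For reference, a rough sketch of the "only Cloudflare may reach :443" idea using Cloudflare's published IP range lists rather than ASN lookups (the output is just a printed allowlist; translating it into actual firewall rules is up to you):

```python
# Sketch: fetch Cloudflare's published edge ranges and print an allowlist for port 443.
# Everything not on the allowlist would then be dropped by a final catch-all rule.
import urllib.request

CF_LISTS = [
    "https://www.cloudflare.com/ips-v4",
    "https://www.cloudflare.com/ips-v6",
]

def cloudflare_ranges():
    for url in CF_LISTS:
        with urllib.request.urlopen(url, timeout=10) as resp:
            for line in resp.read().decode().splitlines():
                line = line.strip()
                if line:
                    yield line

if __name__ == "__main__":
    for cidr in cloudflare_ranges():
        print(f"allow 443 from {cidr}")
    print("drop 443 from any  # everything not on the allowlist")
```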
This is my current simple project: https://github.com/UninvitedActivity/UninvitedActivity
Apologies if the README is a bit rambling. It's evolved over time, and it's mostly for me anyway.
P.S. I always thought it was Yog Sothoth (not Sototh). Either way, I'm partial to Nyarlathotep. "The Crawling Chaos" always sounded like the coolest of the elder gods.
Some scrapers/scanners use residential IPs. Aren't you worried you'll end up blocking legitimate traffic?
Regarding the Cloudflare part of this, I’d recommend taking a look at “Authenticated Origin Pulls”. It lets you perform your validation at the TLS layer instead of doing it with IP ACLs if that interests you.
We ran into similar issues with aggressive crawling. What helped was rate limiting combined with making intent explicit at the entry point, instead of letting requests fan out blindly. It reduced both load and unexpected edge cases.
What do you mean by "making intent explicit at the entry point"?
I'm getting lots of connections every day from Singapore. It's now the main country... despite the whole website being French-only. AI crawlers, for sure.
Thanks for this tip.
Amazonbot does this despite my efforts in robots.txt to help it out. I look at all the Singapore requests and they’re Amazonbot trying to get various variants of the Special:RecentChanges page. You’re wasting your time, Amazonbot. I’m trying to help you.
Did you check the IP address of this UA?
Yeah, a while ago, they're all Singapore reporting Amazonbot. Here is an example request:
The actual IP is in X-Forwarded-For and I didn't keep that.

Fun fact: you don't get rid of them even when you put a captcha on all visitors from Singapore. I still see a spike in traffic that perfectly matches the spike in served captchas, but this time it's geographically distributed between places like Iraq, Bangladesh and Brazil.
Hopefully it at least costs them a little bit more.
Usually, there are multiple layers of different counter-protection measures. If you block by country, they shift to different IP ranges, if you block by IP, they might use a new IP for every request, and escalate further depending on the bot owner and your actions.
Yeah same for my Gitea instance. These were all ByteDance and Tencent ASNs from some AWS-equivalent. Blocked the whole subnet belonging to them in my server's ufw and haven't had any problems since then. Same for Vultr and Google Cloud.
Oh hey, I wrote the "you don't need anubis" post you (or the post author, if that's not you) got inspiration from! Glad to hear it helped!
Can someone help me understand where all this traffic is coming from? Are there thousands of companies all doing it simultaneously? How come even small sites get hammered constantly? At some point haven't you scraped the whole thing?
> At some point haven't you scraped the whole thing?
Git forges will expose a version of every file at every commit in the project's history. If you have a medium-sized project consisting of, say, 1,000 files and 10,000 commits, the crawler will discover on the order of 10 million URLs (1,000 files × 10,000 commits), which is the same order of magnitude as English Wikipedia, just for that one project. This is also very expensive for the git forge, as it needs to reconstruct the historical files from a bunch of commits.
Git forges interact spectacularly poorly with naively implemented web crawlers, unless the crawlers put in logic to avoid exhaustively crawling git forges. You honestly get a pretty long way just excluding URLs with long base64-like path elements, which isn't hard but it's also not obvious.
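A minimal sketch of that "skip long base64/hex-looking path elements" heuristic (the length cutoffs and character classes here are guesses; tune them for your forge):

```python
# Skip URLs whose path contains a segment that looks like a commit hash / blob id.
import re
from urllib.parse import urlparse

BLOBBY = re.compile(r"^[0-9a-fA-F]{20,}$|^[A-Za-z0-9+/_-]{24,}={0,2}$")

def looks_like_forge_deep_link(url):
    """True if any path segment looks like a long hex or base64 identifier."""
    return any(BLOBBY.match(seg) for seg in urlparse(url).path.split("/") if seg)

assert looks_like_forge_deep_link(
    "https://git.example.org/repo/blob/9ae1f3c47d0b8e6a2c5f4d7b1a0e9c8d7f6a5b4c/README.md")
assert not looks_like_forge_deep_link("https://git.example.org/repo/issues/42")
```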
> How come even small sites get hammered constantly?
Because big sites have decades of experience fighting against scrapers and have recently upped their game significantly (even when doing so carries some SEO costs) so that they're the only ones that can train AI on their own data.
So now, when you're starting from scratch and your goal is to gather as much data as possible, targeting smaller sites with weak / non-existent scraping protection is the path of least resistance.
No I meant like, if you have a blog with 10 posts.. do they just scrape the same 10 pages thousands of times?
Because people are reporting constant traffic, which would imply that the site is being scraped millions of times per year. How does that make any sense? Are there millions of AI companies?
Basically the scrapers do not bother to cache your website, or if they do, it's with an insanely low TTL. They also don't specialize by content, so the worst-hit sites are things like git hosting, due to the BFS-style scrape (following every link). The worst part is that a lot of this is done via tunneling, so the IP can be different each time or come from residential IPs, which makes it annoying.
AI companies scrape to:
- have data to train on
- update the data more or less continuously
- answer queries from users on the fly
With a lot of AI companies, that generates a lot of scraping. Also, some of them behave terribly when scraping, or are just bad at it.
Why don’t they scrape once though?
1) It may be out of date. 2) Storing it costs money.
It's not just companies either, a lot of people run crawlers for their home lab projects too.
It isn't only companies, it is a mass social movement. Anyone with basic coding experience can download some basic learning apparatus and start feeding it material. The latest LLMs make it extremely easy to compose code that scrapes internet sites, so only the most minimal skills are required. Because everything is "AI" now, aspiring young people are encouraged to do this in order to gain experience so they can get jobs and careers in the new AI-driven economy.
Maybe the teams developing AI crawlers are dogfooding and are using the AI itself (and its small context) to keep track of the sites that are already scraped. /s
I think what gets lost in this is that we should expect a lot more traffic from AI, simply because if I ask an AI to answer my question it will do a lot more work and fetch from a lot more websites to generate a reply. And yes, searching over git repos will absolutely be part of that.
This is all "legitimate" traffic in that it isn't about crawling the internet but in service of a real human.
Put another way, search is moving from a model of crawl the internet and query on cached data to being able to query on live data.
I agree, and I think everyone agreeing or disagreeing with you (and sysadmins everywhere) would be perfectly fine with these AI crawlers (well, mostly...) if these corporations wrote them properly, followed best practices and standards, and didn't effectively DDoS servers or pretend to be what they aren't. Because that is, ultimately, what these AI companies are: very effective, for-sale, legal DDoSers. But they are not written properly, they do not follow best practices and standards, they DDoS everything you aim them at, and they even go as far as pretending to be things they aren't, hiding behind residential IP addresses (which I suspect could be illegal, since it risks getting people who have no idea what AI even is in trouble), etc. I don't think AI will replace search now, simply because so much of the web is already blocked to these crawlers, and that will only increase. And honestly, I doubt there is anything these AI companies could do to make sysadmins trust them again.
In some ways that's true.
But when it comes to git repos, an LLM agent like Claude Code can just clone them for local crawling, which is far better than crawling remotely, and it's the "Right Way" for various reasons.
Frankly I suspect AI agents will push search in the opposite direction from your comment and move us to distributed cache workflows. These tools just hit the origin because it's the easy solution of today, not because the data needs to be up to date to the millisecond.
Imagine a system where all those Fetch(url) invocations interact with a local LRU cache. That'd be really nice, and I think that's where we'd want to go, especially once more and more origin servers try to block automated traffic.
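A toy sketch of what a cached Fetch(url) might look like (in-process only; the cache size and TTL are arbitrary):

```python
# In-process LRU cache with a TTL, so repeated agent lookups don't re-hit the origin.
import time
import urllib.request
from collections import OrderedDict

class CachedFetcher:
    def __init__(self, max_entries=1024, ttl=15 * 60):
        self.max_entries, self.ttl = max_entries, ttl
        self._cache = OrderedDict()          # url -> (fetched_at, body)

    def fetch(self, url):
        now = time.time()
        hit = self._cache.get(url)
        if hit and now - hit[0] < self.ttl:
            self._cache.move_to_end(url)     # refresh LRU position
            return hit[1]
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read()
        self._cache[url] = (now, body)
        self._cache.move_to_end(url)
        if len(self._cache) > self.max_entries:
            self._cache.popitem(last=False)  # evict least recently used
        return body
```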
You can also add honeypot URLs to your robots.txt to trap bots that are using it as an index.
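Something like this, assuming a common-log-format access log and a made-up honeypot path that's also listed under Disallow in robots.txt:

```python
# Sketch: collect IPs of clients that requested the honeypot path (which no human
# should ever visit), so they can be fed into a firewall or denylist.
import re

HONEYPOT = "/definitely-not-for-robots/"     # also listed as Disallow: in robots.txt
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|HEAD|POST) (\S+)')

def offenders(access_log_path):
    ips = set()
    with open(access_log_path) as fh:
        for line in fh:
            m = LOG_LINE.match(line)
            if m and m.group(2).startswith(HONEYPOT):
                ips.add(m.group(1))
    return ips

if __name__ == "__main__":
    for ip in sorted(offenders("/var/log/nginx/access.log")):
        print(ip)    # feed these into your firewall / denylist of choice
```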
I use the same exact trick from the source the article mentions.
I call it `temu` anubis. https://github.com/rhee876527/expert-octo-robot/blob/f28e48f...
Jokes aside, the whole web seems to be trending towards some kind of wall (pay, login, app etc.) and this ultimately sucks for the open internet.
You missed the obvious portmanteau:
Temubis
Unfortunately this means, my website could only be seen if you enable javascript in your browser.
Or have a web-proxy that matches on the pattern and extracts the cookie automatically. ;-)
I think it would be really cool if someone built a reverse proxy just for dealing with these bad actors.
I would really like to easily serve some Markov chain nonsense to AI bots.
Perhaps Iocaine [1] is what you're looking for. See the demo page [2] for what it serves to AI crawlers.
1. https://iocaine.madhouse-project.org/
2. https://poison.madhouse-project.org/
This site blocked me right away; seems quite aggressive.
For images you have stuff like https://nightshade.cs.uchicago.edu/whatis.html
Seems like a good way to waste tons of your bandwidth. Almost every serious data pipeline has some quality filtering in there (even open-source ones like FineWeb and EduWeb). And the stuff Iocaine generates instantly gets filtered.
Feel free to test this with any classifier or cheapo LLM.
HTTP 412 would be better, I guess...
You shouldn't really serve aggressive scrapers any kind of error or otherwise unusual response, because they'll just take that as a signal to try again with a different IP address or user agent, or a residential proxy, or a headless browser, or whatever else. There's no obligation to be polite to rude guests, give them a 200 OK containing the output of a Markov chain trained on the Bee Movie script instead.
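For anyone tempted, a toy word-level Markov babbler (the corpus file is whatever text you don't mind sharing; this is a sketch, not a hardened tarpit):

```python
# Tiny word-level Markov chain: train on a corpus, emit plausible-looking nonsense.
import random
from collections import defaultdict

def train(text):
    chain = defaultdict(list)
    words = text.split()
    for a, b in zip(words, words[1:]):
        chain[a].append(b)
    return chain

def babble(chain, length=200):
    word = random.choice(list(chain))
    out = [word]
    for _ in range(length):
        nxt = chain.get(word)
        word = random.choice(nxt) if nxt else random.choice(list(chain))
        out.append(word)
    return " ".join(out)

if __name__ == "__main__":
    corpus = open("bee_movie.txt").read()    # any corpus you don't mind serving
    print(babble(train(corpus)))
```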
Unless your output is static, you'd then be paying the cost of running the Markov generator.
> Unfortunately this means, my website could only be seen if you enable javascript in your browser.

I feel this is acceptable.
I wouldn't be surprised if all this AI stuff was just a global conspiracy to get everyone to turn on JS.
tirreno (1) guy here.
Our open-source system can block IP addresses based on rules triggered by specific behavior.
Can you elaborate on what exact type of crawlers you would like to block? Like, a leaky bucket of a certain number of requests per minute?
1. https://github.com/tirrenotechnologies/tirreno
I believe there is a slight misunderstanding regarding the role of 'AI crawlers'.
Bad crawlers have been there since the very beginning. Some of them looking for known vulnerabilities, some scraping content for third-party services. Most of them have spoofed UAs to pretend to be legitimate bots.
This is approximately 30–50% of traffic on any website.
The article is about AI web crawlers. How can your tool help and how would one set it up for this specific context?
I don't see how an AI crawler is different from any others.
The simplest approach is to count the UA as risky or flag multiple 404 errors or HEAD requests, and block on that. Those are rules we already have out of the box.
It's open source, there's no pain in writing specific rules for rate limiting, thus my question.
Plus, we have developed a dashboard for manually choosing UA blocks based on name, but we're still not sure if this is something that would be really helpful for website operators.
>It's open source, there's no pain in writing specific rules for rate limiting, thus my question.
Depends on the goal.
Author wants his instance not to get killed. Request rate limiting may achieve that easily in a way transparent to normal users.
> count the UA as risky
It's trivial to spoof UAs unfortunately.
> block IP addresses based on rules triggered by specific behavior
Problem is, bots can easily resort to residential proxies, at which point you'll end up blocking legitimate traffic.
My issue with Gitea (which Forgejo is a fork of) was that crawlers would hit the "download repository as zip" link over and over. Each access creates a new zip file on disk which is never cleaned up. I disabled that (by setting the temporary zip directory to read-only, so the feature won't work) and haven't had a problem since then.
It's easy to assume "I received a lot of requests, therefore the problem is too many requests" but you can successfully handle many requests.
This is a clever way of doing a minimally invasive botwall though - I like it.
It used to be like that, but they changed it to a POST request a while ago.
> Each access creates a new zip file on disk which is never cleaned up.
That sounds like a bug.
I think that’s been fixed in Forgejo a long time ago
> you can successfully handle many requests.
There is a point where your web server becomes fast enough that the scraping problem becomes irrelevant. Especially at the scale of a self-hosted forge with a constrained audience. I find this to be a much easier path.
I wish we could find a way to not conflate the intellectual property concerns with the technological performance concerns. It seems like this is essential to keeping the AI scraping drama going in many ways. We can definitely make the self hosted git forge so fast that anything short of ~a federal crime would have no meaningful effect.
> There is a point where your web server becomes fast enough that the scraping problem becomes irrelevant.
It isn't just the volume of requests, but also bandwidth. There have been cases where scraping represents >80% of a forge's bandwidth usage. I wouldn't want that to happen to the one I host at home.
Sure but how much bandwidth is that actually? Of course if your normal traffic is pretty low, it's easy for bot traffic to multiply that by 5, but it doesn't mean it's actually a problem.
The market price for bandwidth in a central location (USA or Europe) is around $1-2 per TB and less if you buy in bulk. I think it's somewhat cheaper in Europe than in the USA due to vastly stronger competition. Hetzner includes 20TB outgoing with every Europe VPS plan, and 1€/TB +VAT overage. Most providers aren't quite so generous but still not that bad. How much are you actually spending?
Maybe it is fast enough, but my objection is mostly to the gross inefficiency of crawlers: requesting downloads of whole repositories over and over, wasting CPU cycles to create the archives, storage space to retain them on disk, and bandwidth to send them over the wire. Add this to the gross power consumption of AI and its hogging of physical compute hardware, and it is easy to see "AI" as wasteful.
We should just have a standard for crawlable, archived versions of pages with no back end or DB interaction behind them. For example, if there's a reverse proxy, whatever it outputs gets archived, and the archive version never passes any call through to the backend; same for translating the output of any dynamic JS into fully static HTML. Then add some proof-of-work that works without JS and is a web standard (e.g. the server sends a header, the client sends the correct response and gets access to the archive), mainstream a culture of low-cost hosting for such archives, make sure this feature is enabled in the most basic configuration of all web servers and logged separately, and you're done.
Obviously such a thing will never happen, because the web and its culture went in a different direction. But if it were a mainstream thing, you'd get easy-to-consume archives (also useful for regular archival and data hoarding) and the "live" versions of sites wouldn't have their logs bogged down by stupid spam.
Or if PoW were a proper web standard with no JS, then people who want to tell AI and other crawlers to fuck off could at least make it uneconomical to crawl their stuff en masse. In my view, proof of work that works through headers should, in today's world, be as ubiquitous as TLS.
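To make the header idea concrete, here's a sketch of what such a scheme could look like; the header names and difficulty are invented, and nothing like this is an actual standard today:

```python
# Hash-based proof-of-work: cheap for the server to verify, costly for the client to produce.
import hashlib
import os

DIFFICULTY_BITS = 20   # client must find a nonce; ~2^20 hashes of work on average

def new_challenge():
    return os.urandom(16).hex()          # server sends e.g. X-PoW-Challenge: <this>

def solve(challenge):
    """Client side: brute-force a nonce (sent back as e.g. X-PoW-Response)."""
    nonce = 0
    while not _ok(challenge, nonce):
        nonce += 1
    return nonce

def verify(challenge, nonce):
    """Server side: a single hash to check."""
    return _ok(challenge, nonce)

def _ok(challenge, nonce):
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    value = int.from_bytes(digest, "big")
    return value >> (256 - DIFFICULTY_BITS) == 0   # require DIFFICULTY_BITS leading zero bits

if __name__ == "__main__":
    c = new_challenge()
    n = solve(c)                 # takes a moment; that's the point
    assert verify(c, n)
```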
Never heard of Forgejo. Should one switch from Gitea?
It's a fork of Gitea.
A similar approach is to have the proxy/webserver itself write a cookie when you visit some path, e.g. example.net/sesame/open.
For a single user or a small team this could be enough.
I'm glad the author clarified he wants to prevent his instance from crashing not simply "block robots and allow humans".
I think the idea that you can block bots and allow humans is fallacious.
We should focus on the specific behaviour that causes problems (like making a bajillion requests, one for each commit, instead of cloning the repo), and block clients that behave that way. If these bots learn to request at a reasonable pace, who cares whether they are bots, humans, bots under the control of an individual human, or bots owned by a huge company scraping for training data? Once you make your code (or anything else) public, trying to limit access to only a certain class of consumers is a waste of effort.
Also, perhaps I'm biased, because I run SearXNG and Crawl4AI (and a few ancillaries like Jina reranking, etc.) in my homelab, so I can tell my AI to perform live internet searches as well as fetch any website. For code it has a way to clone things, but for issues, discussions, and PRs it goes mostly to GitHub.
I like that my AI can browse almost like me. I think this is the future way to consume a lot of the web (except sites like this one that are an actual pleasure to use).
The models sometimes hit sites they can't fetch. For this I use Firecrawl. I use an MCP proxy that lets me rewrite the tool descriptions, so my models get access to both my local Crawl4AI and the hosted (and rather expensive) Firecrawl, but they are told to use Firecrawl only as a last resort.
The more people use these kinds of solutions, the more incentive there will be for sites not to block users that use automation. Of course they will have to rely on alternative monetisation methods, but I think eventually these stupid captchas will disappear and reasonable rate limiting will prevail.
> I think this is the future way to consume a lot of the web
I think I see many prompt injections in your future. Like captchas with a special bypass solution just for AIs that leads to special content.
And people who block AI crawlers on moral grounds?
Cloudflare has a solution to protect routes from crawlers.
https://blog.cloudflare.com/introducing-pay-per-crawl/
Sure, but the whole point of self-hosting forgejo is to not use these big cloud solutions. Introducing cloudflare is a step back!
The way is same with cloudflare. Cloudflare has a documentation and manifest.
I'm so fascinated by replies like this, it's too random and nonsensical to be a language barrier issue, but it also does not pattern match into LLM generated text. Reminds me of ~2010 era wordpress comment spam.
I've written a post on why I don't think it'll succeed: https://developerwithacat.com/blog/202507/cloudflare-pay-per...
Recently I noticed GitHub trying (but failing) to charge for self-hosted runners, so I found an afternoon to set up a mini PC, install FreeBSD and Gitea on it, and then set up Tailscale so it only listens on its 100.64.x.x IP address.
Since this node isn't publicly accessible, there's no need to worry about AI web crawlers :)