I have a small static site. I haven't touched it in a couple of years.
Even then, I see bot after bot, pulling down about 1/2 GB per day.
Like, I distribute Python wheels from my site, with several release versions X several Python versions.
I can't understand why ChatGPT, PetalBot, and other bots want to pull down wheels, much less the full contents when the header shows it hasn't changed:
Last-Modified: Thu, 25 May 2023 09:07:25 GMT
ETag: "8c2f67-5fc80f2f3b3e6"
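For what it's worth, a cache-respecting client only needs a conditional request to confirm nothing changed. A minimal sketch using Python's requests library, reusing the validators above (the URL is just a placeholder):
```python
import requests

URL = "https://example.org/downloads/somepackage-1.2.3-py3-none-any.whl"  # placeholder

# Send back the validators the server handed out on the previous fetch.
resp = requests.get(URL, headers={
    "If-None-Match": '"8c2f67-5fc80f2f3b3e6"',
    "If-Modified-Since": "Thu, 25 May 2023 09:07:25 GMT",
})

if resp.status_code == 304:
    print("Not modified: nothing downloaded, use the local copy")
else:
    print(f"Changed: re-downloaded {len(resp.content)} bytes")
```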
Well, I know the answer to the second question, as DeVault's title highlights - it's cheaper to re-read the data and re-process the content than set up a local cache.
Externalizing their costs onto me.
I know 1/2 GB/day is not much. It's well under the 500 GB/month I get from my hosting provider. But again, I have a tiny site with only static hosting, and as far as I can tell, the vast majority of transfers from my site are worthless.
Just like accessing 'expensive endpoints like git blame, every page of every git log, and every commit in every repo' seems worthless.
This is a pet peeve of Rachel by the Bay. She sets strict limits on her RSS feed for clients that don't properly use the caching headers she provides. I wonder if anyone has made a WAF that automates this sort of thing.
"I have to review our mitigations several times per day to keep that number from getting any higher. When I do have time to work on something else, often I have to drop it when all of our alarms go off because our current set of mitigations stopped working. ... it’s not long before we’re complaining about the bots and asking if the other has cracked the code to getting rid of them once and for all."
Given that they're actively trying to obfuscate their activity (according to Drew's description), identifying and blocking clients seems unlikely to work. I'd be tempted to de-prioritize the more expensive types of queries (like "git blame") and set per repository limits. If a particular repository gets hit too hard, further requests for it will go on the lowest-priority queue and get really slow. That would be slightly annoying for legitimate users, but still better than random outages due to system-wide overload.
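A rough sketch of that idea (the names and thresholds are invented for illustration, not anything SourceHut actually runs): count recent hits per repository and shunt over-budget repositories, especially on expensive endpoints, onto a slow queue.
```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 300        # hypothetical: look at the last 5 minutes
REQUESTS_BEFORE_SLOW = 200  # hypothetical per-repository budget

_hits = defaultdict(deque)  # repo -> timestamps of recent requests

def queue_for(repo: str, endpoint: str) -> str:
    """Return 'fast' or 'slow' for this request (a toy policy, not a real scheduler)."""
    now = time.monotonic()
    hits = _hits[repo]
    hits.append(now)
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()
    expensive = endpoint in {"blame", "log", "commit"}
    if len(hits) > REQUESTS_BEFORE_SLOW or (expensive and len(hits) > REQUESTS_BEFORE_SLOW // 4):
        return "slow"   # lowest-priority queue: still served, just really slow
    return "fast"
```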
BTW isn't the obfuscation of the bots' activity a tacit admission by their owners that they know they're doing something wrong and causing headaches for site admins? In the copyright world that becomes wilful infringement and carries triple damages. Maybe it should be the same for DoS perpetrators.
Just to clarify, my understanding is that she doesn't block user agent strings, she blocks based on IP and not respecting caching headers (basically, "I know you already looked at this resource and are not including the caching tags I gave to you"). It's a different problem than the original article discusses, but perhaps more similar to @dalke's issue.
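If someone did want to automate that kind of check, a toy version (my own sketch, not what she actually runs) could simply remember which validator each client was given and count how often the client comes back without it:
```python
from collections import defaultdict

_last_etag = {}              # (client_ip, path) -> ETag we served last time
_strikes = defaultdict(int)  # client_ip -> conditional-request violations
STRIKE_LIMIT = 3             # hypothetical threshold

def check_request(client_ip, path, if_none_match, etag_to_serve):
    """Return False if this client should be rate-limited or blocked."""
    key = (client_ip, path)
    previously_served = _last_etag.get(key)
    if previously_served and if_none_match != previously_served:
        # They fetched this before, were given an ETag, and ignored it.
        _strikes[client_ip] += 1
    _last_etag[key] = etag_to_serve
    return _strikes[client_ip] < STRIKE_LIMIT
```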
I don't understand the thing about the cache. Presumably they have a model that they are training; isn't that effectively their cache? Are they retraining the same model on the same data on the basis that doing so will weight higher-ranked pages more heavily, or something? Or is this about training slightly different models?
If they are really just training the same model, and there's no benefit to training it multiple times on that data, then presumably they could use a statistical data structure like https://en.wikipedia.org/wiki/HyperLogLog to check whether they've trained on the site before, based on the Last-Modified header + URI? That would be far cheaper than a cache, and cheaper than rescraping.
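Strictly speaking HyperLogLog only estimates how many distinct things you've seen; a membership structure (a plain set, or a Bloom filter if memory matters) is the closer fit for "have we already ingested this exact version of this page". A toy sketch of that check, keyed on URI + Last-Modified:
```python
import hashlib

_seen = set()  # in production this could be a Bloom filter to bound memory use

def already_scraped(uri: str, last_modified: str) -> bool:
    """True if we've already ingested this exact (URI, Last-Modified) pair."""
    key = hashlib.sha256(f"{uri}\n{last_modified}".encode()).digest()
    if key in _seen:
        return True
    _seen.add(key)
    return False

# Usage: do a cheap HEAD request first, then skip the full fetch if nothing changed.
# if already_scraped(url, head_response.headers.get("Last-Modified", "")): skip()
```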
I was also under the impression that the name of the game with training was to get high quality, curated training sets, which by their nature are quite static? Why are they all still hammering the web?
The clever LLM scrapers sound a lot like residential traffic. How does the author know it's not? Behavior? Volume? Unlikely coincidence?
> random User-Agents that overlap with end-users and come from tens of thousands of IP addresses – mostly residential, in unrelated subnets, each one making no more than one HTTP request over any time period we tried to measure – actively and maliciously adapting and blending in with end-user traffic and avoiding attempts to characterize their behavior or block their traffic.
There are commercial services that provide residential proxies, i.e. you get to tunnel your scraper or bot traffic through actual residential connections. (see: Bright Data, oxylabs, etc.)
They accomplish this by providing home users with some app that promises to pay them money for use of their connection. (see: HoneyGain, peer2profit, etc.)
Interestingly, the companies selling the tunnel service to companies and the ones paying home users to run an app are sometimes different, or at least they use different brands to cater to the two sides of the market. It also wouldn't surprise me if they sold capacity to each other.
I suspect some of these LLM companies (or the ones they outsource data capture to) route some of their traffic through these residential proxy services. It's funny because some of these companies already have a foothold inside homes (Google Nest and Amazon Alexa devices, etc.) but for a number of reasons (e.g. legal) they would probably rather go through a third party.
They could be local LLMs doing search, some SETI@Home-style distributed work, or something else entirely.
I host my blog at Mataroa, and the stats show 2K hits for some pages on some days. I don't think my small blog and writings are that popular. Asking ChatGPT about these subjects with my name results in some hilarious hallucinations, pointing to poor training due to lack of data.
IOW, my blog is also being scraped and used to train LLMs. Not nice. I really share the sentiment with Drew. I never asked/consented for this.
> the stats show 2K hits for some pages on some days
This has been happening since long before LLMs. Fifteen years ago, my blog would see 3k visitors a day on server logs, but only 100 on Google Analytics. Bots were always scraping everything.
> I never asked/consented for this.
You put it on the information super highway baby
> You put it on the information super highway baby
...with proper licenses.
Here, FTFY, baby.
Addenda: Just because you don't feel like obeying/honoring them doesn't make the said licenses moot and toothless. I mean, if you know, you know.
Between Microsoft and Google, my existence AND presence as a community open source developer is being scraped and stolen.
I've been trying to write a body of audio code that sounds better than the stuff we got used to in the DAW era, doing things like dithering the mantissa of floating-point words, just experimental stuff ignoring the rules. Never mind if it works: I can think it does, but my objection holds whether it does or not.
Firstly, if you rip my stuff halfway it's pointless: without the coordinated intention towards specific goals not corresponding with normally practiced DSP, it's useless. LLMs are not going to 'get' the intention behind what I'm doing while also blending it with the very same code I'm a reaction against, the code that greatly outnumbers my own contributions. So even if you ask it to rip me off it tries to produce a synthesis with what I'm actively avoiding, resulting in a fantasy or parody of what I'm trying to make.
Secondly, suppose it became possible to make it hallucinate IN the relevant style, perhaps by training exclusively on my output, so it can spin off variations. That's not so far-fetched: _I_ do that. But where'd the style come from, that you'd spend effort tuning the LLM to? Does it initiate this on its own? Would you let it 'hallucinate' in that direction in the belief that maybe it was on to something? No, it doesn't look like that's a thing. When I've played with LLMs (I have a Mac Studio set up with enough RAM to do that) it's been trying to explore what the thing might do outside of expectation, and it's hard to get anything interesting that doesn't turn out to be a rip from something I didn't know about, but it was familiar with. Not great to go 'oh hey I made it innovate!' when you're mistakenly ripping off an unknown human's efforts. I've tried to explore what you might call 'native hallucination', stuff more inherent to collective humanity than to an individual, and I'm not seeing much facility with that.
Not that people are even looking for that!
And lastly, as a human trying to explore an unusual position in audio DSP code with many years of practice attempting these things and sharing them with the world around me only to have Microsoft try to reduce me to a nutrient slurry that would add a piquant flavor to 'writing code for people', I turn around and find Google, through YouTube, repeatedly offering to speak FOR me in response to my youtube commenters. I'm sure other people have seen this: probably depends on how interactive you are with your community. YouTube clearly trains a custom LLM on my comment responses to my viewers, that being text they have access to (doubtless adding my very verbose video footnotes) to the point that they're regularly offering to BE ME and save me the trouble.
Including technical explanations and helpful suggestions of how to use my stuff, that's not infrequently lies and bizarro world interpretations of what's going on, plus encouraging or self-congratulatory remarks that seem partly drawn from known best practices for being an empty hype beast competing to win the algorithm.
I'm not sure whether I prefer this, or the supposed promise of the machines.
If it can't be any better than this, I can keep working as I am, have my intentionality and a recognizable consistent sound and style, and be full of sass and contempt for the machines, and that'll remain impossible for that world to match (whether they want to is another question… but purely in marketing terms, yes they'll want to because it'll be a distinct area to conquer once the normal stuff is all a gray paste)
If it follows the path of the YouTube suggestions, there will simply be more noise out there, driven by people trying to piggyback off the mindshare of an isolated human doing a recognizable and distinct thing for most of his finite lifetime, with greater and greater volume of hollow mimicry of that person INCLUDING mimicry of his communications and interpersonal tone, the better to shunt attention and literal money to, not the LLMs doing the mimicking, but a third party working essentially in marketing, trying to split off a market segment they've identified as not only relevant, but ripe for plucking because the audience self-identifies as eager to consume the output of something that's not usual and normal.
(I should learn French: that rant is structurally identical to an endlessly digressive French expostulation)
Today I'm doing a livestream, coding with a small audience as I try for the fourth straight day to do a particular sort of DSP (decrackling) that's previously best served by some very expensive proprietary software costing over two thousand dollars for a license. Ideally I can get some of the results while also being able to honor my intentions for preserving the aspects of the audio I value (which I think can be compromised by such invasive DSP). That's because my intention will include this preservation, these iconoclastic details I think important, the trade-offs I think are right.
Meanwhile crap is trained on my work so that a guy who wants money can harness rainforests worth of wasted electrical energy to make programs that don't even work, and a pretend scientist guru persona who can't talk coherently but can and will tell you that he is "a real audio hero who's worked for many years to give you amazing free plugins that really defy all the horrible rules that are ruining music"!
Because this stuff can't pay attention, but it can draw all the wrong conclusions from your tone.
And if you question your own work and learn and grow from your missteps to have greater confidence in your learned synthesis of knowledge, it can't do that either but it can simultaneously bluster with your confidence and also add 'but who knows maybe I'm totally wrong lol!'
And both are forms of lies, as it has neither confidence nor self-doubt.
I'm going on for longer than the original article. Sorry.
> The clever LLM scrapers sound a lot like residential traffic. How does the author know it's not?
I think you cannot distinguish. But the issue is so large that Google now serves captchas on legitimate traffic, sometimes after the first search term if you narrow down the time window (less than 24 hours).
I wonder when real Internet companies will feel the hurt, simply because consumers will stop using the ruined Internet.
I run a small browser game—roughly 150 unique weekly active users.
Our Wiki periodically gets absolutely hammered by LLM scraper bots, rotating IP addresses like mad to avoid mitigations like fail2ban (which I do have in place). And even when they're not hitting it hard enough to crash the game (through the external data feeds many of the wiki pages rely on), they're still scraping pretty steadily.
There is no earthly way that my actual users are able to sustain ~400kbps outbound traffic round the clock.
> If you personally work on developing LLMs et al, know this: I will never work with you again, and I will remember which side you picked when the bubble bursts.
After using Claude code for an afternoon, I have to say I don't think this bubble is going to burst any time soon.
I think there is a good chance of the VC/Startup side of the bubble bursting.
However, I think we will never go back to a time without using LLMs, given that you can run useful open-weights models on a local laptop.
Yeah I totally agree, I also run phi4 mini locally and was thoroughly impressed. The genie is out of the bottle.
share your settings and system specs please, I haven't seen anything come out of a local LLM that was useful.
if you don't, since you're using a throwaway handle, I'll just assume you're paid to post. it is a little odd that you'd use a throwaway just to post LLM hype.
is that you, Sam?
Happy to post mine (which is also not behind throwaway handle).
Machine: 2021 Macbook Pro with M1 Max 32GB
LLMs I usually use: Qwen 2.5 Coder 7B for coding and the latest Mistral or Gemma in the 4B-7B range for most other stuff
For interactive work I still use mostly Cursor with Claude, but for scripted workflows with higher privacy requirements (and where I don't want to be hit with a huge bill due to a rogue script), I also regularly use those models.
If you are interested in running stuff locally, take a look at /r/LocalLLaMA [0], which usually gives good insights into what's currently viable to run locally for which use cases. A good chunk of the power-users there are using dedicated machines for it, but a good portion are in the same boat as me, trying to run whatever fits on their existing local machine, where I would estimate the coding capabilities to lag ~6-9 months behind the SOTA big models (which is still pretty great).
[0]: https://www.reddit.com/r/LocalLLaMA
Not Sam. I am running it with ollama on a server on my LAN with two 7900 XTs. I get about 50-60 tokens per second on phi4-mini at full precision; it only loads onto a single card.
The few requests I tried were correct, though I think phi4, the 14B-parameter model, produced better code. I don't recall exactly what it was; it was rather simple stuff.
QwQ seems to produce okay code as well, but with only 40GB of VRAM I can only use about an 8k context with 8-bit quantization.
> I haven't seen anything come out of a local LLM that was useful.
By far the most useful case for me is when I want to do something in a REPL or the shell and only vaguely remember how the library or command works: I just ask it to write the command for me instead of reading through the manual or docs.
That’s funny because after using Cursor with Claude for a month at work at the request of the CTO, I have found myself reverting to neovim and am more productive. I see the sliver of value but not for complex coding requirements.
It sucks to say, but maybe the best solution is to pay Cloudflare or another fancy web security/CDN company to get rid of this problem for you?
Which is extremely dumb, but when the alternatives are “10x capacity just for stupid bots” or “hire a guy whose job it is to block LLMs”… maybe that’s cheapest? Yes, it sucks for the open web, but if it’s your livelihood I probably would consider it.
(Either that or make following robots.txt a legal requirement, but that feels also like stifling hobbyists that just want to scrape a page)
> Either that or make following robots.txt a legal requirement [...]
A legal requirement in what jurisdiction, and to be enforced how and by whom?
I guess the only feasible legislation here is something where the victim pursues a case with a regulating agency or just through the courts directly. But how does the victim even find the culprit when the origin of the crawling is being deliberately obscured, with traffic coming from a botnet running on exploited consumer devices?
It wouldn't have to go that deep. If we made not following robots.txt illegal in certain jurisdictions, and blocked all IP addresses not from those jurisdictions, then there would presumably have to be an entity in those jurisdictions, such as a VPN provider, an illegal botnet, or a legal botnet, and you could pursue legal action against it.
The VPNs and legal botnets would be heavily incentivized not to allow this to happen (and presumably are already doing traffic analysis), and illegal botnets should be shut down anyway (some grace in the law for being unaware of it happening should of course be afforded, but once you are aware, it is your responsibility to prevent your machine from committing crimes).
Illegal botnets aren't new. Are they currently shut down regularly? (I'm actually asking, I don't know.)
> If we made not following robots.txt illegal in certain jurisdictions, and blocked all IP addresses not from those jurisdictions
That sounds kinda like the balkanization of the internet. It's not without some cost. I don't mean financially, but in terms of eroding the connectedness that is supposed to be one of the internet's great benefits.
Maybe people need to add deliberate traps on their websites. You could imagine a provider like Cloudflare injecting a randomly generated code phrase into thousands of sites, attributing it under a strict license, rendering it invisible so that no human sees it, and changing it every few days. Presumably LLMs would learn this phrase and later be able to repeat it; getting a sufficiently high hit rate would be proof that they used illegitimately obtained data. Kinda like back in the old days when map makers included fake towns, rivers and so on in their maps, so that if others copied them they could tell.
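A toy version of that trap (the secret and wordlist are placeholders): derive the phrase from a secret plus the current ISO week so it rotates on its own, then serve it in markup no human will ever see.
```python
import hmac, hashlib, datetime

SECRET = b"replace-with-a-real-secret"   # placeholder
WORDS = ["umbra", "quill", "basalt", "lantern", "myrrh", "cobalt", "fathom", "sprocket"]

def canary_phrase(when=None) -> str:
    """Deterministic phrase that changes every ISO week."""
    when = when or datetime.date.today()
    iso = when.isocalendar()
    tag = f"{iso[0]}-{iso[1]}"  # ISO year and week number
    digest = hmac.new(SECRET, tag.encode(), hashlib.sha256).digest()
    return " ".join(WORDS[b % len(WORDS)] for b in digest[:5])

# Embed invisibly, e.g.: <span style="display:none">{canary_phrase()}</span>
# If a model later completes this phrase at above-chance rates, it saw your pages.
```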
We should start generating nonsense information to feed these hungry crawlers.
The problem is that they will keep working to find a way to mimic human behaviour to get the real data.
I feel like I recall someone recently built a simple proof of work CAPTCHA for their personal git server. Would something like that help here?
Alternatively, a technique like Privacy Pass might be useful here. Basically, give any user a handful of tokens based on reputation / proof of work, and then make each endpoint require a token to access. This gives you a single endpoint to rate limit, and doesn’t require user accounts (although you could allow known-polite user accounts a higher rate limit on minting tokens).
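A bare-bones sketch of the proof-of-work half of that (parameters invented for illustration): the server hands out a nonce and a difficulty, the client grinds until a hash has enough leading zero bits, and verification costs the server a single hash.
```python
import hashlib, os, itertools

def issue_challenge(difficulty_bits: int = 20):
    """Server side: a random nonce plus the required number of leading zero bits."""
    return os.urandom(16).hex(), difficulty_bits

def _ok(nonce: str, counter: int, bits: int) -> bool:
    digest = hashlib.sha256(f"{nonce}:{counter}".encode()).digest()
    value = int.from_bytes(digest, "big")
    return value >> (256 - bits) == 0          # first `bits` bits are zero

def solve(nonce: str, bits: int) -> int:
    """Client side: expected ~2**bits hashes of work."""
    for counter in itertools.count():
        if _ok(nonce, counter, bits):
            return counter

def verify(nonce: str, counter: int, bits: int) -> bool:
    """Server side: one hash, regardless of difficulty."""
    return _ok(nonce, counter, bits)
```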
Here we get to the original sin of packet networking.
The ARPANET was never meant to be commercial or private. All the protocols developed for it were meant to be subsidized by universities, the government or the military, with the names, phone numbers, and addresses of anyone sending a packet being public knowledge to anyone else in the network.
This made sense for the time since the IMPs used to send packets had less computing power than an addressable LED today.
Today the average 10 year old router has more computing power than was available in the whole world in 1970, but we've not made any push to move to protocols that incorporate price as a fundamental part of their design.
Worse is that I don't see any way that we can implement this. So we're left with screeds by people who want information to be free, but who get upset when they find out that someone has to pay for information.
Would you be rate-limiting by IP? Because the attacker is using (nearly) unique IPs for each request, so I don't see how that would help.
> someone recently built a simple proof of work CAPTCHA for their personal git server
As much as everyone hates CAPTCHAs nowadays, I think this could be helpful if it was IP-oblivious, random and the frequency was very low. E.g., once per 100 requests would be enough to hurt abusers making 10000 requests, but would cause minimal inconvenience to regular users.
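As a sketch, that policy is tiny. With the commenter's numbers, an abuser making 10,000 requests expects roughly 100 challenges while a person making 20 requests expects about 0.2:
```python
import random

CHALLENGE_PROBABILITY = 1 / 100   # "once per 100 requests", per the comment above

def should_challenge() -> bool:
    """IP-oblivious and random: every request faces the same small chance."""
    return random.random() < CHALLENGE_PROBABILITY
```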
I didn’t mention IP at all in my post. You can globally rate limit anonymous requests for everything (except maybe your login endpoint), if that’s the thing that makes sense for you.
The nice thing about the proof-of-work approach is that it can be backgrounded for users with normal browsers, just like link prefetching.
> Obviously you can do this, but that will start blocking everyone. How is that a solution?
Two corrections: It will start rate-limiting (not blocking) anonymous users (not everyone). It's a mitigation. If specific logged-in users are causing problems, you can address them directly (rate limit them, ban them, charge them money, etc). If nonspecific anonymous users are causing problems, you can limit them as a group, and provide an escape hatch (logging in). If your goal is to "free access to everyone except people I don't like but I can't ask people if they are people who I don't like", well, I suppose it isn't a good mitigation for you, sorry.
> Also what prevents attacker-controlled browsers from backgrounding the PoW too?
The cost. A single user can complete the proof of work in a short period of time, costing them a few seconds of CPU cycles. Scaling this up to an industrial-scale scraping operation means that the "externalized costs" that the OP was talking about are duplicated as internalized costs in the form of useless work that must be done for you to accept their requests.
> If your goal is to "free access to everyone except people I don't like but I can't ask people if they are people who I don't like", well, I suppose it isn't a good mitigation for you, sorry.
Ah. Yes, the part in quotes here is what I think would count as a solution -- I've been assuming that simply steering anonymous users towards logging in would be the obvious thing to do otherwise, and that doing this is unacceptable for some reason. I was hoping that, despite attackers dispersing themselves across IP addresses, there would either be (a) some signal that nevertheless identifies them with reasonable probability (perhaps Referer headers, or their absence in deeply nested URL requests), or (b) some blanket policy that can be enforced which will hurt everyone a little but hurt attackers more (think chemotherapy).
> It will start rate-limiting (not blocking) anonymous users (not everyone).
If some entities (attackers) are making requests at 1000x the rate that others (legitimate users) are, the effect in practice of rate-limiting will be to block the latter nearly all the time.
> Scaling this up to an industrial-scale scraping operation
My understanding was that the PoW would be done in-browser, in which case this doesn't hold -- the attackers would simply use the multitudes of residential browsers they already control to do the PoW prior to making the requests, thus perfectly distributing that workload to other people's computers. What kind of PoW cannot be done in this way?
> My understanding was that the PoW would be done in-browser, in which case this doesn't hold -- the attackers would simply use the multitudes of residential browsers they already control to do the PoW prior to making the requests, thus perfectly distributing that workload to other people's computers. What kind of PoW cannot be done in this way?
I could be mistaken, but I don't think these residential VPN services are actual botnets. You can use the connection, but not the browser. In any case, you can scale the work factor as you want, making "unlikely" endpoints harder to access (e.g. git blame for an old commit might be 100x harder to prove than the main page of a repository). This doesn't make it impossible to scrape your website, it makes it more expensive to do so, which is what the OP was complaining about ("externalizing costs onto me").
All in all, it feels like there's something here to leverage proof of work as a way to maintain anonymous access while still limiting your exposure to excessive scrapers. It probably isn't a one-size-fits-all solution, but with some domain-specific knowledge it feels like it could be a useful tool to have in the new internet landscape.
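Scaling the work factor per endpoint, as suggested above, could be as simple as a lookup table layered on the earlier proof-of-work sketch (the numbers here are invented for illustration): each extra bit doubles the expected client-side cost while server-side verification stays at one hash.
```python
# Hypothetical difficulty table to use with the solve()/verify() sketch above.
DIFFICULTY_BITS = {
    "repo_index":   16,   # ~65k expected hashes: cheap for a real visitor
    "git_log_page": 20,   # ~1M expected hashes
    "git_blame":    23,   # ~8M expected hashes: "unlikely" endpoints cost the most
}

def expected_hashes(endpoint: str) -> int:
    """Expected client-side work for an endpoint, defaulting to 18 bits."""
    return 2 ** DIFFICULTY_BITS.get(endpoint, 18)
```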
> You can use the connection, but not the browser.
Fair enough, that would likely be the case if they're using "legitimate" residential IP providers, and in that case they would indeed need to pay for the PoW themselves somehow.
Is there currently any way to identify ethical (or I guess, semi-ethical or even not actively unethical..) AI companies?
Would be nice to see some effort on this front, a la "we don't scrape", or "we use this reputable dataset which scrapes while respecting robots.txt and reasonable caching", or heaven forbid "we train only on a dataset based on public domain content"
Even if it was an empty promise (or goodwill ultimately broken) it'd be _something_. If it exists, I'd certainly prioritise any products which proclaimed it
(Posting as separate comment because wall of text)
Also - honestly, I don't even understand why so many people would need to scrape for training data.
Is it naïveté (thinking it necessary), arrogance (thinking it better than others, and thus justified)?
Aren't most advances now primarily focused on either higher level (agents, layered retrieval) or lower level (eg. alternatives to transformers, etc.. which would be easier to prove useful on existing datasets)?
Genuine questions, all of these - if I'm off the mark I'm keen to learn!
If you personally work on developing LLMs et al, know this: I will never work with you again, and I will remember which side you picked when the bubble bursts.
Does this statement have ramifications for sourcehut and the type of projects allowed there? Or is it merely personal opinion?
Have there been efforts to set up something like the SMTP block lists for web scraping bots? Seems like this is something that needs to be crowd sourced. It'll be much easier to find patterns in larger piles of data. A single real user is unlikely to do as much as quickly as a bot.
Those are largely ineffective in the modern world.
Most botnets, like he mentions in the blog post, come from one IP address only one time. With thousands to tens of thousands of IP addresses in the botnet, there is just no way to block by IP address anymore.
just like with crypto, AI culture is fundamentally tainted (see this very thread where people are defending this shit). Legislation hasn’t caught up and doing what’s essentially a crime (functionally this is just a DDoS) is being sold as a cool thing to do by the people who don’t foot the bill.
If you ever ask yourself why everybody else hates you and your ilk, this attitude is why.
The thing I don’t understand with AI though is that without the host, your parasite is nothing. Shouldn’t it be in your best interest to not kill it?
I hear you, and there are even greater concerns than the “you kids get off of my lawn” vibe, e.g. the massive amounts of water required for cooling data centers.
But the “bubble” threat the post mentions is just emotional; things are accelerating so quickly that what is hype one day isn’t hype in a few months. LLM usage is just going to get heavier.
What should get better are filters that prevent bots from easily scraping content. Required auth could work. Yes, it breaks things. Yes, it may kill your business. If you can’t deal with that, find another solution, which might be crowdfunding, subscription, or might be giving up and moving on.
Working towards a solution makes more sense than getting angry and ranting.
The playing field is very asymmetric. Can you negotiate with these guys? No. They're faceless when it comes to their data operations. Secretive, covert even.
Creating a solution requires cooperation. Making these models fit on smaller systems and optimizing them to work with fewer resources needs R&D engineering, which no company wants to spend on right now. Because hype trains need to be built, monies need to be made, moats need to be dug (which is almost impossible, BTW).
My 10-year-old phone can track 8 people in the camera app in real time, with no lag. My A7 III can remember faces, prioritize them, focus on mammals' and humans' eyes, track them and keep them in focus at 30FPS with a small DSP.
Building these things is possible, but nobody cares about it right now, because AI datacenters live in a ZIRP-like economy of their own. They're almost free for what they do, even though they harm the environment in irreparable ways just to give you something which is not right.
People’s basic needs are food and water, followed by safety, belonging, esteem, and finally self-actualization. In addition, convenience over non-convenience.
With LLMs come a promise of safety in that they will solve our problems, potentially curing disease and solving world hunger. They can sound like people that care and make us feel respected and useful, making us feel like we belong and boosting our esteem. They do things we can’t do and help us actualize our dreams, in written code and other things. We don’t have to work hard to use them anymore, and they’re readily available.
We have been using models, and accurate ones at that, for drug discovery, nature simulation, weather forecasting, ecosystem monitoring, etc. since well before LLMs with their nondescript chat boxes arrived. AI was there; the hardware was not. Now we have the hardware, and the AI world is much richer than generative image models and stochastic parrots, but we live in a world where we assume that the noisiest thing is the best and most worthy of our attention. It's not.
The only thing is, LLMs look like they can converse, while giving back worse results than the models we already have, or could develop, without conversation capabilities.
LLMs are just shiny parrots which can taunt us from a distance, and look charitable while doing that. What they provide is not correct information, but biased Markov chains.
It only proves that humans are as gullible as other animals. We are drawn to shiny things, even if they harm us.
I would like to add that European LLM projects built from openly available sources do exist and were previously discussed here on HN. They're much smaller models than the top LLMs other countries have made, but IIRC they're made ethically (and are open source).
The culture of "fuck you got mine" is something i despise on a very deep level, not just including ai but most big tech shitfuckery that has been going on since forever
please read my post again. crypto culture is tainted. cryptocurrency is just tech, just as ML is just tech. crypto culture however is rug pulling, rampant scams etc.
Email, again, is just tech and there is no real email culture around (mailing lists?)
I see a future where there are many more digital gated communities. I don’t mean walled gardens in the old sense like AOL or Facebook. I mean people will create private networks with wireguard or something similar. All the services are on the inside. Discoverability is sacrificed to achieve security and privacy.
I don't think that any CDN is offering adequate protection from these attacks. I wish they did. The only solution I see remains the same -- remove all your stuff from the open web. That's what I've done.
I'm constantly looking for some way that I can put everything back up, but nothing has presented itself yet. Plus, if OpenAI's attempts at legalizing what they do are successful, it seems unlikely there will ever be an adequate solution.
It doesn't matter if it's fraud, "AI" is now considered an arms race, if we require them to fairly acquire their content we'll fall behind China or another country and then America might lose the WWIII it is constantly preparing for.
You might see a couple of small players or especially egregious executives get a slap on the wrist for bad behavior but in this political climate there's no chance that Republicans or Democrats will put a stop to it.
What's surprising to me is that established companies (OpenAI, Anthropic, etc) would resort to such aggressive scanning to extract a profit. Even faking the user agent and using residential IPs is really scummy.
They're doing something awesome, why wouldn't they (in their own words), I ask. What they do boils my blood, honestly.
Assuming that established companies are automatically ethical and just is not correct. Meta used a "laptop and a client tuned to seed as minimally as possible to torrent 81TBs of copyrighted material" to train their models.
For every picture-perfect and shiny facade, there's an engineer sitting on the floor in a corner, hacking things to prop up that facade.
> What problem would you have with me doing that with your comment?
Nothing.
Addenda:
However, while I'll be putting that comment in my favorite-screenshots folder and letting it adorn my cubicle for some time, there's also an implicit meaning. With that wording, the subtext is "That's brave". [0]
It's not meant to be an insult, but to signify a greatly different point of view compared to OP (me, in this case).
While the tone may have come across as angry, slightly agitated is a more accurate description of my feelings.
Either way, it's fair game and no hard feelings. Also, the answer from the original commenter is good sport. Thanks mate!
I see. I didn’t take it quite so literally on first read. It initially read (and still does if I want it to, I guess) like people being upset about reddit using their posts for AI training, which I’ve never understood. The second sentence in particular.
> ...people being upset about reddit using their posts for AI training,...
I'm one of these people, and I'm not visiting or contributing to Reddit anymore. My logic is simple: "Ask me first". I may say yes, I may say no, but ask me first.
I'm bullish on ethics and transparency. I work on FAIR Data at work, as a side quest. Transparency, consent and licenses are a big part of it. Honoring them is important at every level, both ethically and for integrity reasons.
As a result, when someone does something behind my back without asking me first, I leave when I find out. I was this close to abandoning Go when they first declared that telemetry would be opt-out. Now it's opt-in, and so I still use Go.
It's not that I don't send usage data, either. Some applications and parties I trust get telemetry from me, but all of them were opt-in and asked me first.
In short, a bit of decency and agency goes a long way in my eyes.
I can only rejoice a bit at the fact that their models can and will be distilled, and perhaps copied in a few months into other, less expensive and maybe open-source models.
> All of my sysadmin friends are dealing with the same problems.
I'm seeing this as well. Some of the websites my company operates suddenly see a 10-25x jump in requests per minute at night. Most often the AS the IP comes from belongs to some Microsoft datacenter (meaning it's most likely OpenAI).
At this point I'm starting to consider blocklisting Microsoft AS numbers and/or requiring a login (or some kind of API key) for requests coming from a datacenter network. Reddit already does this, for example.
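A minimal sketch of the "require a login when the request comes from a datacenter network" idea. The CIDR ranges below are placeholders; in practice you would load the cloud providers' published ranges or an IP-to-ASN database.
```python
import ipaddress

# Placeholder ranges; real deployments would load published cloud/ASN lists.
DATACENTER_NETWORKS = [
    ipaddress.ip_network("20.0.0.0/8"),      # example stand-in for a Microsoft block
    ipaddress.ip_network("34.64.0.0/10"),    # example stand-in for a Google Cloud block
]

def needs_login(client_ip: str) -> bool:
    """True if the client IP falls inside a known datacenter network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in DATACENTER_NETWORKS)

# In a request handler (framework-agnostic pseudocode):
# if needs_login(request.remote_addr) and not request.user:
#     return redirect("/login")
```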
> If you personally work on developing LLMs et al, know this: I will never work with you again, and I will remember which side you picked when the bubble bursts.
The solution here is simple: require user logins for data retrieval. Each user gets a quota as per a standard bucket algorithm. Don't overcomplicate a non-issue.
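A sketch of that per-user quota (the numbers are arbitrary), i.e. a standard token bucket keyed on the logged-in account:
```python
import time

class TokenBucket:
    """Classic token bucket: `rate` tokens per second, holding at most `capacity`."""
    def __init__(self, rate: float = 1.0, capacity: int = 60):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.updated = float(capacity), time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

_buckets = {}  # user_id -> TokenBucket

def request_allowed(user_id: str) -> bool:
    """Check and consume one token from the user's quota."""
    bucket = _buckets.setdefault(user_id, TokenBucket())
    return bucket.allow()
```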
requiring logins for /everything/ would close up the web. Now... requiring a login for viewing git blame (which is resource intensive) might be a good idea...
All things unregulated eventually turn into a Mogadishu-style ecosystem where parasitism is ammunition. Open source projects are the plants of this brave new disgusting world, to be grazed upon by the animals that suppressed the plants' protection system, aka the state.
Once the allmende, the common grass, runs out, things get interesting. We shall see cyberpunk-style computational parasitism plaguing companies, and attempts to filter that work out. I guess that is the only way to really prevent this sort of bot: you take arbitrary pieces of the unit of work they want done and reject them on principle. And depending on the cleverness of their batch algorithm, they will come back with it again and again, identifying themselves via repetition.
You're assuming it would make a difference: they're already just hoovering up 'usual stuff' and splicing it all sorts of ways. For these purposes there's no difference between 'deliberately incorrect' and 'just not very good at its job'.
No point in taking effort to recreate what's already out there as authentic not-goodery :)
I work for a hosting company and I know what this is like. And, while I completely respect that you don't want to give out your resources for free, a properly programmed and/or cached website wouldn't be brought down by crawlers, no matter how aggressive. Crawlers are hitting our clients' sites all the same, but you only hear problems from those who have piece-of-shit websites that take seconds to generate.
I guess for computationally expensive things the only real option is to put it behind a login. I’m sure this is something SourceHut doesn’t want to do but maybe it’s their only decent option.
On SourceHut, git blame is available while logged out, but the link to blame pages is only shown to logged-in users. That may be a recent change to fight scrapers.
Precomputing git blame should take the same order of magnitude of storage as the repository itself. Same number of lines for each revision, same number of lines changed in every commit.
>Please stop legitimizing LLMs or AI image generators or GitHub Copilot or any of this garbage. I am begging you to stop using them, stop talking about them, stop making new ones, just stop.
Anyone with any inkling of the current industry understands this is /Old-Man-Yells-At-Cloud. Why this is even offered as a solution, I'm not sure.
Here's potential solutions:
1) Cloudflare.
2) Own DDoS protecting reverse proxy using captcha etc.
3) Poisoning scraping results (with syntax errors in code lol), so they ditch you as a good source of data.
to stay with the analogy, people won’t wear the stab-proof vest or at least not stop there. People will stab back.
The magnitude of the attack by bad actors in the AI space is potentially life ruining and they shouldn’t expect not to have their lives ruined in retaliation if this keeps up.
I disagree with the premise of this article, which is: the internet used to be so nice when it wasn't the center of humanity's activities but now everything sucks.
I read this: "I am sick and tired of having all of these costs externalized directly into my fucking face", as: "I am sick and tired of doing something for the world and the world not appreciating it because they want to do their own things".
Lack of appreciation is underrated as a problem because it's abstracted away by the mechanics of the free market ("if you don't want it, just don't buy it"). Yet markets are quite foundational to humanity's activities, and a market is a place where everybody can put up a stall. Now Drew can't have his stall and he's angry. But anger is rarely part of the solution.
His path is that of anger, unfortunately, with some hate added in too:
> "If you personally work on developing LLMs et al, know this: I will never work with you again, and I will remember which side you picked when the bubble bursts."
I have a small static site. I haven't touched it in a couple of years.
Even then, I see bot after bot, pulling down about 1/2 GB per day.
Like, I distribute Python wheels from my site, with several release versions X several Python versions.
I can't understand why ChatGPT, PetalBot, and other bots want to pull down wheels, much less the full contents when the header shows it hasn't changed:
Well, I know the answer to the second question, as DeVault's title highlights - it's cheaper to re-read the data and re-process the content than set up a local cache.Externalizing their costs onto me.
I know 1/2 GB/day is not much. It's well under the 500 GB/month I get from my hosting provider. But again, I have a tiny site with only static hosting, and as far as I can tell, the vast majority of transfers from my site are worthless.
Just like accessing 'expensive endpoints like git blame, every page of every git log, and every commit in every repo' seems worthless.
This is a pet peeve of Rachel by the Bay. She sets strict limits on her RSS feed based on not properly using the provided caching headers. I wonder if anyone has made a WAF that automates this sort of thing.
I'm pretty sure:
"I have to review our mitigations several times per day to keep that number from getting any higher. When I do have time to work on something else, often I have to drop it when all of our alarms go off because our current set of mitigations stopped working. ... it’s not long before we’re complaining about the bots and asking if the other has cracked the code to getting rid of them once and for all."
means that no one has managed it.
Given that they're actively trying to obfuscate their activity (according to Drew's description), identifying and blocking clients seems unlikely to work. I'd be tempted to de-prioritize the more expensive types of queries (like "git blame") and set per repository limits. If a particular repository gets hit too hard, further requests for it will go on the lowest-priority queue and get really slow. That would be slightly annoying for legitimate users, but still better than random outages due to system-wide overload.
BTW isn't the obfuscation of the bots' activity a tacit admission by their owners that they know they're doing something wrong and causing headaches for site admins? In the copyright world that becomes wilful infringement and carries triple damages. Maybe it should be the same for DoS perpetrators.
Just to clarify, my understanding is that she doesn't block user agent strings, she blocks based on IP and not respecting caching headers (basically, "I know you already looked at this resource and are not including the caching tags I gave to you"). It's a different problem than the original article discusses, but perhaps more similar to @dalke's issue.
I don't understand the thing about the cache. Presumably they have a model that they are training, that must be their cache? Are they retraining the same model on the same data on the basis that that will weigh higher page ranked pages higher or something? Or is this about training slightly different models?
If they are really just training the same model, and there's no benefit to training it multiple times on that data, then presumably they could use a statistical data structure like https://en.wikipedia.org/wiki/HyperLogLog to be check if they've trained on the site before based on the Last-Modified header + URI? That would be far cheaper than a cache, and cheaper than rescraping.
I was also under the impression that the name of the game with training was to get high quality, curated training sets, which by their nature are quite static? Why are they all still hammering the web?
Good rant!
The clever LLM scrapers sound a lot like residential traffic. How does the author know it's not? Behavior? Volume? Unlikely coincidence?
> random User-Agents that overlap with end-users and come from tens of thousands of IP addresses – mostly residential, in unrelated subnets, each one making no more than one HTTP request over any time period we tried to measure – actively and maliciously adapting and blending in with end-user traffic and avoiding attempts to characterize their behavior or block their traffic.
There are commercial services that provide residential proxies, i.e. you get to tunnel your scraper or bot traffic through actual residential connections. (see: Bright Data, oxylabs, etc.)
They accomplish this by providing home users with some app that promises to pay them money for use of their connection. (see: HoneyGain, peer2profit, etc.)
Interestingly, the companies selling the tunnel service to companies and the ones paying home users to run an app are sometimes different, or at least they use different brands to cater to the two sides of the market. It also wouldn't surprise me if they sold capacity to each other.
I suspect some of these LLM companies (or the ones they outsource to capture data) some of their traffic through these residential proxy services. It's funny because some of these companies already have a foothold inside homes (Google Nest and Amazon Alexa devices, etc.) but for a number of reasons (e.g. legal) they would probably rather go through a third-party.
They can be local LLMs doing search, some SETI@Home style distributed work, or else.
I host my blog at Mataroa, and the stats show 2K hits for some pages on some days. I don't think my small blog and writings are that popular. Asking ChatGPT about these subjects with my name results in some hilarious hallucinations, pointing to poor training due to lack of data.
IOW, my blog is also being scraped and used to train LLMs. Not nice. I really share the sentiment with Drew. I never asked/consented for this.
> the stats show 2K hits for some pages on some days
This has been happening since long before LLMs. Fifteen years ago, my blog would see 3k visitors a day on server logs, but only 100 on Google Analytics. Bots were always scraping everything.
> I never asked/consented for this.
You put it on the information super highway baby
> You put it on the information super highway baby
...with proper licenses.
Here, FTFY, baby.
Addenda: Just because you don't feel like obeying/honoring them doesn't make the said licenses moot and toothless. I mean, if you know, you know.
Between Microsoft and Google, my existence AND presence as a community open source developer is being scraped and stolen.
I've been trying to write a body of audio code that sounds better than the stuff we got used to in the DAW era, doing things like dithering the mantissa of floating-point words, just experimental stuff ignoring the rules. Never mind if it works: I can think it does, but my objection holds whether it does or not.
Firstly, if you rip my stuff halfway it's pointless: without the coordinated intention towards specific goals not corresponding with normally practiced DSP, it's useless. LLMs are not going to 'get' the intention behind what I'm doing while also blending it with the very same code I'm a reaction against, the code that greatly outnumbers my own contributions. So even if you ask it to rip me off it tries to produce a synthesis with what I'm actively avoiding, resulting in a fantasy or parody of what I'm trying to make.
Secondly, suppose it became possible to make it hallucinate IN the relevant style, perhaps by training exclusively on my output, so it can spin off variations. That's not so far-fetched: _I_ do that. But where'd the style come from, that you'd spend effort tuning the LLM to? Does it initiate this on its own? Would you let it 'hallucinate' in that direction in the belief that maybe it was on to something? No, it doesn't look like that's a thing. When I've played with LLMs (I have a Mac Studio set up with enough RAM to do that) it's been trying to explore what the thing might do outside of expectation, and it's hard to get anything interesting that doesn't turn out to be a rip from something I didn't know about, but it was familiar with. Not great to go 'oh hey I made it innovate!' when you're mistakenly ripping off an unknown human's efforts. I've tried to explore what you might call 'native hallucination', stuff more inherent to collective humanity than to an individual, and I'm not seeing much facility with that.
Not that people are even looking for that!
And lastly, as a human trying to explore an unusual position in audio DSP code with many years of practice attempting these things and sharing them with the world around me only to have Microsoft try to reduce me to a nutrient slurry that would add a piquant flavor to 'writing code for people', I turn around and find Google, through YouTube, repeatedly offering to speak FOR me in response to my youtube commenters. I'm sure other people have seen this: probably depends on how interactive you are with your community. YouTube clearly trains a custom LLM on my comment responses to my viewers, that being text they have access to (doubtless adding my very verbose video footnotes) to the point that they're regularly offering to BE ME and save me the trouble.
Including technical explanations and helpful suggestions of how to use my stuff, that's not infrequently lies and bizarro world interpretations of what's going on, plus encouraging or self-congratulatory remarks that seem partly drawn from known best practices for being an empty hype beast competing to win the algorithm.
I'm not sure whether I prefer this, or the supposed promise of the machines.
If it can't be any better than this, I can keep working as I am, have my intentionality and a recognizable consistent sound and style, and be full of sass and contempt for the machines, and that'll remain impossible for that world to match (whether they want to is another question… but purely in marketing terms, yes they'll want to because it'll be a distinct area to conquer once the normal stuff is all a gray paste)
If it follows the path of the YouTube suggestions, there will simply be more noise out there, driven by people trying to piggyback off the mindshare of an isolated human doing a recognizable and distinct thing for most of his finite lifetime, with greater and greater volume of hollow mimicry of that person INCLUDING mimicry of his communications and interpersonal tone, the better to shunt attention and literal money to, not the LLMs doing the mimicking, but a third party working essentially in marketing, trying to split off a market segment they've identified as not only relevant, but ripe for plucking because the audience self-identifies as eager to consume the output of something that's not usual and normal.
(I should learn French: that rant is structurally identical to an endlessly digressive French expostulation)
Today I'm doing a livestream, coding with a small audience as I try for the fourth straight day to do a particular sort of DSP (decrackling) that's previously best served by some very expensive proprietary software costing over two thousand dollars for a license. Ideally I can get some of the results while also being able to honor my intentions for preserving the aspects of the audio I value (which I think can be compromised by such invasive DSP). That's because my intention will include this preservation, these iconoclastic details I think important, the trade-offs I think are right.
Meanwhile crap is trained on my work so that a guy who wants money can harness rainforests worth of wasted electrical energy to make programs that don't even work, and a pretend scientist guru persona who can't talk coherently but can and will tell you that he is "a real audio hero who's worked for many years to give you amazing free plugins that really defy all the horrible rules that are ruining music"!
Because this stuff can't pay attention, but it can draw all the wrong conclusions from your tone.
And if you question your own work and learn and grow from your missteps to have greater confidence in your learned synthesis of knowledge, it can't do that either but it can simultaneously bluster with your confidence and also add 'but who knows maybe I'm totally wrong lol!'
And both are forms of lies, as it has neither confidence nor self-doubt.
I'm going on for longer than the original article. Sorry.
This sounds pretty interesting. Can you share a link to your work or livestream?
> The clever LLM scrapers sound a lot like residential traffic. How does the author know it's not?
I think you cannot distinguish. But the issue is so large that Google now serves captchas on legitimate traffic, sometimes after the first search term if you narrow down the time window (less than 24 hours).
I wonder when real Internet companies will feel the hurt, simply because consumers will stop using the ruined Internet.
I run a small browser game—roughly 150 unique weekly active users.
Our Wiki periodically gets absolutely hammered by LLM scraper bots, rotating IP addresses like mad to avoid mitigations like fail2ban (which I do have in place). And even when they're not hitting it hard enough to crash the game (through the external data feeds many of the wiki pages rely on), they're still scraping pretty steadily.
There is no earthly way that my actual users are able to sustain ~400kbps outbound traffic round the clock.
> If you personally work on developing LLMs et al, know this: I will never work with you again, and I will remember which side you picked when the bubble bursts.
After using Claude code for an afternoon, I have to say I don't think this bubble is going to burst any time soon.
I think there is a good chance of the VC/Startup side of the bubble bursting. However, I think we will never go back to a time without using LLMs, given that you can run useful open-weights models on a local laptop.
Yeah I totally agree, I also run phi4 mini locally and was thoroughly impressed. The genie is out of the bottle.
share your settings and system specs please, I haven't seen anything come out of a local LLM that was useful.
if you don't, since you're using a throwaway handle, I'll just assume you're paid to post. it is a little odd that you'd use a throwaway just to post LLM hype.
is that you, Sam?
Happy to post mine (which is also not behind throwaway handle).
Machine: 2021 Macbook Pro with M1 Max 32GB
LLMs I usually use: Qwen 2.5 Coder 7B for coding and the latest Mistral or Gemma in the 4B-7B range for most other stuff
For interactive work I still use mostly Cursor with Claude, but for scripted workflows with higher privacy requirements (and where I don't want to be hit with a huge bill due to a rogue script), I also regularly use those models.
If you are interested in running stuff locally take a look at /r/LocalLLaMA [0] which usually gives good insights into what's currently viable for what use cases for running locally. A good chunk of the power-users there are using dedicated machines for it, but a good portion are in the same boat as me and trying to run whatever can be fit on their existing local machine, where I would estimate the coding capbilities to lag ~6-9 months in comparison to the SOTA big models (which is still pretty great).
[0]: https://www.reddit.com/r/LocalLLaMA
Not sam. I am running it with ollama on a server on my lan with two 7900XT. I get about 50-60 tokens per second on phi4-mini with full precision, it only loads on a single card.
The few requests I tried were correct, I think that phi4 the 14 b parameters model produced better code though. I don't recall what it was, it was rather simple stuff though.
QwQ seems to produce okay code as well, but with only 40GB of vram I can only use about 8k context with 8bit quantization.
It is a 6 year old throwaway.
In practice it only means it is meant to be anonymous and possible to throwaway.
Or it was meant for throwaway but they kept using it.
It was meant to be a throwaway but I kept it.
> I haven't seen anything come out of a local LLM that was useful.
By far the most useful use case for me, is when I want to do something in a repl or the shell, I only vaguely remember how the library or command I am about to use works and just ask it to write the command for me instead of reading through the manual or docs.
That’s funny because after using Cursor with Claude for a month at work at the request of the CTO, I have found myself reverting to neovim and am more productive. I see the sliver of value but not for complex coding requirements.
What did you find it useful for in this afternoon?
It sucks to say, but maybe the best solution is to pay Cloudflare or another fancy web security/CDN company to get rid of this problem for you?
Which is extremely dumb, but when the alternatives are “10x capacity just for stupid bots” or “hire a guy whose job it is to block LLMs”… maybe that’s cheapest? Yes, it sucks for the open web, but if it’s your livelihood I probably would consider it.
(Either that or make following robots.txt a legal requirement, but that feels also like stifling hobbyists that just want to scrape a page)
> Either that or make following robots.txt a legal requirement [...]
A legal requirement in what jurisdiction, and to be enforced how and by whom?
I guess the only feasible legislation here is something where the victim pursues a case with a regulating agency or just through the courts directly. But how does the victim even find the culprit when the origin of the crawling is being deliberately obscured, with traffic coming from a botnet running on exploited consumer devices?
It wouldn't have to go that deep. If we made not following robots.txt illegal in certain jurisdictions, and blocked all IP addresses not from those jurisdictions, then there would presumably have to be an entity in those jurisdictions, such as a VPN provider, an illegal botnet, or a legal botnet, and you could pursue legal action against it.
The VPNs and legal botnets would be heavily incentivized not to allow this to happen (and presumably are already doing traffic analysis), and illegal botnets should be shut down anyway (some grace in the law about being unaware of it happening should of course be afforded, but once you are aware, it is your responsibility to prevent your machine from committing crimes).
> illegal botnets should be shutdown anyway
Illegal botnets aren't new. Are they currently shut down regularly? (I'm actually asking, I don't know.)
> If we made not following robots.txt illegal in certain jurisdictions, and blocked all IP addresses not from those jurisdictions
That sounds kinda like the balkanization of the internet. It's not without some cost. I don't mean financially, but in terms of eroding the connectedness that is supposed to be one of the internet's great benefits.
Maybe people need to add deliberate traps on their websites. You could imagine a provider like Cloudflare injecting a randomly generated code phrase into thousands of sites, attributing it under a strict license, making it invisible so that no human sees it, and changing it every few days. Presumably LLMs would learn this phrase and later be able to repeat it; a sufficiently high hit rate would be proof that they used illegitimately obtained data. Kinda like back in the old days when mapmakers included fake towns, rivers, and so on in their maps so that if others copied them, they could tell.
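A minimal sketch of how such a rotating canary might be generated and embedded; the secret, word list, rotation period, and hidden-markup trick below are all illustrative assumptions, not anything Cloudflare actually ships:

```python
# Sketch: derive a rotating canary phrase from a secret and the current
# rotation window, then embed it in markup a human reader never sees.
# The secret, word list, and rotation period are illustrative assumptions.
import hmac, hashlib, time

SECRET = b"replace-with-a-real-secret"
WORDS = ["amber", "falcon", "quartz", "meadow", "cobalt", "lantern", "juniper", "drift"]
ROTATION_SECONDS = 3 * 24 * 3600  # rotate every three days

def canary_phrase(now: float | None = None) -> str:
    window = int((now or time.time()) // ROTATION_SECONDS)
    digest = hmac.new(SECRET, str(window).encode(), hashlib.sha256).digest()
    # Map the digest to a few dictionary words so the phrase survives tokenization.
    return " ".join(WORDS[b % len(WORDS)] for b in digest[:4])

def hidden_markup() -> str:
    # Invisible to humans, but present in the HTML a scraper ingests.
    return f'<span style="display:none" aria-hidden="true">{canary_phrase()}</span>'

if __name__ == "__main__":
    print(hidden_markup())
```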
We should start generating nonsense information to feed these hungry crawlers. The problem is that they will keep working to find ways to mimic human behaviour and get the real data.
I feel like I recall someone recently built a simple proof of work CAPTCHA for their personal git server. Would something like that help here?
Alternatively, a technique like Privacy Pass might be useful here. Basically, give any user a handful of tokens based on reputation / proof of work, and then make each endpoint require a token to access. This gives you a single endpoint to rate limit, and doesn’t require user accounts (although you could allow known-polite user accounts a higher rate limit on minting tokens).
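Roughly, the token half of that idea could look like the sketch below. It is a simplification: real Privacy Pass uses blinded tokens so the issuer cannot link minting to redemption, whereas this just HMAC-signs a random token ID; the function names and single-use bookkeeping are assumptions.

```python
# Sketch of mint-then-redeem access tokens: minting is the single endpoint
# you rate limit, redemption is cheap apart from replay tracking.
# Unlike real Privacy Pass, issuance here is linkable to redemption.
import hmac, hashlib, secrets

SERVER_KEY = secrets.token_bytes(32)   # regenerated per process; fine for a sketch
spent_tokens: set[str] = set()

def mint_token() -> str:
    """Called from the (rate-limited) minting endpoint."""
    token_id = secrets.token_hex(16)
    sig = hmac.new(SERVER_KEY, token_id.encode(), hashlib.sha256).hexdigest()
    return f"{token_id}.{sig}"

def redeem_token(token: str) -> bool:
    """Called on every protected endpoint; valid tokens are single-use."""
    try:
        token_id, sig = token.split(".")
    except ValueError:
        return False
    expected = hmac.new(SERVER_KEY, token_id.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected) or token_id in spent_tokens:
        return False
    spent_tokens.add(token_id)
    return True

if __name__ == "__main__":
    t = mint_token()
    print(redeem_token(t))   # True
    print(redeem_token(t))   # False -- replay rejected
```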
Here we get to the original sin of packet networking.
The ARPANET was never meant to be commercial or private. All the protocols developed for it were meant to be subsidized by universities, the government or the military, with the names, phone numbers, and addresses of anyone sending a packet being public knowledge to anyone else in the network.
This made sense for the time since the IMPs used to send packets had less computing power than an addressable LED today.
Today the average 10 year old router has more computing power than was available in the whole world in 1970, but we've not made any push to move to protocols that incorporate price as a fundamental part of their design.
Worse, I don't see any way we can implement this. So we're left with screeds by people who want information to be free, but get upset when they find out that someone has to pay for information.
You're likely thinking of Anubis from Techaro: https://github.com/TecharoHQ/anubis
Yes, thank you.
> This gives you a single endpoint to rate limit
Would you be rate-limiting by IP? Because the attacker is using (nearly) unique IPs for each request, so I don't see how that would help.
> someone recently built a simple proof of work CAPTCHA for their personal git server
As much as everyone hates CAPTCHAs nowadays, I think this could be helpful if it was IP-oblivious, random and the frequency was very low. E.g., once per 100 requests would be enough to hurt abusers making 10000 requests, but would cause minimal inconvenience to regular users.
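A sketch of what that could look like server-side, assuming the challenge is issued purely by chance rather than keyed to IP or session; the 1-in-100 rate and the redirect target are placeholders:

```python
# Sketch: IP-oblivious, random, low-frequency challenges. Each request has a
# 1-in-100 chance of being challenged regardless of who sent it, so a human
# making a handful of requests rarely sees one, while a scraper making
# thousands hits roughly 1% of them. The rate is an illustrative assumption.
import random

CHALLENGE_RATE = 1 / 100

def handle_request(request_path: str) -> str:
    if random.random() < CHALLENGE_RATE:
        return f"303 See Other -> /challenge?next={request_path}"  # serve CAPTCHA/PoW first
    return f"200 OK {request_path}"                                # serve content directly

if __name__ == "__main__":
    hits = sum(handle_request(f"/repo/blame/{i}").startswith("303") for i in range(10_000))
    print(f"challenged {hits} of 10000 requests")   # ~100 expected
```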
I didn’t mention IP at all in my post. You can globally rate limit anonymous requests for everything (except maybe your login endpoint), if that’s the thing that makes sense for you.
The nice thing about the proof-of-work approach is that it can be backgrounded for users with normal browsers, just like link prefetching.
> You can globally rate limit anonymous requests for everything
Obviously you can do this, but that will start blocking everyone. How is that a solution?
Also what prevents attacker-controlled browsers from backgrounding the PoW too?
I took it as read that for something to qualify as a solution to this problem, it needs to affect regular users less badly than attackers.
> Obviously you can do this, but that will start blocking everyone. How is that a solution?
Two corrections: It will start rate-limiting (not blocking) anonymous users (not everyone). It's a mitigation. If specific logged-in users are causing problems, you can address them directly (rate limit them, ban them, charge them money, etc). If nonspecific anonymous users are causing problems, you can limit them as a group, and provide an escape hatch (logging in). If your goal is to "free access to everyone except people I don't like but I can't ask people if they are people who I don't like", well, I suppose it isn't a good mitigation for you, sorry.
> Also what prevents attacker-controlled browsers from backgrounding the PoW too?
The cost. A single user can complete the proof of work in a short period of time, costing them a few seconds of CPU cycles. Scaling this up to an industrial-scale scraping operation means that the "externalized costs" that the OP was talking about are duplicated as internalized costs in the form of useless work that must be done for you to accept their requests.
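To make the asymmetry concrete, here is a minimal hashcash-style sketch; the difficulty value and challenge format are illustrative assumptions, not how any particular tool (e.g. Anubis) implements it. The requester burns roughly 2^difficulty hash attempts on average, while the server verifies with a single hash.

```python
# Hashcash-style proof of work: solving costs the requester ~2^difficulty
# hash attempts on average, verification costs the server exactly one hash.
# Difficulty and the challenge format are illustrative assumptions.
import hashlib, os, itertools

def leading_zero_bits(digest: bytes) -> int:
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def solve(challenge: bytes, difficulty: int) -> int:
    """Client side: brute-force a nonce (expensive)."""
    for nonce in itertools.count():
        digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce

def verify(challenge: bytes, nonce: int, difficulty: int) -> bool:
    """Server side: one hash (cheap)."""
    digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
    return leading_zero_bits(digest) >= difficulty

if __name__ == "__main__":
    challenge = os.urandom(16)   # issued per request or per session
    difficulty = 18              # ~260k attempts on average; could be tuned per endpoint
    nonce = solve(challenge, difficulty)
    assert verify(challenge, nonce, difficulty)
```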
> If your goal is to "free access to everyone except people I don't like but I can't ask people if they are people who I don't like", well, I suppose it isn't a good mitigation for you, sorry.
Ah. Yes, the part in quotes here is what I think would count as a solution -- I've been assuming that simply steering anonymous users towards logging in would be the obvious thing to do otherwise, and that doing this is unacceptable for some reason. I was hoping that, despite attackers dispersing themselves across IP addresses, there would either be (a) some signal that nevertheless identifies them with reasonable probability (perhaps Referer headers, or their absence in deeply nested URL requests), or (b) some blanket policy that can be enforced which will hurt everyone a little but hurt attackers more (think chemotherapy).
> It will start rate-limiting (not blocking) anonymous users (not everyone).
If some entities (attackers) are making requests at 1000x the rate that others (legitimate users) are, the effect in practice of rate-limiting will be to block the latter nearly all the time.
> Scaling this up to an industrial-scale scraping operation
My understanding was that the PoW would be done in-browser, in which case this doesn't hold -- the attackers would simply use the multitudes of residential browsers they already control to do the PoW prior to making the requests, thus perfectly distributing that workload to other people's computers. What kind of PoW cannot be done in this way?
> My understanding was that the PoW would be done in-browser, in which case this doesn't hold -- the attackers would simply use the multitudes of residential browsers they already control to do the PoW prior to making the requests, thus perfectly distributing that workload to other people's computers. What kind of PoW cannot be done in this way?
I could be mistaken, but I don't think these residential VPN services are actual botnets. You can use the connection, but not the browser. In any case, you can scale the work factor as you want, making "unlikely" endpoints harder to access (e.g. git blame for an old commit might be 100x harder to prove than the main page of a repository). This doesn't make it impossible to scrape your website, it makes it more expensive to do so, which is what the OP was complaining about ("externalizing costs onto me").
All in all, it feels like there's something here to leverage proof of work as a way to maintain anonymous access while still limiting your exposure to excessive scrapers. It probably isn't a one-size-fits-all solution, but with some domain-specific knowledge it feels like it could be a useful tool to have in the new internet landscape.
> You can use the connection, but not the browser.
Fair enough, that would likely be the case if they're using "legitimate" residential IP providers, and in that case they would indeed need to pay for the PoW themselves somehow.
Is there currently any way to identify ethical (or I guess, semi-ethical or even not actively unethical..) AI companies?
Would be nice to see some effort on this front, a la "we don't scrape", or "we use this reputable dataset which scrapes while respecting robots.txt and reasonable caching", or heaven forbid "we train only on a dataset based on public domain content"
Even if it was an empty promise (or goodwill ultimately broken), it'd be _something_. If it existed, I'd certainly prioritise any products which proclaimed it.
Do we need to make one?
(Posting as separate comment because wall of text)
Also, honestly, I don't even understand why so many people would need to scrape for training data.
Is it naïveté (thinking it necessary) or arrogance (thinking it better than others, and thus justified)?
Aren't most advances now primarily focused on either higher level (agents, layered retrieval) or lower level (eg. alternatives to transformers, etc.. which would be easier to prove useful on existing datasets)?
Genuine questions, all of these - if I'm off the mark I'm keen to learn!
I had to take my personal Gitea instance offline, as it was regularly spammed, crashing Caddy in the process.
If you don't have public repos you can use gitolite and have SSH-only access.
I used and am now using plain git over ssh. I only hosted gitea to have my code accessible for anyone interested.
My "main" projects I want to keep public are moved/cloned to GitHub, the rest I just don't bother with any longer.
If you personally work on developing LLMs et al, know this: I will never work with you again, and I will remember which side you picked when the bubble bursts.
Does this statement have ramifications for sourcehut and the type of projects allowed there? Or is it merely personal opinion?
At the top of the article:
> This blog post is expressing personal experiences and opinions and doesn’t reflect any official policies of SourceHut.
He's just right
Unsurprisingly, this submission is far removed from the front page.
The people stealing and ddos-ing literally do not care. Nobel prizes are handed out for this crap. Politicians support the steal.
Thanks Drew for being an honest voice.
Have there been efforts to set up something like the SMTP block lists for web scraping bots? Seems like this is something that needs to be crowd sourced. It'll be much easier to find patterns in larger piles of data. A single real user is unlikely to do as much as quickly as a bot.
Those are largely ineffective in the modern world.
Most botnets, as he mentions in the blog post, come from one IP address only one time. With thousands to tens of thousands of IP addresses in the botnet, there is just no way to block by IP address anymore.
just like with crypto, AI culture is fundamentally tainted (see this very thread where people are defending this shit). Legislation hasn’t caught up and doing what’s essentially a crime (functionally this is just a DDoS) is being sold as a cool thing to do by the people who don’t foot the bill. If you ever ask yourself why everybody else hates you and your ilk, this attitude is why. The thing I don’t understand with AI though is that without the host, your parasite is nothing. Shouldn’t it be in your best interest to not kill it?
I hear you, and there are even greater concerns than the “you kids get off of my lawn” vibe, e.g. the massive amounts of water required for cooling data centers.
But the “bubble” threat the post mentions is just emotional; things are accelerating so quickly that what is hype one day isn’t hype in a few months. LLM usage is just going to get heavier.
What should get better are filters that prevent bots from easily scraping content. Required auth could work. Yes, it breaks things. Yes, it may kill your business. If you can’t deal with that, find another solution, which might be crowdfunding, subscription, or might be giving up and moving on.
Working towards a solution makes more sense than getting angry and ranting.
The playing field is very asymmetric. Can you negotiate with these guys? No. They're faceless when it comes to their data operations. Secretive, covert even.
Creating a solution requires cooperation. Making these models fit on smaller systems, optimizing them to work with less resources need R&D engineering, which no company wants to spend on right now. Because hype trains need to be built, monies need to be made, moats need to be dug (which is almost impossible, BTW).
My 10-year-old phone can track 8 people in the camera app in real time, with no lag. My A7 III can remember faces, prioritize them, focus on mammals' and humans' eyes, track them, and keep them in focus at 30 FPS with a small DSP.
Building these things is possible, but nobody cares about it right now, because AI datacenters live in something like a ZIRP economy. They're almost free for what they do, even though they harm the environment in irreparable ways just to give you something that is not right.
As if people lost their mind about this thing.
People’s basic needs are food and water, followed by safety, belonging, esteem, and finally self-actualization. Add to that a preference for convenience over inconvenience.
With LLMs come a promise of safety in that they will solve our problems, potentially curing disease and solving world hunger. They can sound like people that care and make us feel respected and useful, making us feel like we belong and boosting our esteem. They do things we can’t do and help us actualize our dreams, in written code and other things. We don’t have to work hard to use them anymore, and they’re readily available.
So, people are going to focus on that.
Ah.
We were already using models, and accurate ones at that, for drug discovery, nature simulation, weather forecasting, ecosystem monitoring, etc. well before LLMs with their nondescript chat boxes arrived. AI was there; the hardware was not. Now we have the hardware, and the AI world is much richer than generative image models and stochastic parrots, but we live in a world where we assume the noisiest thing is the best and the most worthy of our attention. It's not.
The only thing is, LLMs look like they can converse, while giving back worse results than the models we already have or could develop without conversation capabilities.
LLMs are just shiny parrots which can taunt us from a distance, and look charitable while doing that. What they provide is not correct information, but biased Markov chains.
It only proves that humans are as gullible as other animals. We are drawn to shiny things, even if they harm us.
I would like to add that European LLM projects built from openly available sources do exist and were previously discussed here on HN. They're much smaller models than the top LLMs other countries have made, but IIRC they're made ethically (and are open source).
The culture of "fuck you, got mine" is something I despise on a very deep level, not just in AI but in most big-tech shitfuckery that has been going on since forever.
Where [1] is cryptocurrency fundamentally tainted?
Do you also think E-Mail is fundamentally tainted because spam is widespread? [2]
[1] https://bitcoin.org/bitcoin.pdf
[2] https://www.emailtooltester.com/en/blog/spam-statistics/
Please read my post again. Crypto culture is tainted. Cryptocurrency is just tech, just as ML is just tech. Crypto culture, however, is rug pulling, rampant scams, etc. Email, again, is just tech, and there is no real email culture around it (mailing lists?).
I really wonder how to best deal with this issue, almost seems like all web traffic needs to be behind CDNs which is horrible for the open web.
The internet is evolving to a state where, if you don't have an n-level-deep stack, you're cooked.
Which is ugly on so many levels.
I see a future where there are many more digital gated communities. I don’t mean walled gardens in the old sense like AOL or Facebook. I mean people will create private networks with wireguard or something similar. All the services are on the inside. Discoverability is sacrificed to achieve security and privacy.
Dark forest but the very thing we fear could destroy our internet gated communities is big tech/tech companies
It sounds reasonable, and i'm not joking
Sounds like a BBS accessed over encrypted channels.
I don't think that any CDN is offering adequate protection from these attacks. I wish they did. The only solution I see remains the same -- remove all your stuff from the open web. That's what I've done.
I'm constantly looking for some way that I can put everything back up, but nothing has presented itself yet. Plus, if OpenAI's attempts at legalizing what they do are successful, it seems unlikely there will ever be an adequate solution.
Assuming you can prove it's a company, doesn't the behavior equate to fraud? Seems to hit all the prongs, but I am no lawyer.
It doesn't matter if it's fraud, "AI" is now considered an arms race, if we require them to fairly acquire their content we'll fall behind China or another country and then America might lose the WWIII it is constantly preparing for.
You might see a couple of small players or especially egregious executives get a slap on the wrist for bad behavior but in this political climate there's no chance that Republicans or Democrats will put a stop to it.
why is this story with 200 votes not on the front page?
The internet you knew and mastered no longer exists.
It's cool because you ruined the internet for someone older, too.
When I think of the countless software jobs that automated other people out of work perhaps it is poetic justice that we also did it to ourselves.
What's surprising to me is that established companies (OpenAI, Anthropic, etc) would resort to such aggressive scanning to extract a profit. Even faking the user agent and using residential IPs is really scummy.
They're doing something awesome, so why wouldn't they (in their own words), I ask. What they do boils my blood, honestly.
Assuming that established companies are automatically ethical and just is not correct. Meta used a "laptop and a client tuned to seed as minimally as possible to torrent 81TBs of copyrighted material" to train their models.
For every picture-perfect and shiny facade, there's an engineer sitting on the floor in a corner, hacking things together to prop up that facade.
> Meta used a "laptop and a client tuned to seed as minimally as possible to torrent 81TBs of copyrighted material" to train their models.
Frankly, as a person committing copyright infringement 24/7 I have no problem with that. Especially with them releasing the weights.
I'll print, frame and hang this comment.
I'm sure you'll have no problem with that, either.
Should they have a problem with that? What problem would you have with me doing that with your comment?
> Should they have a problem with that?
No.
> What problem would you have with me doing that with your comment?
Nothing.
Addenda:
However, while I'll be putting that comment in my favorite-screenshots folder and letting it adorn my cubicle for some time, there's also an implicit meaning. With that wording, the subtext is "That's brave". [0]
It's not meant to be an insult, but to signify a greatly different point of view compared to OP (me, in this case).
While the tone may have come across as angry, slightly agitated is a more accurate measure of my feelings.
Either way, it's fair game and no hard feelings. Also, the answer from the original commenter is good sport. Thanks mate!
[0]: https://english.stackexchange.com/questions/574974/etymology...
I see. I didn’t take it quite so literally on first read. It initially read (and still does if I want it to, I guess) like people being upset about reddit using their posts for AI training, which I’ve never understood. The second sentence in particular.
> ...people being upset about reddit using their posts for AI training,...
I'm one of these people, and I'm not visiting or contributing to Reddit anymore. My logic is simple: "Ask me first". I may say yes, I may say no, but ask me first.
I'm bullish about ethics and transparency. I work on FAIR Data at work, as a side quest. Transparency, consent, and licenses are a big part of it. Honoring them is important at every level, both ethically and for integrity reasons.
As a result, when someone does something behind my back without asking me first, I leave when I find out. I was this close to abandoning Go when they first declared that telemetry would be opt-out. Now it's opt-in, and so I still use Go.
It's not that I don't send usage data, either. Some applications and parties I trust get telemetry from me, but all of them were opt-in and asked me first.
In short, a bit of decency and agency goes a long way in my eyes.
Be my guest.
I can only rejoice a bit at the fact that their models can and will be distilled, and perhaps copied, in a few months into other, less expensive and maybe open-source models.
>What's surprising to me is that established companies (OpenAI, Anthropic, etc) would resort to
Stop defaulting to expecting honor from businesses.
Unless you're clutching pearls for dramatic effect, any surprise at all is an indication your orientation is wrong.
Or an indication that I'm from a place where regulations exist.
> All of my sysadmin friends are dealing with the same problems.
I'm seeing this as well. Some of the websites my company operates suddenly see something like a 10-25x increase in requests per minute at night. Most often the AS the IP comes from is some Microsoft datacenter (meaning it's most likely OpenAI).
At this point I'm starting to consider blocklisting Microsoft AS numbers and/or requiring a login (or some kind of API key) for traffic coming from a datacenter network. Reddit already does this as well, for example.
> If you personally work on developing LLMs et al, know this: I will never work with you again, and I will remember which side you picked when the bubble bursts.
This sounds... childish (to put it politely) ?
The solution here is simple: require user logins for data retrieval. Each user gets a quota as per a standard bucket algorithm. Don't overcomplicate a non-issue.
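For what it's worth, a minimal sketch of such a per-user bucket quota; the rate and burst values are placeholders:

```python
# Minimal per-user token bucket: each logged-in user gets `capacity` requests
# of burst and refills at `rate` requests per second. Values are placeholders.
import time
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    rate: float = 1.0        # tokens added per second
    capacity: float = 60.0   # maximum burst size
    tokens: float = 60.0
    last: float = field(default_factory=time.monotonic)

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

buckets: dict[str, TokenBucket] = {}

def allow_request(user_id: str) -> bool:
    # One bucket per authenticated user; reject (or queue) when it runs dry.
    return buckets.setdefault(user_id, TokenBucket()).allow()
```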
requiring logins for /everything/ would close up the web. Now... requiring a login for viewing git blame (which is resource intensive) might be a good idea...
All things unregulated eventually turn into a Mogadishu-style ecosystem where parasitism is ammunition. Open-source projects are the plants of this brave new disgusting world, to be grazed upon by the animals that have suppressed the plants' protection system, a.k.a. the state.
Once the Allmende (the commons), the grass, runs out, things get interesting. We shall see cyberpunk-like computational parasitism plaguing companies, and attempts to filter that work out. I guess that is the only way to really prevent that sort of bot: you take arbitrary pieces of the unit of work they want done and reject them on principle. Depending on the cleverness of the batch algorithm, they will come back with the same pieces again and again, identifying themselves via repetition.
What would happen if these AI crawlers scraped data that was deliberately incorrect? Like a subtly broken algorithm implementation.
They've already scraped my Github.
They’d take that into account when relaying information. Poisoning LLMs is well documented.
You're assuming it would make a difference: they're already just hoovering up 'usual stuff' and splicing it all sorts of ways. For these purposes there's no difference between 'deliberately incorrect' and 'just not very good at its job'.
No point in taking effort to recreate what's already out there as authentic not-goodery :)
Welcome to 2025...
I work for a hosting company and I know what this is like. And, while I completely respect that you don’t want to give out your resources for free, a properly programmed and/or cached website wouldn’t be brought down by crawlers, no matter how aggressive. Crawlers are hitting our clients' sites all the same, but you only hear problems from those who have piece-of-shit websites that take seconds to generate.
git blame is always expensive to compute, and precomputing (or caching) it for every revision of every file is going to consume a lot of storage.
I guess for computationally expensive things the only real option is to put it behind a login. I’m sure this is something SourceHut doesn’t want to do but maybe it’s their only decent option.
On SourceHut, git blame is available while logged out, but the link to blame pages is only shown to logged-in users. That may be a recent change to fight scrapers.
Precomputing git blame should take the same order of magnitude of storage as the repository itself. Same number of lines for each revision, same number of lines changed in every commit.
Should be easy to write a script that takes a branch and constructs a mirror branch with git blame output. Then compare storage space used.
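A rough, simplified sketch of that comparison: instead of constructing a mirror branch, it just blames every file at HEAD and compares the total blame output size with the packed repository size. The repository path and the choice of porcelain format are assumptions.

```python
# Rough sketch: compare the packed repository size with the total size of
# `git blame` output for every file at HEAD. Measuring raw blame text at HEAD
# only (rather than a mirror branch for every revision) is a simplification.
import subprocess, sys

def git(*args: str, repo: str) -> bytes:
    return subprocess.run(["git", "-C", repo, *args],
                          check=True, capture_output=True).stdout

def blame_size(repo: str) -> int:
    files = git("ls-files", "-z", repo=repo).split(b"\0")
    total = 0
    for path in filter(None, files):
        try:
            total += len(git("blame", "--porcelain", "HEAD", "--", path.decode(), repo=repo))
        except (subprocess.CalledProcessError, UnicodeDecodeError):
            pass  # skip binary/unreadable files and odd filenames
    return total

def repo_size(repo: str) -> int:
    # `git count-objects -v` reports size-pack in KiB
    for line in git("count-objects", "-v", repo=repo).decode().splitlines():
        if line.startswith("size-pack:"):
            return int(line.split()[1]) * 1024
    return 0

if __name__ == "__main__":
    repo = sys.argv[1] if len(sys.argv) > 1 else "."
    print(f"packed repo : {repo_size(repo):>12} bytes")
    print(f"blame @HEAD : {blame_size(repo):>12} bytes")
```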
It is more fun to fight LLMs rather than trying to create magical unicorn caches that work for every endpoint.
I had a friend that said he'd never get a mobile phone, and he did hold out until maybe 2010. He eventually realized the world had changed.
> I had a friend that said he'd never get a mobile phone, and he did hold out...
Up until this point you had a good start to a very inspiring story. ;)
If it helps, I have a friend who still doesn't have a mobile phone.
Tell me everything about it! Can he do an AMA on HN? ;)
devhc (/dev/humancontroller)? :D (doubt, but would be interesting).
Was with Drew until the solution:
>Please stop legitimizing LLMs or AI image generators or GitHub Copilot or any of this garbage. I am begging you to stop using them, stop talking about them, stop making new ones, just stop.
Anyone with any inkling of the current industry understands this is Old-Man-Yells-At-Cloud territory. Why he even offers this as a solution, I'm unsure.
Here's potential solutions: 1) Cloudflare. 2) Own DDoS protecting reverse proxy using captcha etc. 3) Poisoning scraping results (with syntax errors in code lol), so they ditch you as a good source of data.
Now ask me how I know that your preferred solution to not being stabbed in the chest while walking down the street is to just wear a stab vest wherever you go.
I hear your idealist's take. The idealist in me agrees 100%.
Here's my pragmatist comeback: is your preferred solution to the lack of democracy in the Middle East to go and protest in front of the palace?
To stay with the analogy, people won't wear the stab-proof vest, or at least won't stop there. People will stab back. The magnitude of the attack by bad actors in the AI space is potentially life-ruining, and they shouldn't expect not to have their lives ruined in retaliation if this keeps up.
I disagree with the premise of this article, which is: the internet used to be so nice when it wasn't the center of humanity's activities but now everything sucks.
I read this: "I am sick and tired of having all of these costs externalized directly into my fucking face", as: "I am sick and tired of doing something for the world and the world not appreciating it because they want to do their own things".
Lack of appreciation is underrated as a problem because it's abstracted away by the mechanics of the free market ("if you don't want it, just don't buy it"). Yet markets are quite foundational to humanity's activities, and a market is a place where everybody can put up a stall. Now Drew can't have his stall, and he's angry. But anger is rarely part of the solution.
His path is that of anger, unfortunately, with some hate added in too:
> "If you personally work on developing LLMs et al, know this: I will never work with you again, and I will remember which side you picked when the bubble bursts."
There's a significant difference between lack of appreciation and vandalism.