Perversely, this submission is essentially blogspam. The article linked in the second paragraph, to which this "1 minute" read adds almost nothing of value, is the main story:
<https://thelibre.news/foss-infrastructure-is-under-attack-by...>
394 comments. 645 points. Submitted 3 hours ago: <https://news.ycombinator.com/item?id=43422413>
But also ironically, it's almost heartwarming these days to see blogspam that's not machine-generated! A real live human cared enough about an article to write a brief (perhaps only barely substantial, but at least handwritten) take on it!
It's reminiscent, perhaps, of the feel and motivation for Tumblr reblogs - and Tumblr continues to be vibrant by virtue of this culture: https://www.tumblr.com/engineering/189455858864/how-reblogs-... (2019)
Now, is driving attention and reputation to their site (in the broadest senses) part of a blogspammer/reblogger's motivation? Absolutely!
But should we be concerned about rewarding their act of curation, as long as there is at least some level of genuine curation intent? A world where that answer is categorically "no" would be antithetical, I think, to the concept of the participatory web.
"heartwarming ... To see blogspam" the internet was a mistake
The internet was great; everything we did with it in the last 20 years was the mistake. Collimating in a comment that blogspam can now be one of the positive notes in the hellscape we are building.
A very useful hellscape though, for all its flaws
The internet IS great. The problem is similar to everything else in the world: the ratio of idiots to smart people using it is wildly in favor of the idiots, and they get to influence where it goes, how it operates, and many other things.
In other words, democracy sucks, but we have not found anything so much better that it would outweigh the benefit of freedom for everyone...
culminating
Thanks, that's the right word. Now at least you know the comment wasn't written by AI
It won't be long until AI makes spelling, grammar and usage errors, if it's trained on things like message boards where people still don't know the difference between were, we're and where, lose vs. loose, affect vs. effect and wary vs. weary.
I don't feel this is blogspam; it's more of a quick comment on the situation pointing to the actual article. I don't see anything wrong with writing a short post boosting or commenting on another article. There are no ads, so I don't see this as blogspam, which I associate with financial gain or clout.
It also linked to https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali..., which is another worthwhile read.
All the time I see links on the HN front page to Twitter and Mastodon posts with just as little text to them. Why does it upset you when it is in the medium of blogs, but not microblogs?
Hehe, just participating in POSSE :) Funnily enough the story you're linking to quotes me with pictures of a story I wrote (https://sethmlarson.dev/slop-security-reports) about LLM-generated reports to open source projects.
I might be naive, but I think it's time we seriously start implementing "HTTP status code 402: Payment Required" across the board.
"L402" is an interesting proposal. Paying a fraction of a penny per request. https://github.com/l402-protocol/l402
This is basically what they are doing, but instead of charging actual money they are making visitors spin the CPU, ideally on a proof-of-work problem, which has the same outcome from the crawler's perspective.
I've talked with tons of publishers and all say the same thing:
"Hey, we'd happily give these companies clean data if they just paid us instead of building these scrapers."
I think there is a psychological aspect that made micropayments never work for humans, but machines may be better suited to them.
This has existed for decades. The proof of CPU work is called "frontend frameworks"
I stumbled upon this status code last year - had never heard of it before - and I bookmarked it and then forgot about it. Thanks for the reminder.
This is ultimately the answer. If something has value, users should pay for it. We haven't had a good way to do that on the web, so it has resulted in the complete shitshow that most websites are.
There's a real economic problem here: when someone scrapes your site, you're literally paying for them to use your stuff. That's messed up (and not sustainable)
It seems like a good fit for micropayments. They never took off with people but machines may be better suited for them.
L402 can help here.
https://l402.org
The other obvious solution is a "web of trust" where Cloudflare just tells you "this request goes in, this one goes out".
I think the paying approach is superior (after all, you make money from people using your service), but Cloudflare is a more straightforward/simpler one.
Aren't you paying for me to use the site, too? Or Google? Isn't the point of paying for a web hosting service to distribute information?
Yes, but there is a "free lunch" problem. I can run a script that hits your page costing you X at a fraction of the cost for me (the user)
*Edit: typo
I think the whole internet is a free lunch problem as far as that goes. I pay for web hosting because I consider the cost to be worth it to send my fabulous opinions into the ether.
The premise of this thread is that somehow the LLM builders are reading too much. I bet it's less than Google.
I continue to believe, if you don't want everyone in the world to see and use your stuff, don't put it on the internet.
Rate limiting is the first step before cutting everything off behind forced logins.
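For anyone who hasn't implemented it, per-client rate limiting is usually a token bucket. A minimal sketch, with illustrative limits (and, as a comment below points out, per-IP keys don't help when each bot IP makes only a single request):

    # Minimal token-bucket sketch for per-IP rate limiting.
    # RATE and BURST are illustrative; real deployments would use
    # something like nginx's limit_req instead of application code.
    import time
    from collections import defaultdict

    RATE = 1.0    # tokens refilled per second
    BURST = 10.0  # bucket capacity (max burst size)

    buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

    def allow(client_ip: str) -> bool:
        b = buckets[client_ip]
        now = time.monotonic()
        b["tokens"] = min(BURST, b["tokens"] + (now - b["last"]) * RATE)
        b["last"] = now
        if b["tokens"] >= 1.0:
            b["tokens"] -= 1.0
            return True   # serve the request
        return False      # reply 429 Too Many Requests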
> This practice started with larger websites, ones that already had protection from malicious usage like denial-of-service and abuse in the form of services like Cloudflare or Fastly
FYI Cloudflare has a very usable free tier that’s easy to set up. It’s not limited to large websites.
Cloudflare also locks out non-Chrome/Firefox browsers, stifling the development of alternatives.
[0] https://news.ycombinator.com/item?id=42953508
I get the feeling that I'm going to read a blog post in a few years telling us that the CDN companies have been selling everything pulled through their cache to the AI companies since 2022
CDNs are a cash cow. They’re not going to set their reputation on fire and violate all of their security guarantees for negligible amounts of money.
What reputation?! Cloudflare has been known for its shady practices for more than a decade now, but people just don't care.
Cloudflare is a modern-day protection racket, and I don't get how people don't see that. Their CDN and associated software offering is really not that good. Even their public DNS sucks. They always advertise some insane capabilities, but I never experienced those benefits myself.
And every once in a while Cloudflare denies me entry to a site because it thinks I'm a robot; they are not even able to make proper heuristics to separate robots from humans, so I don't know why anyone should trust them...
I know a lot of companies that not only willingly send their most precious trade secrets (TM) to shady LLM operators (OpenAI, Microsoft, etc.) for free, but even pay for the privilege of doing it... just out of fear of "missing out" on this Next Big Thing.
Cloudflare continues to make a loss.
meanwhile: "I'm proud of how our team continued to deliver ground-breaking innovation, especially in AI" (Matthew Prince, co-founder & CEO of Cloudflare)
Cloudflare is free
for individual users.
Don't worry, they charge plenty from big websites.
See absolutely every other sector of industry and economy for copious counter-examples.
If there's profit on the table, capitalism will not allow it to sit there at any cost.
The argument is that there's less profit in selling out user trust. Of course, this heavily depends on the context, especially if you have a regulatory lock-in where users can't leave. However, there are cases where keeping user trust _is_ the profitable route.
Oh, you mean like when Apple spent years claiming to be the champion of privacy when they in fact recorded what you said, all the time, and sent it to their teams to improve Siri [1]?!
Did you see a mass exodus from Apple after that? I have seen barely any coverage of this outside of the French-speaking world…
There are too many shady practices for the consumer to track, and you cannot even move to an alternative, since everybody is doing bad stuff.
[1]: https://www.politico.com/news/2020/05/20/apple-whistleblower...
People don't operate rationally when it comes to Apple. They have been so good at manipulating people that even when they get criticism, it's always in a very toned-down way.
Apple's privacy stance is bullshit and disgusting. I know for a fact that they collect as much information as Google (in fact, in my case, looking at the data takeout, they know much more about me than Google does), but the pretense is that they don't use it or sell it. Yet the average user has zero way to verify that, and the incentives are so bad that it's just a question of when Apple's stance will change (anything is possible for profits, as Tim Cook has very much proved).
If Apple was truly serious about privacy, they would refuse any type of cloud offering and completely disown their App Store model. But the only thing Apple cares about is money; privacy is just a very easy marketing target that requires proving nothing.
Microsoft is shady, Google is borderline, Facebook is careless but by far, the evilest of the bunch is Apple.
I understand where you're coming from; I'm just saying there have been many companies I've worked for where protecting the user _was_ the most profitable play. And that's what we did. Yes, you're less likely to see it in bigger corporations with ecosystem lock-in (like Apple), because the math begins to sway toward exploiting users being more profitable.
This kind of thing always starts with the free tier and then creeps into every other tier. I can absolutely see Cloudflare doing this for free tier users "in order to recover costs"
And even if they don't, is everything depending on Cloudflare to stay online a good thing?
It’s a terrible thing.
Cloudflare is the company I hate the most: I think (what I know of) their tech is done right, and they’re just too big to put my eggs in their basket.
Why is nobody building a better product?
Because their "product" isn't one born of actual needs and has to do with inherent weakness to the way internet and some protocol works.
Using Cloudflare is not a permanent fix, just a bandage, and it's particularly bad that they'll use their quasi monopoly into strongarming business to pay large fees. It's basically racketeering, legal.
If we are talking about the CDN and associated "software", last time I checked (a long time ago admittedly) it was nothing special.
Because Cloudflare's products are very good. The issues people have with Cloudflare are more philosophical and won't be resolved by building a better product. GP's concerns would best be rectified by dozens of commodity-like competitors of similar quality.
Exactly. Not even social networking is bad per se. It’s concentration of power that’s bad.
Considering recent events especially, is all the worlds traffic passing through a cdn subject to a certain jurisdiction a good thing?
You mean why is nobody making arrangements to put their own server into thousands and thousands of ISP datacenters?
Because their product is awesome dude. I don’t even know where to improve it. Love it as a user.
By snuffing out competition through economies of scale is one explanation.
Until they threaten that you must pay a huge bill or they will shut down your services. No thanks. Cloudflare has extremely questionable business practices.
Cloudflare took down our website: https://news.ycombinator.com/item?id=40481808
A user running an online casino claimed that Cloudflare abruptly terminated their service after they refused to upgrade to a $10,000/month enterprise plan. The user alleged that Cloudflare failed to communicate the reasons clearly and deleted their account without warning.
Quote: "Cloudflare wanted them to use the BYOIP features of the enterprise plan, and did not want them on Cloudflare's IPs. The solution was to aggressively sell the Enterprise plan, and in a stunning failure of corporate communication, not tell the customer what the problem was at all."
——
Tell HN: Don't Use Cloudflare: https://news.ycombinator.com/item?id=31336515
Summary: A user shared their experience of being forced to upgrade to a $3,000/month plan after using 200-300TB of bandwidth on Cloudflare's business plan. They criticized Cloudflare's lack of transparency regarding bandwidth limits and aggressive sales tactics.
Quote: "A lot of this stuff wasn't communicated when we signed up for the business plan. There was no mention of limits, nor any contracts nor fineprint."
——
Tell HN: Impassable Cloudflare challenges are ruining my browsing experience: https://news.ycombinator.com/item?id=42577076
Summary: A user expressed frustration with Cloudflare's bot protection challenges, which made it difficult for them to unsubscribe from emails or access websites. They highlighted how these challenges disproportionately affect privacy-conscious users with non-standard browser configurations.
Quote: "The 'unsubscribe' button in Indeed's job notification emails leads me to an impassable Cloudflare challenge. That's a CAN-SPAM act violation."
It's modern racketeering.
If you don't need them, they'll make you think you need them (so they can monitor your needs) and when you do need them, they will extort you any way they can.
The vast majority of websites don't need Cloudflare; very often people use it because they run things in a very terrible way. Instead of paying Cloudflare extortion fees, pay competent people for proper infrastructure development.
What exactly should be rate-limited, though? See the discussion here -- https://news.ycombinator.com/item?id=43422413 -- the traffic at issue in that case (and in one that I'm dealing with myself) is from a large number of IPs making no more than a single request each.
Centralizing large parts of the web behind Cloudflare is something to be feared as well.
Screw Cloudflare, I'd rather host my own proxies.
Linked from the article that this article links to is a project I found interesting for combating this problem: a (non-crypto) proof-of-work challenge for new visitors: https://github.com/TecharoHQ/anubis
Looks like the GNOME Gitlab instance implements it: https://gitlab.gnome.org/GNOME
For targeted scrapes, isn't proof of work trivial to bypass?
1. headless browser 2. get cookie 3. use cookie on subsequent plain requests
It doesn't sound like the scrapers are that smart yet, but when they get there, presumably you'd just lower the cookie lifetime until the requests are down to an acceptable level. It takes a split-second in my browser so it shouldn't interfere much for human visitors.
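For reference, the underlying mechanism is hashcash-style: the server hands out a random challenge, the client grinds nonces until a hash has enough leading zero bits, and the server verifies with a single hash before setting that short-lived cookie. A rough sketch (not Anubis's actual code; the difficulty value is illustrative):

    # Hashcash-style proof-of-work sketch (not Anubis's actual code).
    import hashlib
    import os

    DIFFICULTY = 16  # leading zero bits required; illustrative value

    def leading_zero_bits(digest: bytes) -> int:
        bits = 0
        for byte in digest:
            if byte == 0:
                bits += 8
                continue
            bits += 8 - byte.bit_length()
            break
        return bits

    def verify(challenge: bytes, nonce: int) -> bool:
        digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
        return leading_zero_bits(digest) >= DIFFICULTY

    def solve(challenge: bytes) -> int:
        nonce = 0  # the client burns CPU in this loop
        while not verify(challenge, nonce):
            nonce += 1
        return nonce

    challenge = os.urandom(16)        # server issues this per visitor
    nonce = solve(challenge)          # expensive for the client...
    assert verify(challenge, nonce)   # ...cheap for the server to check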
We should try separating good bots from bad bots:
Good bots: search engine crawlers that help users find relevant information. These bots have been around since the early days of the internet and generally follow established best practices like robots.txt and rate limits. AI agents like OpenAI's Operator or Anthropic's Computer Use probably also fit into that bucket, as they offer useful automation without negative side effects.
Bad bots: bots that have a negative effect on website owners by causing higher costs, spam, or downtime (automated account creation, ad fraud, or DDoS). AI crawlers fit into that bucket, as they disregard robots.txt and spoof user agents. They are creating a lot of headaches for developers responsible for maintaining heavily crawled sites. AI companies don't seem to care about any of the crawling best practices the industry has developed over the past two decades.
So the actual question is how good bots and humans can coexist on the web while we protect websites against abusive AI crawlers. It currently feels like an arms race without a winner.
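Worth noting how low the bar for a "good bot" is: respecting robots.txt and a crawl delay takes a few lines with Python's stdlib parser. A minimal sketch (the domain and user agent are hypothetical placeholders):

    # Minimal well-behaved-crawler sketch; domain and UA are placeholders.
    import time
    import urllib.robotparser
    import urllib.request

    USER_AGENT = "ExampleCrawler/1.0 (+https://example.com/bot)"

    rp = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
    rp.read()  # first stop before crawling anything on the host

    def polite_fetch(url: str):
        if not rp.can_fetch(USER_AGENT, url):
            return None  # the operator opted out; respect it
        time.sleep(rp.crawl_delay(USER_AGENT) or 1.0)  # meter requests
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req) as resp:
            return resp.read()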
Distinguishing search engine bots is pretty straightforward: the big names provide bulletproof methods to validate whether a client claiming to be their bot really is their bot. It'll be an uphill battle for new search engines if everyone only trusts Googlebot and Bingbot, though.
https://developers.google.com/search/docs/crawling-indexing/...
https://www.bing.com/webmasters/help/verifying-that-bingbot-...
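Both links boil down to forward-confirmed reverse DNS. A sketch (the suffix list follows Google's documentation; everything else is illustrative):

    # Forward-confirmed reverse DNS check for a claimed Googlebot IP.
    # Suffixes are from Google's docs; everything else is illustrative.
    import socket

    GOOGLEBOT_SUFFIXES = (".googlebot.com", ".google.com")

    def is_real_googlebot(ip: str) -> bool:
        try:
            host, _, _ = socket.gethostbyaddr(ip)  # reverse lookup
            if not host.endswith(GOOGLEBOT_SUFFIXES):
                return False
            # Forward-confirm: the claimed hostname must resolve back to ip.
            addrs = {info[4][0] for info in socket.getaddrinfo(host, None)}
            return ip in addrs
        except (socket.herror, socket.gaierror):
            return False

    # Usage: only trust a "Googlebot" user agent if this returns True.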
> How long until scrapers start hammering Mastodon servers?
Mastodon has AUTHORIZED_FETCH and DISALLOW_UNAUTHENTICATED_API_ACCESS which would at least stop these very naive scrapers from getting any data. Smarter scrapers could actually pretend to speak enough ActivityPub to scrape servers, though.
I would think all you need to do is add a copyright statement of some kind.
Sad that things are getting to this point. Maybe I should add this to my site :)
(c) Copyright (my email), if used for any form of LLM processing, you must contact me and pay 1000USD per word from my site for each use.
The argument the AI companies are making is that training for LLMs is fair use which means a copyright statement means fuck all from their point of view. (Even if it does, assuming you're in the US, unless you register the copyright with the US copyright office, you can only sue for actual damages, which means the cost of filing a lawsuit against them--not even litigating, just the court fee for saying "I have a lawsuit"--would be more expensive than anything you could recover. Even if you did register and sued for statutory damages, the cost of litigation would probably exceed the recovery you could expect.)
Of course, the big AI companies are already trying to get the government to codify AI training as fair use and sidestep the litigation which doesn't seem to be going entirely their way on this matter (cf. https://arstechnica.com/google/2025/03/google-agrees-with-op...).
In addition, we need to start paying attention to the growing legislation about AI and copyright law. There was an article on HN I think this week (or last) specifically where a judge ruled AI cannot own copyright on its generated materials.
IANAL, but I do wonder how this ruling will be used as a point of reference whenever we finally ask the question "Does material produced by GenAI violate copyright laws?" Specifically if it cannot claim ownership, a right that we've awarded to trees and monkeys, how does it operate within ownership laws?
And don't even get me ranting about HUMAN digital rights or Personified AIs.
Fair use requires transformation. LLM is as transformative as it gets. If I'm on the jury, you're going to have to make new copyright law for me to convict.
I am personally happy to have everyone, people and LLM alike, learn from my wisdom.
> Fair use requires transformation.
No, it doesn't. There are four factors for fair use, and whether the use is transformative is part of one of them. And you don't need to win on all four factors.
> LLM is as transformative as it gets.
The current ruling precedent for "transformative" is the Warhol decision, which effectively says that to look at whether or not something is transformative, you kind of have to start by analyzing its impact on the market (and if you're going "doesn't that import the fourth factor into the first?" the answer is "yes, I don't like it, but it's what SCOTUS said"). By that definition, LLMs are nowhere near "transformative."
Even pre-Warhol, their role as "transformative" is sketchy, because you have to remember that this is using its legal definition, not its colloquial definition.
> If I'm on the jury
Fortunately, for this kind of question, the jury isn't going to be involved in determining fair use, so it doesn't matter what you think.
That's untrue. See my comment elsewhere around here. It doesn't rely on the commercial aspect, though if it's not commercial the bar for fair use is set lower.
The argument in Warhol relies on the fact that the derivative work, i.e., Warhol's painting, is substantially similar in function to the original photograph. If Warhol had used the picture as stuffing for a soft sculpture, it would not infringe.
LLM is closer to the latter than the former.
"so it doesn't matter what you think"
A perfectly fine, if incorrect, reply. But then you have to be a dick. Why?
Copyright is for topics like redistribution of the source material. You can’t add arbitrary terms to a copyright claim that go beyond what copyright law supports.
I think you’re confusing copyright with a EULA. You would need users to agree to the EULA terms before viewing the material. You can’t hide contractual obligations in the footer of your website and call it copyright.
What about if my index says "This are the EULA, by clicking "Next" or "Enter", you are accepting them", and a LLM scrapper "clicks" Next to fetch the rest of the content?
That's how the big software companies have been doing it to us for years, so it does seem like turnabout would be fair play.
"Ah, but for us, it's unenforceable, it's for you little people."
"Copyright? Well if you are a big label, we probably need to talk. Little people? Oh fuck you, just give us your money and creative output."
It's reasonably likely, but not yet settled, that LLM training falls under fair use and doesn't require a license. This is what the https://githubcopilotlitigation.com/ class action (from 2022) is about, and it's still making its way through the courts. This prediction market has it at 12% likely to succeed, suggesting that courts will not agree with you: https://manifold.markets/JeffKaufman/will-the-github-copilot...
> It's reasonably likely, but not yet settled, that LLM training falls under fair use and doesn't require a license.
I would say it's not reasonably likely that LLM training is fair use. Because I've read the most recent SCOTUS decision on fair use (Warhol), and enough other decisions on fair use, to understand that the primary (and nearly only, in practice) factor is the effect on the market for the original. And AI companies seem to be going out of their way to emphasize that LLM training is only going to destroy the market for the originals, which weighs against fair use. Not to mention the existence of deals licensing content for LLM training which... basically concedes the point.
Of the various options, I find a ruling that LLM training is fair use the least likely. More likely is that LLM training is not fair use, that LLM training is not infringing in the first place, or that the plaintiffs can't prove the LLM infringed their work.
I do not read it that way at all. The Goldsmith decision mainly turns on the idea that an artist's protections include those for derivative works. Warhol produced a work that does substantially the same thing as Goldsmith's, i.e., it is a picture that can be viewed.
When talking about parody, they note that the usage as the foundation for parody is always substantially different from the original and thereby allowed, even if it would otherwise infringe. LLMs are always substantially different from the original, too.
If I want to write software that draws that picture exactly, the code would not be a copyright violation. It is text and cannot be printed in a magazine as a picture. If I used it to print a picture that was a derivative work and sold that, it might be.
A large language model has no intersection with the picture or, for that matter, anything that it absorbs. It is possible that someone might figure out how to prompt it to do exactly the same picture as Goldsmith did but fairly unlikely.
Unless you could show that this was easy, common and part of the intent of the LLM creator, I can see no possibility that it is infringing.
> This prediction market has it at 12% likely to succeed
Randos on the internet with a betting addiction are distinctly different from a court of law. I wish people would stop talking about prediction markets as if they mattered.
Participants in a prediction market do not need to be experts for their collective input to be informative.
There's a long history of economic research on the "wisdom of crowds" that backs up their value.
this isn't about copyright but about computer access. the CFAA is extremely broad; if you ban LLM companies from access on grounds of purpose you have every legal right to do so
in theory that legislation has teeth, too. they are not allowed to access your system if you say they are not; authentication is irrelevant.
every GET request to a system that doesn't permit access for training data is a felony
why are we pretending that these gambling sites have any weight on anything
What do you mean by weights?
I'd certainly trust their predictions more than those given by most "experts".
Such a notice is legally meaningless, though. Doubly so if the courts rule that scraping for AI purposes counts as fair use.
This is pretty naive.
The only reason copyright is so strong in the US is that there are big players (Disney, Elsevier) who benefit from it. But big tech is much bigger, and LLMs have created a situation where big tech has a vested interest in eroding copyright law. Both sides are gearing up for a war in the court systems, and it's definitely not a given who will win. But if you try to enter the fray as an individual or small company, you definitely aren't going to win.
The reality is that a lot of these small websites have very permissive licenses. I really hope we don't get to the point where we must all make our licenses stricter.
The reality is that none of these LLM scrapers give a damn about copyright, because the entire AI industry is built on flagrant copyright violation, and the premise that they can be stopped by a magic string is laughable.
You could sue, if you can afford it, meanwhile all of your data is already training their models.
A class action, funded by their rivals could hurt quite a bit, especially for sites damaged monetarily by these LLM scrapers.
Sure, because Meta certainly followed copyright law to the letter when they torrented thousands of copyrighted books from hundreds of published and known authors to train Llama. Forgive me if I doubt a text disclaimer on the page will slow them down.
Unfortunately copyright is no limit to these companies.
Meta is stating in court that knowingly downloading pirated content is perfectly fine (ref https://news.ycombinator.com/item?id=43125840), so they, for one, would have absolutely no issue completely ignoring your copyright notice and stated licensing costs. Good luck affording a legal team to try to force them to pay attention.
Copyright is something for them to beat us with, not the other way around, apparently.
Crawlers visiting every page on your website is not the main problem with the unauthenticated web.
The amount of spam that happens when you let people freely post is a much bigger problem.
To be honest, I feel that Web 2.0 is overrated.
Most content, such as blogs, could be static sites.
For Mastodon and forums, I think user validation is OK and a good way to go.
Do I need to be worried about my bill if I've rented a simple EC2 instance without any fancy autoscaling stuff?
Probably not. Keep an eye on bandwidth usage since you'll be charged for that but you would need to attract an incredible amount of bot traffic for that to add up to anything meaningful.
The thing to watch out for is platforms like Vercel or Google Cloud Run where you get charged more for compute if you attract crawlers, potentially unbounded (make sure to set up spending limits if you can.)
Could an answer here be for smaller websites to convert themselves into chatbots, which could prevent AI scrapers from slurping up all their content and driving up their hosting costs?
no
> I suggest everyone that uses cloud infrastructure for hosting set-up a billing limit to avoid an unexpected bill in case they're caught in the cross-hairs of a negligent company. All the abusers anonymize their usage at this point, so good luck trying to get compensated for damages.
This is scary
What's scarier is that most of the big clouds don't even let you set up a billing limit.
Pretty soon virtually everything will be paywalled. Ironically, it will provide us with a good metric for finding out whether AGI has arrived or not: when it does, paywalling will stop working, because AGI could derive more value from accessing things and will thus outbid us.
If you don't want someone to access your website, don't put it online
Everyone is (rightfully) outraged, but this is essentially nothing new. Asshat capitalists have been externalizing the costs of their asshat moneymaking schemes onto the little guy since approximately forever.
Deregulation is ultimately antithetical to our personal freedom.
I just hope the spirit of the internet that I grew up with can be rescued, or reincarnated somehow...
Yet another entry in the long and shameful history of Silicon Valley abusing the public square for its own profit (or in this case, fantasies of profit), while the rest of us just have to learn to live with it because the justice system simply will not even try to give us recourse.
Move fast and break things apparently has a bonus clause for the things you break not being your responsibility to fix.
I don't think the justice system is the one to blame here. Right up until LLMs and their huge datamining operations appeared, everyone in tech was strongly for unrestricted scraping. Everybody here cheered the LinkedIn decision [0], saying "it's on the public web: if you didn't want it to be scraped, you should've put it behind authentication". LLMs change nothing about the legal landscape; they've just convinced everyone on an emotional level that unrestricted scraping is no longer an automatic good. It's not the justice system's job to react to such vibe shifts; the laws themselves have to be changed.
[0]: https://news.ycombinator.com/item?id=21241395
I'm not talking about the ethics of scraping itself. I think scraping is fine from an ethics perspective for exactly that reason. I think LLM companies, however, are scraping ineptly and with poorly implemented tooling, which is causing problems for the websites they're targeting, and that sucks ass and they should be held liable.
On the legal end, though, I do think there are a few things that should be done:
* Scrapers should be CLEARLY and CORRECTLY identified as what they are, and who is dispatching them. Changing user agents to get around blocks should never be permitted. If you only get a certain amount of content or a certain subset of pages when you identify as a scraper, that is a choice the website operator is making, and it should be respected.
* Scrapers MUST OBEY robots.txt. We didn't create that for a fun hacker weekend. It's an important technical component of how we organize websites and how we want them crawled, if we want them crawled at all. It should be the first stop for any scraper on any website (see the sketch at the end of this comment), and again, it should be respected.
* Scrapers should always meter their traffic with respect to the website owner. Pounding an entire website's library of content request after request with only milliseconds between them is, to put it bluntly, being a fucking asshole. And not just to the owner, but to anyone else attempting to use the site at the time.
If a website operator configures their site incorrectly, so that pages they don't want scraped are or pages they do want scraped aren't, then that is on them and they need to fix it. It is not in the scraper's purview to end-run around that configuration to "be real sure" it got everything it was meant to, and it is especially not its purview to take things the operator has explicitly tried to keep from it.
And yes, all of these things should be legally actionable, with financial penalties attached; and for serial offenders, we should have a registry of scraper bots that are disallowed entirely because they are acting in bad faith.
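To make those three rules concrete, here is a minimal sketch of a well-behaved fetch loop in Python, using only the standard library. The bot name and URLs are hypothetical, and a real crawler would add error handling and per-host scheduling on top of this:

    # A truthful User-Agent, robots.txt honored, and deliberate pacing.
    import time
    import urllib.robotparser
    import urllib.request

    USER_AGENT = "ExampleBot/1.0 (+https://example.com/bot-info)"  # hypothetical identity

    # First stop: fetch and parse the site's robots.txt.
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url("https://example.com/robots.txt")
    robots.read()

    # Honor a Crawl-delay directive if one is set; otherwise default to 2 seconds.
    DELAY = robots.crawl_delay(USER_AGENT) or 2.0

    def polite_fetch(url):
        # Skip anything the operator has disallowed for our (real) user agent.
        if not robots.can_fetch(USER_AGENT, url):
            return None
        request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(request) as response:
            body = response.read()
        time.sleep(DELAY)  # meter traffic instead of hammering the server
        return body

None of this is exotic; being a good citizen costs a couple dozen lines.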
Scraping is only part of the problem with LLMs. I don't care if you scrape my public data. The problem is re-publication, without even so much as attribution. LLMs should not be taking credit for my work.
I feel like there have been a lot of assumptions going around, but not much testing. For instance, somebody said that a lot of these bots are coming from Chinese IP ranges. Is that true? What percentage, versus, say, Amazon regions? I would love more data!
Frankly, I don't care.
I didn't give any LLM permission to train on my data, Chinese or otherwise. It's theft and I have zero recourse to do anything about it.
If you don't want others to use your data, perhaps you should have kept it private?
I want people to use my data.
I don't want people to redistribute my data without attribution, claiming it as their own.
And this results in the end of the internet.
For some reason I am not really moved by a lot of the hand-wringing I am seeing lately.
It's not a binary thing to me: LLMs are not god, but even without AGI, they have proven wildly useful to me. Calling them "shitty chat bots" doesn't sway me.
Further I have always assumed that everything that I post to the web is publicly accessible to everyone/everything. We lost any battle we thought we could wage some 2+ decades ago when web crawlers started hoovering up data from our sites.
This article isn’t about that. It’s about the externalized costs that LLM companies are pushing onto webmasters because of their aggressive scraping. It’s one thing to believe that LLMs are a good thing, it’s another thing to believe that individuals and cooperative groups that run small internet services ought to be the ones to pay for that good.
It's not about secret vs. public, it's about resource overload on the websites. Existing crawlers have so far mostly respected robots.txt; the LLM crawlers don't.
You, as a user, might not care, but as servers keep going down, more and more website owners are starting to block LLMs. Good riddance; hopefully all the good stuff gets locked down.
Or to use an analogy, your comment is similar to: "sure those delivery vans violate speed limits and occasionally hit the pedestrians. I don't care, those fast deliveries have been proven wildly useful to me"
I have published a lot of content about a particular topic online, and I want it to be publicly accessible. A lot of people use it to create YouTube videos, and that's fine (a lot of them even cite me). I have a problem with LLMs profiting from it.
Which, now I realize, is no different from people making YouTube videos. I feel there is a difference but I don't know how to explain it. Maybe there isn't. Ouch, writing this comment was not a good idea...
This difference in emotional reaction comes from the effort involved in the process. We see YouTube video creation as a fundamentally difficult exercise (to do well) that results in a singular product (one video); any additional content would need an ongoing investment of time and money from the creator. The LLMs, though, require no ongoing investment beyond the first training run. That is probably why you have a problem with it: they're an extremely high-leverage way of taking advantage of content.
You are confronted with automation.
Individuals who have to do work to turn your content into their own content are qualitatively different from automation trivially doing whatever it likes.
What you are feeling is described in The Work of Art in the Age of Mechanical Reproduction [1]
1. https://en.wikipedia.org/wiki/The_Work_of_Art_in_the_Age_of_...
I agree even if I have mixed feelings on it.
To me this feels almost like the news complaining that they want a "link tax." Weren't their headlines and summaries used too? It seems inconsistent to say that AI scraping is not okay, but that news companies should also not be entitled to their link tax. It's okay to index, but not that kind of index.
It seems pretty cut-and-dried to me: the website owner should opt out if they don't like the deal (being indexed, in this case).
In the "link tax" case, there were plenty of trivial ways to opt out of headline usage - robots.txt, http headers, http tags. The problem was newspapers did not want to opt out (as they were benefiting from Google themselves), so they wanted a 3rd option. Which was pretty stupid of course - if you don't like the deal, don't take it; suing the offering party for a better deal is not a good long-term strategy.
In the AI case, there is no opt-out. All those websites have already indicated they want to opt out via robots.txt, but the AI companies ignore robots.txt, change user agents, fake IPs, and so on: the things normally done by shady malware-ish services rather than multi-billion-dollar companies.
It really bothers me when people don't see the difference between those two cases.