It is mentioned later in the article, but I think it's important to clearly draw a distinction between cases where (a) the "offender" is using the licensed work within the letter of the license but not the spirit, and (b) the "offender" has broken both the letter and the spirit of the license.
I've licensed multiple repositories under MIT, written under CC-BY, and published games under ORC. All of those licenses require attribution, something that AI, for example, explicitly ignores. In those situations "Wait, no, not like that" isn't "I didn't expect you'd use it this way"; it's "you weren't authorized to use it this way."
Listing all of the creators from whose attribution-licensed works an LLM (potentially) derived an output would seem to satisfy the letter of such licenses, but it is not clear that such would satisfy the spirit (which seems to assume a stronger causal link and a more limited number of attributions). If creators can be grouped outside of the creator naming explicitly associated with the works, this could degrade into "this work is derived from the works of humanity"; however, listing all human beings individually does not seem _meaningfully_ different and seems to satisfy the attribution requirement of such licenses.
From what little I understand of LLMs, the weight network developed by training on a large collection of inputs is similar to human knowledge in that some things will be clearly derived (at least in part) from a limited number of inputs but others might have no clear contributor density. If I wrote a human "superiority" science fiction story, I could be fairly confident that Timothy Zahn and Gordon R. Dickson "contributed"; however, this contribution would not be considered enough to violate copyright and require licensing. Some LLM outputs clearly violate copyright (e.g., near verbatim quotation of significant length), but other outputs seem to be more broadly derived.
If the law treats LLMs like humans ("fairly"), then broad derivation would not seem to violate copyright. This seems to lead toward "AI rights". I cannot imagine how concepts of just compensation and healthy/safe working conditions would apply to an AI. Can a corporation own a computer system that embodies an AI, or is that slavery?
If the law makes special exceptions for LLMs, e.g., adjusting copyright such that fair use and natural learning only apply to human persons, then licensing would be required for training. However, attribution licenses seem to have the above-mentioned loophole. (That this loophole is not exploited may be laziness or concern about admitting that following the license is required — which makes less openly licensed/unlicensed works poisonous.)
If the purpose of copyright is to "promote the useful arts", then the law should reflect that purpose. Demotivating human creators and the sharing of their works seems destructive to that purpose, but LLMs are also enabling some creativity. Law should also incorporate general concepts such as equality under the law. LLMs also seem to have the potential for power concentration, which is also a concern for just laws.
Perhaps I am insufficiently educated on the tradeoffs to see an obvious solution, but this seems to me like a difficult problem.
From Holub on Patterns (2004): patterns are discovered, not invented. The implementation of a pattern is an idiom, which may or may not be idiomatic within a given community of practice.
If it can be shown that multiple people independently created something, the artifact is not the pattern itself--rather, the pattern is recognized because of the many similar implementations.
LLMs create (or re-create) and derive idioms, based on weights that are themselves idioms (probabilistic patterns). So the most we can say is that an AI may understand patterns of color theory, or idiomatic execution (art style)--but that is all.
---
1. They are willful, purposeful creatures who possess selves.
2. They interpret their behavior and act on the basis of their interpretations.
3. They interpret their own self-images.
4. They interpret the behavior of others to obtain a view of themselves, others, and objects.
5. They are capable of initiating behavior so as to affect the views that others have of them and that they have of themselves.
6. They are capable of initiating behavior to affect the behavior of others toward them.
7. Any meaning that children attach to themselves, others, and objects varies with respect to the physical, social, and temporal settings in which they find themselves.
8. Children can move from one social world to another and act appropriately in each world.
-- The Private Worlds of Dying Children (1978)
Think of it in a positive way: they also don't attribute all-rights-reserved works.
Big tech has no respect for licenses, or for the law itself, as Uber and the like have shown.
People talking about licenses like they have some courtroom legitimacy is hilarious. Licenses are like patents: they are weapons of the large corporation, to be used against other large corporations or against the people.
Of course you can try to litigate for ten years against a big corporation with lawyers on retainer; good luck. You might even get an "Erin Brockovich" movie script out of it, but we're seeing the rule of law and the legitimacy of the courts degrade rapidly and become increasingly corrupt: what once happened over years now happens by the day.
This reminds me of “Fuck you, pay me” a talk[0] given by Mike Monteiro on contract work (I believe the title is based on a quote from Goodfellas[1]).
[0] https://m.youtube.com/watch?v=jVkLVRt6c1U
[1] https://m.youtube.com/watch?v=P4nYgfV2oJA
Great post. I love the vampirism metaphor.
The internet as an open resource has been a tremendous boon to society as a whole. AI, likewise, acts as a force multiplier on top of this knowledge. You can learn anything. But AI also depletes the resources on which it builds.
An obvious way to counteract this is for the AI companies to give back generously through monetary donations or, at the very least, attributions. But unfortunately we see exactly the opposite.
> But AI also depletes the resources on which it builds.
The article author is clearly assuming that, but I'm not sure whether it's even true to any meaningful extent. How many contributors to free/open content resources are even bothered that their work might end up being used for AI training?
Also, maybe AI firms should pay for the scanning and OCR text extraction of existing paper-backed resources. There's a lot of, e.g., old academic research that's still not meaningfully available online, and much of it is even free of copyright worldwide. If you care about ensuring that your AIs are adequately well-read, this is a bottleneck that might be especially easy to address.
> How many contributors to free/open content resources are even bothered that their work might end up being used for AI training?
StackOverflow’s response is a clear counterpoint.
I don’t think that’s what this would look like. It’s more like: people won’t even know Wikipedia exists or that there is something to contribute to, or that the model gained its knowledge from this resource. That’s why this pattern is so disingenuous in my opinion. When I was young we were surprised that Wikipedia existed. Future generations might not know that Wikipedia exists, or that you can contribute to it.
Current frontier models couldn’t exist in this form or wouldn’t exist at all without people putting in the work of writing down what they knew for free. And the model creators not even paying lip service to this, and instead saying they will replace Wikipedia, is hubristic and a very clear example of the tragedy of the commons.
> And the model creators not even paying lip service to this, and instead saying they will replace Wikipedia
Which model creators are saying this? It's very easy to put this assertion to the test, btw: just pick your favorite Wikipedia article, ask an LLM to "improve" the writing, and check how much stuff it ends up rewriting in a confusing way, or even gets outright wrong. Compared to your average high-quality Wikipedia article, an LLM is just a very confused parrot.
I think that within a year, LLMs will be better at writing Wikipedia articles than the p90-quality Wikipedia page is today.
They have access to _huge_ volumes of information—books, research papers, PhD dissertations, newsletters from experts, peer-reviewed articles. They're getting better every day at organizing it. Claude has a (beta) citations API, which forces the LLM to directly cite sources and use direct quotations.
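For the curious, here's a minimal sketch of what using that citations feature looks like from the Python SDK, based on my reading of the beta docs; the model name, the exact field names, and the sample document are my own assumptions and may have drifted since.

```python
# Rough sketch of Anthropic's (beta) citations feature; field names are from my
# reading of the docs and may have changed, so treat this as illustrative only.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

source_text = "Photosynthesis converts light energy into chemical energy. ..."  # hypothetical source document

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # assumption: any current model name works here
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {"type": "text", "media_type": "text/plain", "data": source_text},
                "title": "Photosynthesis (source article)",
                "citations": {"enabled": True},  # ask the model to cite spans of this document
            },
            {"type": "text", "text": "Summarize this article in two sentences, citing your sources."},
        ],
    }],
)

# With citations enabled, each text block in the reply can carry a list of
# citation objects pointing back at character ranges of the supplied document.
for block in response.content:
    if block.type == "text":
        print(block.text)
        for cite in (getattr(block, "citations", None) or []):
            print("  cited:", cite.cited_text)
```

The interesting part is that the citations point at character ranges of documents you supplied, so they're cheap to verify, unlike a free-form "according to Wikipedia" in the generated prose.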
They'll get better and better at managing translations: maybe not translating themselves, but finding resources in other languages, running them through a translation tool, then pulling in that information.
Will they be perfect? No, but good lord, Wikipedia isn't either. Will they hallucinate? Probably sometimes, but, again, the median Wikipedia author does not have a New Yorker fact checker at their side.
I think we overestimate how good humans are at this. And it's far easier to validate cited sources than it is to consume and craft the original writing.
If you train your AI on the commons, everything it generates should be in the commons. And all your profits should be shared with everyone.
If running your AI is incompatible with respecting copyright and intellectual property then you should not get to own a single bit of its output.
It's not trained on "the commons", it's trained on a large number of individual products with their own ownership and licensing. If copyrights are being violated the restitution will go to those owners, not to "everyone".
>If you train your AI on the commons, everything it generates should be in the commons.
Isn't AI output not copyrightable already?
This is a perfect modern example of the Tragedy of the Commons, where the absence of governance mechanisms or social contracts around a shared resource (open knowledge) leads to exploitation that threatens the resource's sustainability for everyone.
If implementing a full game-theory solution is challenging, a minimum viable approach could combine:
- Wikimedia's Enterprise API model for high-volume users
- Technical measures to identify and throttle non-contributing scrapers (see the sketch after this comment)
- Public transparency reports on AI company usage and contributions
- An industry certification program for "commons-friendly" AI development
This hybrid approach uses game theory principles to realign incentives while being practically implementable with current technologies and organizational structures.
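To make the throttling bullet above concrete, here is a toy sketch of what identifying and rate-limiting non-contributing scrapers could look like at the edge; the user-agent substrings, the partner allowlist, and the thresholds are all made-up placeholders, not a real policy.

```python
# Toy sketch of "identify and throttle non-contributing scrapers". Everything
# here - bot user-agent substrings, the partner allowlist, the rate limits -
# is a made-up placeholder, not a real policy.
import time
from collections import defaultdict

KNOWN_AI_CRAWLERS = {"GPTBot", "CCBot", "ClaudeBot"}      # hypothetical UA substrings to watch
CONTRIBUTING_PARTNERS = {"enterprise-key-123"}            # hypothetical keys of paying/attributing users
READER_LIMIT_PER_MIN = 30                                 # ordinary clients
CRAWLER_LIMIT_PER_MIN = 5                                 # non-contributing AI crawlers

_request_log: dict[str, list[float]] = defaultdict(list)  # client key -> recent request timestamps

def should_throttle(user_agent: str, api_key: str | None, client_ip: str) -> bool:
    """Return True if this request should get a 429 instead of content."""
    # Contributing partners (e.g. Wikimedia Enterprise-style customers) are never throttled.
    if api_key in CONTRIBUTING_PARTNERS:
        return False

    # Identify likely AI scrapers by user agent; everyone else is limited per IP.
    is_ai_crawler = any(bot in user_agent for bot in KNOWN_AI_CRAWLERS)
    key = f"bot:{user_agent}" if is_ai_crawler else f"ip:{client_ip}"

    # Keep a one-minute sliding window of request timestamps for this client.
    now = time.time()
    window = [t for t in _request_log[key] if now - t < 60.0]
    window.append(now)
    _request_log[key] = window

    limit = CRAWLER_LIMIT_PER_MIN if is_ai_crawler else READER_LIMIT_PER_MIN
    return len(window) > limit

# Example: a non-contributing crawler hammering the endpoint trips the limit quickly.
for _ in range(10):
    throttled = should_throttle("Mozilla/5.0 (compatible; GPTBot/1.0)", None, "203.0.113.7")
print("throttle the 10th request?", throttled)
```

The point isn't the specific mechanism; it's that the carrot (never throttling contributing partners) and the stick (tight budgets for known non-contributing crawlers) realign incentives without blocking ordinary readers.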
> absence of governance mechanisms or social contracts around a shared resource (open knowledge) leads to exploitation that threatens the resource's sustainability for everyone.
An AI training on a source does not deplete the source in the digital realm. It's not like bits run out or rot.
The training could potentially remove a source of commercial revenue, such as advertising, which is only tangentially related to the knowledge/source. And it is not the commercial revenue that drove the initial development of said knowledge, even if the creator(s) of the knowledge got paid after the fact (let alone not paid at all, as with Wikipedia editors).
Therefore, it's incorrect to associate the tragedy of the commons with AI training draining the commons.
I always appreciate a good post by Molly White :D
This is such good writing, and manages to offer a nuanced and informative new angle on an issue which has already been discussed at great length.
If you care about software freedom, you need to make all your software AGPL.
MIT is the "wait, no, not like that" license and GPL is a half-measure.
Non-commercial licenses are fine if you also provide a commercial option - who cares what the OSI thinks. (And you might want to look up who's a member of the OSI)
GenAIs don't care whether your scrapables were licensed under AGPLv3 or GFDLv1 or were even proprietary; they ingest it all and become at most MIT, then spit out a lesser parody into the public domain.
Under some versions of copyright law, and interpretations conceived before the emergence of generative AI, this is either considered fair use and/or legally exempted as scientific research, because the use of even copyrighted data was highly transformative; an AI even capable of regurgitating its dataset was previously unheard of.
This aspect of generative AI is controversial, but the discussion of whether that interpretation will ultimately be upheld has been moving at a glacial pace (as in, slowly).
Restricting what people can do with the software you write doesn't sound very free
Restricting people's freedom to restrict other people's freedom ensures maximum freedom overall. That's why we have prisons for thieves and killers.
>Restricting what people can do with the software you write doesn't sound very free
It depends on the POV: user freedom or developer/publisher freedom.
To make it easy to understand, I remember a quote from some book where a Frenchman and an American were debating which country is more free, and the American says, "We are more free, since we are free to own slaves."
If you care about software, why would non-commercial licenses be fine under any condition?
First rule of free software is any goal, including commercial ones.
> MIT is the "wait, no, not like that" license
I don't get this. Someone who releases their work under MIT, GPL, or AGPL allows selling a copy on Amazon, and could think "wait, no, not like that" regardless.
I think the reasonable stance is to (kindly, humanely) ask that one not do such-and-such a thing even if it's legal.
"First rule of free software is any goal, including commercial ones."
Yeah, and that's what we disagree with. :P If Creative Commons has NC licences for art, why shouldn't software have them? Apart from programmers being too meek to assert their rights.
And yes, the OSI is mostly hyperscalers; that's fairly commonly known. They aren't interested in protecting the commons in the slightest.
Even the JSON licence ("don't be evil") and stuff like that are better than MIT and friends - they give bigcorp lawyers enough of a headache to make them think twice before stealing your stuff.
> before stealing your stuff.
You were at least consistent up until this point, but this phrasing runs directly counter to the rest of your thesis. It's not stealing to use software that you put out as FOSS to be used by anyone for any purpose. If you didn't intend to do that, it's on you to have picked a license that reflected your intentions and to accept the consequences of that choice (most likely less interest in using it from everyone, not just megacorps).
There's no moral imperative to respect terms of use that existed only in the head of the developer—on the contrary, it's an immoral bait-and-switch to release your code as FOSS and then throw a fit when someone uses it to make money.
It's a derivative work and stripping the license violates it. Why do people repeat this stupid corporate propaganda?
I'm really confused—are you talking about AI training? I'm talking about corporations building systems on top of FOSS.
Also, why did you feel the need to create a throwaway for this comment?
MIT explicitly gives the right to sublicense. That means you give certain rights to Alice, who gives fewer rights to Bob. Bob isn't allowed to give away copies of the software Alice gives him, because Alice, being a smart businesswoman, made sure Bob's license agreement is a proprietary one. She's not violating MIT, because she put your name and a copy of the MIT license in notices.txt, but that license doesn't actually govern the software Bob received.
I think when you take something from the public domain and make it proprietary, that is close enough to stealing that it's appropriate to use the word colloquially.
Only if what you took from the public domain is actually taken—as in, no longer available to the public. The scarcity model that we use for the literal commons (the shared fields in a village) doesn't apply when we're talking about bits.
I think TFA is onto something by pointing out that there are very real ways in which some of the recent use of the commons falls into the realm of abuse that harms the commons. But most of the kinds of usage that have people giving up on FOSS don't actually fall into that category.
You are entirely right about my word choice, good point! (it's a bit ironic in retrospect)
However, let's not pretend that choosing a licence is a fully informed decision free of any kind of pressure. If you pick a non-OSI licence, that has social costs (as you said, less interest from other developers, for example).
The problem is two-sided: both the companies exploiting the FOSS landscape, and the participants in the FOSS landscape concerning themselves more with upholding the status quo than with trying to do something about the problem.
P.S.: The OSI are not even sellouts - they mostly consist of exactly those corporations themselves. The FSF is much better, but the FSF's philosophy was mostly informed by RMS not being able to fix his printer in the '80s... times change, y'know? A philosophy/worldview that worked for the '80s and '90s might not be appropriate for the 2020s.
> If you pick a non-OSI licence, that has social costs (as you said, less interest from other developers for example)
Yes. People make a decision to choose from the FOSS licenses because being seen as FOSS is valuable to them. That comes with tradeoffs, and it's unethical to expect to receive the benefits but not the drawbacks of your chosen license.
> P.S.: The OSI are not even sellouts - they mostly consist of exactly those corporations themselves.
I know. I find the bellyaching about the "spirit of Open Source" to be pretty ironic given that Open Source exists to "dump the moralizing and confrontational attitude that had been associated with "free software" in the past and sell the idea strictly on the same pragmatic, business-case grounds that had motivated Netscape" [0].
> those companies exploiting the FOSS landscape
Exploiting how? What harm does $MEGACORP using a FOSS project do to the FOSS project? Does it hurt that it gives them more credibility? For there to be exploitation there has to be a quantifiable harm to the exploited victim—it has to be win-lose, not win-neutral or win-win. So where's the loss to the FOSS maintainers as victims?
[0] http://web.archive.org/web/20071115150105/https://opensource...
If it's GPL, or AGPL, someone can sell a copy on Amazon, but it's pretty pointless because the person who buys it is allowed to give away more copies for free.
If it's MIT, the person who buys it ISN'T allowed to give away more copies for free if the seller doesn't want them to. MIT means SOMEONE ELSE can (for all practical purposes) copyright YOUR software, and all you get is your name in a notices.txt file.
Maybe that's a rule you have but don't speak for the rest of us.
They should have put Free Software in capitals to be more clear, but they're referring to the Four Essential Freedoms as defined by the Free Software Foundation [0]:
> The freedom to run the program as you wish, for any purpose (freedom 0). ... “Free software” does not mean “noncommercial.” On the contrary, a free program must be available for commercial use, commercial development, and commercial distribution. This policy is of fundamental importance—without this, free software could not achieve its aims.
You're welcome to come up with a different set of coherent rules for an ethical model of software development, but to avoid confusion it would be best to use a different label than "free software" so we don't overload the term with conflicting definitions.
[0] https://www.gnu.org/philosophy/free-sw.en.html#four-freedoms
What I worry about is this: once the robots become the main source of information, how will the corps restrict what information is fed into them to support whatever bias they have?
For example, just yesterday someone posted "noam chomsky is a genocide denier", so I went internet sleuthing to see what they were talking about. I first asked Google and then ended up on the "Bosnian genocide denial" Wikipedia page. I read the argument, checked the sources, and concluded that, maybe, someone could make that claim.
Today, in response to TFA, I asked DeepSeek and received a well-rounded and, IMHO, unbiased response to the same question, which summarized the arguments from both sides. The only problem is that it cites no sources, so you just have to trust the response, or do as I did yesterday and go to Google.
Personally, if someone makes an extraordinary claim I'm going to go digging to find out what they're talking about, if their argument is based on fact and if you can draw their conclusion from the facts. Take that ability away and we're just a bunch of sheep for the Silicon Valley Billionaires Club to fleece.