It is mentioned later in the article, but I think it's important to clearly draw a distinction between cases where a) the "offender" is using the licensed work within the letter of the license but not the spirit, and b) the "offender" has broken both the letter and the spirit of the license.
I've licensed multiple repositories under MIT, written under CC-BY, and published games under ORC. All of those licenses require attribution, something that AI, for example, explicitly ignores. In those situations "Wait, no, not like that" isn't "I didn't expect you'd use it this way"; it's "you weren't authorized to use it this way."
Listing all of the creators from whose attribution-licensed works an LLM (potentially) derived an output would seem to satisfy the letter of such licenses, but it is not clear that such would satisfy the spirit (which seems to assume a stronger causal link and a more limited number of attributions). If creators can be grouped outside of the creator naming explicitly associated with the works, this could degrade into "this work is derived from the works of humanity"; however, listing all human beings individually does not seem _meaningfully_ different and seems to satisfy the attribution requirement of such licenses.
From what little I understand of LLMs, the weight network developed by training on a large collection of inputs is similar to human knowledge in that some things will be clearly derived (at least in part) from a limited number of inputs but others might have no clear contributor density. If I wrote a human "superiority" science fiction story, I could be fairly confident that Timothy Zahn and Gordon R. Dickson "contributed"; however, this contribution would not be considered enough to violate copyright and require licensing. Some LLM outputs clearly violate copyright (e.g., near verbatim quotation of significant length), but other outputs seem to be more broadly derived.
If the law treats LLMs like humans ("fairly"), then broad derivation would not seem to violate copyright. This seems to lead toward "AI rights". I cannot imagine how concepts of just compensation and healthy/safe working conditions would apply to an AI. Can a corporation own a computer system that embodies an AI, or is that slavery?
If the law makes special exceptions for LLMs, e.g., adjusting copyright such that fair use and natural learning only apply to human persons, then licensing would be required for training. However, attribution licenses seem to have the above-mentioned loophole. (That this loophole is not exploited may be laziness, or concern about admitting that following the license is required at all, which would make less openly licensed and unlicensed works poisonous.)
If the purpose of copyright is to "promote the useful arts", then the law should reflect that purpose. Demotivating human creators and the sharing of their works seems destructive to that purpose, but LLMs are also enabling some creativity. Law should also incorporate general concepts such as equality under the law. LLMs also seem to have the potential for power concentration, which is also a concern for just laws.
Perhaps I am insufficiently educated on the tradeoffs to see an obvious solution, but this seems to me like a difficult problem.
Think of it in a positive way: they also don't attribute all-rights-reserved works.
Big tech has no respect for licenses, or for the law itself, as Uber and the like have shown.
People talking about licenses like they have some courtroom legitimacy is hilarious. Licenses are like patents: they are weapons of the large corporation, to be used against other large corporations or against the people.
Of course you can try to litigate for ten years against a big corporation with lawyers on retainer; good luck. You might even get an "Erin Brockovich" movie script out of it, but we're seeing the rule of law and the legitimacy of the courts degrade rapidly and become increasingly corrupt, once over the course of years, now by the day.
This reminds me of “Fuck you, pay me”, a talk[0] given by Mike Monteiro on contract work (I believe the title is based on a quote from Goodfellas[1]).
[0] https://m.youtube.com/watch?v=jVkLVRt6c1U
[1] https://m.youtube.com/watch?v=P4nYgfV2oJA
Great post. I love the vampirism metaphor.
The internet as an open resource has been a tremendous boon to society as a whole. AI, likewise, acts as a force multiplier on top of this knowledge. You can learn anything. But AI also depletes the resources on which it builds.
An obvious way to counteract this is for the AI companies to give back generously through monetary donations or, at the very least, attributions. But unfortunately we see exactly the opposite.
> But AI also depletes the resources on which it builds.
The article author is clearly assuming that, but I'm not sure whether it's even true to any meaningful extent. How many contributors to free/open content resources are even bothered that their work might end up being used for AI training?
Also, maybe AI firms should pay for the scanning and OCR text-extraction of existing paper-backed resources. There's a lot of, e.g., old academic research that's still not meaningfully available online, and much of it is even free of copyright worldwide. If you care about ensuring that your AIs are adequately well-read, this is a bottleneck that might be especially easy to address.
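As a rough sketch of how small the extraction step is once the scanning itself is funded, here is a minimal OCR pipeline using off-the-shelf tools (Tesseract via pytesseract); the directory name and file layout are made up for illustration.

```python
# Minimal sketch: OCR every scanned page image in a directory and join the text.
# Assumes the Tesseract engine is installed; "old_journal_scans/" is hypothetical.
from pathlib import Path

from PIL import Image
import pytesseract


def extract_text(scan_dir: str) -> str:
    """Concatenate OCR output from every PNG page scan in a directory."""
    pages = []
    for path in sorted(Path(scan_dir).glob("*.png")):
        pages.append(pytesseract.image_to_string(Image.open(path)))
    return "\n\n".join(pages)


if __name__ == "__main__":
    print(extract_text("old_journal_scans/"))
```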
If you train your AI on the commons, everything it generates should be in the commons. And all your profits should be shared with everyone.
If running your AI is incompatible with respecting copyright and intellectual property then you should not get to own a single bit of its output.
It's not trained on "the commons", it's trained on a large number of individual products with their own ownership and licensing. If copyrights are being violated the restitution will go to those owners, not to "everyone".
>If you train your AI on the commons, everything it generates should be in the commons.
Isn't AI output not copyrightable already?
This is a perfect modern example of the Tragedy of the Commons, where the absence of governance mechanisms or social contracts around a shared resource (open knowledge) leads to exploitation that threatens the resource's sustainability for everyone.
Even if implementing a full game-theory solution is challenging, a minimum viable approach could combine:
- Wikimedia's Enterprise API model for high-volume users
- Technical measures to identify and throttle non-contributing scrapers (a minimal sketch follows below)
- Public transparency reports on AI company usage and contributions
- Industry certification program for "commons-friendly" AI development
This hybrid approach uses game theory principles to realign incentives while being practically implementable with current technologies and organizational structures.
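To make the second bullet above concrete, here is a minimal sketch (not any real API) of what identifying and throttling non-contributing scrapers could look like: a per-client token bucket, with an allowlist for clients that contribute back, e.g. through paid Enterprise API keys. The names here (CONTRIBUTING_KEYS, should_serve, the example key) are all hypothetical.

```python
# Minimal sketch of "throttle non-contributing scrapers": token-bucket rate
# limiting per client, with an allowlist for contributing/paying partners.
import time
from dataclasses import dataclass, field

# Hypothetical allowlist of API keys belonging to contributing partners.
CONTRIBUTING_KEYS = {"enterprise-key-123"}


@dataclass
class Bucket:
    capacity: float = 10.0      # max burst size, in requests
    refill_rate: float = 0.5    # tokens added per second
    tokens: float = 10.0
    last: float = field(default_factory=time.monotonic)

    def allow(self) -> bool:
        # Refill proportionally to elapsed time, then try to spend one token.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


buckets: dict[str, Bucket] = {}


def should_serve(client_id: str, api_key: str | None = None) -> bool:
    """Serve contributing clients unconditionally; rate-limit everyone else."""
    if api_key in CONTRIBUTING_KEYS:
        return True
    return buckets.setdefault(client_id, Bucket()).allow()


if __name__ == "__main__":
    # An anonymous bulk scraper gets cut off once its burst allowance is spent.
    print("scraper served:", sum(should_serve("203.0.113.7") for _ in range(50)), "/ 50")
    # A recognized contributing partner is never throttled.
    print("partner served:", sum(should_serve("198.51.100.9", "enterprise-key-123") for _ in range(50)), "/ 50")
```

In practice identification would be messier (user agents, IP ranges, behavioral fingerprints), but the point is the incentive structure: contribute or get throttled.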
> absence of governance mechanisms or social contracts around a shared resource (open knowledge) leads to exploitation that threatens the resource's sustainability for everyone.
An AI training on a source does not deplete the source in the digital realm. It's not like bits run out or rot.
The training could potentially remove a source of commercial revenue, such as advertising, but that is only tangentially related to the knowledge/source itself. And it is not the commercial revenue that drove the initial development of said knowledge, even if the creator(s) of the knowledge got paid after the fact (let alone not paid at all, as with Wikipedia editors).
Therefore, it's incorrect to associate the tragedy of the commons with AI training draining the commons.
I always appreciate a good post by Molly White :D
This is such good writing, and manages to offer a nuanced and informative new angle on an issue which has already been discussed at great length.
If you care about software freedom, you need to make all your software AGPL.
MIT is the "wait, no, not like that" license and GPL is a half-measure.
Non-commercial licenses are fine if you also provide a commercial option - who cares what the OSI thinks. (And you might want to look up who's a member of the OSI)
Restricting what people can do with the software you write doesn't sound very free
If you care about software, why would non-commercial licenses be fine under any condition?
The first rule of free software is that it can be used for any goal, including commercial ones.
> MIT is the "wait, no, not like that" license
I don't get this. Someone who releases their work under MIT, GPL, or AGPL allows selling a copy on Amazon and could think "wait, no, not like that" regardless.
I think the reasonable stance is to ask (kindly, humanely) that one not do such and such a thing, even if it's legal.
What I worry about is once the robots become the main source of information then how will the corps restrict what information is fed into them to support whatever bias they have.
For example, just yesterday someone posted "noam chomsky is a genocide denier", so I went internet sleuthing to see what they were talking about. I first asked Google and then ended up on the "Bosnian genocide denial" Wikipedia page. I read the argument and checked the sources and concluded that, maybe, someone could make that claim.
Today, in response to TFA, I asked DeepSeek and received a well-rounded and, IMHO, unbiased response to the same question, which summarized the arguments from both sides. The only problem is it cites no sources, so you just have to trust the response or do as I did yesterday and go to Google.
Personally, if someone makes an extraordinary claim I'm going to go digging to find out what they're talking about, whether their argument is based on fact, and whether their conclusion can be drawn from the facts. Take that ability away and we're just a bunch of sheep for the Silicon Valley Billionaires Club to fleece.