In recent months, various artists, content creators, and newspapers have claimed that AI companies should respect the copyright of the works used to train LLMs.1 Some have already filed lawsuits. The main question for regulators is: should the material used in AI training datasets be covered by copyright?
In the US (and in other countries such as the UK and Australia), the automated scraping of information has traditionally fallen under the category of “fair use”. For instance, scanning material from websites and displaying it in a search engine has been judged a “transformative use” (one of the criteria used to assess whether a use is fair). AI companies argue that using databases to produce new material (such as pictures or stories) is transformative as well, and therefore legal.
In the EU, the 2019 Copyright Directive gave people publishing content online the right to opt out of data scraping (Article 4(3)), even though widespread implementation of this right depends on a standardisation process that is still lagging. The AI Act confirmed that this right also applies to data scraping for training LLMs, and some AI companies seem to favour this “opt-out” approach.
In the coming months and years, courts will rule on this topic and policymakers will face pressure to pass new legislation. In making these decisions, they should consider at least three trade-offs.2 Strengthening copyright protections over AI training datasets may come at the cost of the following:
1. Accuracy
I do not believe that LLMs should be evaluated (and regulated) as instruments for reliable information. However, if we consider the quality of information provided by LLMs a social good, restricting the datasets these models can access could reduce their reliability. In other words, we cannot have our cake (limiting the information available to LLMs) and eat it too (demanding that LLMs provide accurate, up-to-date information). Moreover, AI biases are often caused by a lack of diversity in training data. Are we sure that cutting off access to a significant share of the data on the web is the right way to prevent such biases?
2. Competition
If AI companies are barred from using public data, they will switch to copyright-proof training datasets. Big companies can easily negotiate and buy licenses for large datasets, as they are already doing. Smaller firms, however, will be forced to rely on cheap or open-source data alone. And which companies are, at the moment, best placed to create marketplaces for data-licensing rights? That’s right: Big Tech. In a market that already shows the first signs of concentration, copyright would probably reduce competition, strengthening the rent-seeking positions of a few big players and shrinking the data available to entrants.
3. Innovation
We tend to forget that AI tools are increasingly used by… artists. If AI becomes a mainstream tool in some fields (which seems likely), limiting datasets through copyright would mean limiting the power of the tools artists use. This has already happened: when courts ruled that music sampling was not fair use, much creative sampling died out, and popular sample-based music now tends to draw on the same cleared libraries of samples. Finally, reducing the data available to the whole industry might reduce the overall innovation and competitiveness of a country’s AI ecosystem. For instance, China’s blocking of “controversial” training data might condemn its AI industry to lag permanently behind the US.
Two reflections can be addressed to regulators and artists.
First, copyright is not a silver bullet: there are trade-offs that might ultimately harm the creative industry more than benefit it. The history of copyright is full of cases, and lawsuits, in which copyright protections have enriched wealthy intermediaries while suffocating the creation and dissemination of content by artists, writers, and musicians. Copyright usually benefits deep-pocketed actors, and applying it to AI datasets might devolve into a fight among big corporations (and against smaller developers?) to vacuum up as much AI-generated content as possible through copyright takedowns.
Second, against the “AI-will-steal-our-job” narrative, artists are probably the best positioned to use the new technological tools. To create valuable outputs, AI-based generation requires the skill to distinguish what makes an artwork “good” (this is why markets for prompts are emerging) and often benefits from more traditional, “high-skill” capabilities (e.g., using Photoshop or knowing a range of art styles). For this reason, artists and designers will likely become the most proficient users of AI tools, just as painters, far from being displaced by photography, benefitted from it.
For these reasons, we should be cautious about using copyright to create long-term gatekeepers in AI in exchange for short-term, probably negligible, profits for the creative industry.
A large language model (LLM) like OpenAI’s GPT is a type of computational model notable for its ability to achieve general-purpose language understanding and generation. LLMs acquire these abilities by training on huge amounts of data.
A trade-off is a decision that diminishes the quality of some aspects in return for gains in others. In simple terms, a trade-off is a situation where one thing increases and another must decrease.