
Can Large Language Models Use the Contents of Your Website?

Large Language Models (LLMs) like GPT-4 and its front-end ChatGPT work by ingesting gigantic amounts of text from the Internet to train a model and then responding to prompts with text generated from that model. Depending on who you ask, this is either one step (or maybe no steps) from Artificial General Intelligence, or, as Ted Chiang wrote in the New Yorker, "ChatGPT Is a Blurry JPEG of the Web." While I have my opinions about that, at this point I'm considering what the relationship is under copyright law between the input text and the output text. Keeping in mind that I am not a lawyer, and no court has yet decided an LLM case, let's take a look.

Copyright law is about who is allowed to make copies of what, and the precise definitions of those terms are surprisingly complicated. In this case, the "what" is all of the web material that LLMs use to train.

The next question is, what's a copy? The LLM output is rarely an exact copy of the training material (although some of the examples in the Getty Images v. Stability AI case have recognizable Getty watermarks), but copies are not just literal copies. The law says:

A “derivative work” is a work based upon one or more preexisting works, such as [long list of examples], or any other form in which a work may be recast, transformed, or adapted.

It seems to me that the output of the LLM is a work based on preexisting works, namely, the websites it trained on. What else could it be based on? While there is likely some manual tweaking, that doesn’t change the result. If I take a document and translate it into another language or take a story and turn it into a play, it’s still a derivative work. My manual work makes me a coauthor, but it doesn’t wipe out the original.

One might make a de minimis argument that there is so much training data that the amount of any particular input document in any output is too small to matter. I find that unpersuasive. Depending on the question, the results might be based on a million sources or, for particularly obscure questions, a single source. Since LLMs are notoriously unable to show their work and, as often as not, make up sources that don't exist, the burden would be on the operator of the LLM to show how its outputs depend on its inputs, which, at this point, they can't do. We know that the outputs can sometimes obviously depend on identifiable inputs, e.g., the Getty watermark example or computer code written with distinctive variable names.

If the LLM output is a derivative of the training data, the next question is whether that's OK anyway. Under US law, "fair use" allows some kinds of copying. The law does not define fair use but gives judges four factors to evaluate fair use claims: the purpose and character of the use, the nature of the work, the amount copied, and the effect on the work's market or value. In practice, the first factor is the most important, and courts look for use that is transformative, that is, a use different from that of the original. For example, in the Google Books case, the court found that Google's book scanning was transformative because it created an index of words and snippets for people to search, which is quite different from the purpose of the books, which is for people to read them. On the other hand, in the recent Internet Archive case, the court found that the purpose of the Archive's book scans was the same as that of the paper books, for people to read, so it was not transformative.

Defining the purpose of LLM output seems like a minefield. Maybe it's the same as the source, maybe not, depending on the prompt. If you ask it how many people live in China, that's a simple fact with a vast list of possible sources, and the purpose isn't very interesting. But what if you ask it to write Python code to collect statistics from a particular social network's website, or to compare the political views of two obscure French statesmen? There aren't likely to be many sources for questions like that, the sources are going to look a lot like what the LLM generates, and the purpose of the sources is likely the same as that of the result.

The other fair use criteria are a mixed bag. The nature of the work and the amount of material used will vary depending on the specific prompt and response. For the market or value effect, source pages often have ads or links to upsell to a paid service. If the LLM output is a replacement for the source, the user doesn't see the source, so the source loses the ad or potential upsell revenue, which weighs against fair use.

All this analysis involves a fair amount of hand waving, with a lot of the answers depending on the details of the source material and what the LLM does with it. Nonetheless, it is easy to imagine more situations like Getty Images where the facts support claims that LLM output is a derivative work and is not fair use.

LLM developers have not done themselves any favors by dodging this situation. In Field v. Google, the court held that it was OK for Google to copy Field's website into its web cache, both because the copying served a transformative purpose and because it is easy to opt out using the Robots Exclusion Protocol (ROBOTS.TXT files and the like) to tell some or all web spiders to go away. I have no idea how I would tell the various LLM developers not to use my websites, a problem they could easily have solved by also following ROBOTS.TXT; a sketch of what such an opt-out might look like appears below. But OpenAI says:

The Codex model was trained on tens of millions of public repositories, which were used as training data for research purposes in the design of Codex. We believe that is an instance of transformative fair use.

Well, maybe.
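To be concrete, here is a minimal sketch of what opting out via the Robots Exclusion Protocol might look like if LLM crawlers identified themselves and honored it. The user-agent token "ExampleLLMBot" is invented for illustration; at this point there is no agreed-upon name for LLM training crawlers, which is exactly the problem.

    # Hypothetical entries in https://example.com/robots.txt
    # "ExampleLLMBot" is a made-up user-agent token for illustration only.
    User-agent: ExampleLLMBot
    Disallow: /

    # Everyone else (search engines, archives, and the like) may crawl normally.
    User-agent: *
    Disallow:

A well-behaved crawler fetches /robots.txt before crawling a site and skips whatever the matching record disallows; the Field v. Google opinion leaned on exactly that convention.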

One other legal approach, suggested by Ed Hasbrouck, is to assert the author’s moral rights described in the Berne Convention on copyright and demand credit for use of one’s material. I don’t think that’s a useful approach for two reasons. One is practical; the United States has made it quite clear in its law that moral rights only apply to works of visual art. Beyond that, I’d think that even in places that apply moral rights to written work, the same derivative work analysis would apply, and if an LLM used enough source material for moral rights to apply, it’d also be enough to infringe.

By John Levine, Author, Consultant & Speaker
