Fair Use and the Training of Large Language Models
The litigation over copyright and the training of large language models is arguably the most consequential intellectual-property dispute of the decade. The plaintiffs in New York Times v. Microsoft & OpenAI[6] and the dozens of analogous actions now pending in the federal courts seek to establish that ingesting copyrighted text into the training corpus of a generative model constitutes infringement not excused by the fair use doctrine codified at 17 U.S.C. § 107.[1] The defendants advance, with substantial doctrinal support, the contrary position: that training is a paradigmatic transformative use, that the model's outputs do not ordinarily reproduce the protected expression of the training works, and that the commercial character of the use is outweighed by the public benefits associated with the creation of general-purpose reasoning systems.
The four-factor test and its application
Section 107 directs courts to consider four non-exhaustive factors: (1) the purpose and character of the use, including whether it is commercial in nature; (2) the nature of the copyrighted work; (3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and (4) the effect of the use upon the potential market for or value of the copyrighted work. The factors are weighed together, no one of them dispositive, and the inquiry is famously case-specific.
The first factor, purpose and character, supplies most of the doctrinal action in the LLM-training context. The defendants' theory rests heavily on Authors Guild v. Google, Inc.,[2] in which the Second Circuit held that Google's mass digitization of copyrighted books for the purpose of building a searchable index was a transformative fair use. The training of a generative model, on this view, is functionally analogous to the indexing operation in Google Books: the original works are ingested in their entirety, but only to construct a new artifact (a search index in Google Books, a statistical model in the LLM context) whose use does not substitute for the original expressive work.
The plaintiffs' principal counter-argument turns on the Supreme Court's recent decision in Andy Warhol Foundation v. Goldsmith,[5] which narrowed the transformative-use inquiry by asking whether the challenged use shares the purpose of the original and by giving weight to the commercial character of that use. The argument runs that LLM training, unlike the indexing in Google Books, produces outputs that compete in the very markets the original works occupy, particularly when those outputs are commercially deployed to generate text that substitutes for licensed content.
The fourth factor and the market-substitution question
The fourth factor, effect on the market for the original, has emerged as the principal battleground. The plaintiffs in NYT v. OpenAI have produced evidence, partially replicated in the public record, that the defendants' models can be induced to reproduce substantial portions of Times articles verbatim when prompted in particular ways. If these reproductions occur with sufficient regularity to substitute for the Times's licensed digital subscriptions or syndication arrangements, the fourth factor will favor the plaintiffs however the first factor is resolved. The defendants respond that such reproductions are adversarial artifacts, suppressible through technical countermeasures, and not representative of the model's ordinary operation.
The four-factor test, fashioned over decades of analog copying, must now be applied to a use case that copies everything and remembers nothing in particular.
The doctrinal question that NYT v. OpenAI raises, and that Authors Guild v. HathiTrust[3] only partially answers, is whether the analysis should proceed at the level of the training operation itself (an ingestion that the defendants characterize as transformative) or at the level of the model's output (which the plaintiffs characterize as substituting for licensed access). The framing materially affects the outcome. A court that focuses on the training operation is likely to find substantial transformativeness; a court that focuses on the output is more likely to find market substitution.
The licensing-market complication
A separate strand of the doctrine concerns the fourth factor's treatment of the market for licensing. The Supreme Court recognized in Campbell v. Acuff-Rose Music, Inc.,[4] that the fourth factor reaches harm to derivative markets that copyright owners would in general develop or license others to develop, and lower courts have built on that principle to treat a workable licensing market for the very use at issue as weighing against fair use. The development, since 2023, of an active market for licensed AI training data, with several major publishers concluding agreements with leading model developers, supplies plaintiffs with a structural argument that the fair use defense is no longer available even if it might once have been. The counter-argument is one of circularity: plaintiffs should not be able to defeat what would otherwise be a fair use by manufacturing a licensing market through their own conduct. That argument has support in the doctrinal literature but thin direct authority.
Likely trajectory
Our prediction, advanced cautiously, is that the federal courts will resolve the pending litigation through a combination of partial summary judgments rather than a single categorical pronouncement. The training operation itself is likely to be found transformative as a general matter; the output-substitution question is likely to be resolved on a case-by-case basis, with substantial weight given to the actual operational characteristics of the defendant's model and to the existence of licensing alternatives. The resulting doctrine will likely require model developers to take reasonable measures to prevent verbatim reproduction of identifiable copyrighted works while preserving the general lawfulness of the ingestion operation.
For related discussion of the platform-liability implications of model-generated content, see our commentary on Section 230 in the age of generative AI and on the First Amendment treatment of algorithmic curation.
[1] 17 U.S.C. § 107.
[2] Authors Guild v. Google, Inc., 804 F.3d 202 (2d Cir. 2015).
[3] Authors Guild, Inc. v. HathiTrust, 755 F.3d 87 (2d Cir. 2014).
[4] Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569 (1994).
[5] Andy Warhol Found. for the Visual Arts, Inc. v. Goldsmith, 598 U.S. 508 (2023).
[6] Complaint, New York Times Co. v. Microsoft Corp., No. 1:23-cv-11195 (S.D.N.Y. filed Dec. 27, 2023).
Related Commentary
- Section 230 in the Age of Generative AI (Platform Liability)
- When Anonymous Speech Meets Defamation Liability (Free Speech)
- The First Amendment Implications of Algorithmic Content Curation (Content Moderation)