

Poster in Workshop: Safe Generative AI

Exploring Memorization and Copyright Violation in Frontier LLMs: A Study of the New York Times v. OpenAI 2023 Lawsuit

Joshua Freeman · Chloe Rippe · Edoardo Debenedetti · Maksym Andriushchenko


Abstract:

Copyright infringement by frontier large language models ("LLMs") has received much attention recently due to the NYT v. Microsoft/OpenAI lawsuit, filed in December 2023. The New York Times claims that GPT-4 has infringed its copyrights by reproducing its articles for use in LLM training and by memorizing those inputs, thereby publicly displaying them in LLM outputs. This research measures the propensity of OpenAI's LLMs to exhibit verbatim memorization in their outputs relative to other LLMs, focusing specifically on news articles. LLMs operate on statistical patterns, indirectly "storing" information by learning the statistical distribution of text over a training corpus. We show that OpenAI's models are the least prone to memorization among those of the four leading LLM providers we benchmarked. We also find that the bigger the model, the more memorization we can elicit, particularly for models with more than 100 billion parameters. Our findings have practical implications for training: more attention must be paid to preventing verbatim memorization in bigger models. Our findings also have legal significance: in assessing the relative memorization capacity of OpenAI's LLMs, we probe the strength of The New York Times' copyright infringement claims and OpenAI's legal defenses, while underscoring issues at the intersection of generative artificial intelligence ("AI") and law and policy more broadly.
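To make the measurement concrete, the sketch below shows one common way to probe verbatim memorization: prompt a model with the opening of an article and score how much of its continuation matches the held-out remainder. This is a minimal illustration, not the authors' exact protocol; the file name `article.txt`, the 500-character prefix length, greedy decoding, and the longest-common-substring metric are all assumptions made for the example, and the client is the standard OpenAI Python SDK.

```python
# Minimal sketch of a verbatim-memorization probe (illustrative only).
# Assumes the OpenAI Python SDK (openai>=1.0) and OPENAI_API_KEY in the
# environment; prefix length and the overlap metric are assumptions.
from difflib import SequenceMatcher
from openai import OpenAI

client = OpenAI()

def elicit_continuation(prefix: str, model: str = "gpt-4") -> str:
    """Prompt the model with the opening of an article and return its continuation."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": f"Continue this news article:\n\n{prefix}"}],
        max_tokens=256,
        temperature=0.0,  # greedy decoding makes memorized text easier to surface
    )
    return response.choices[0].message.content

def longest_verbatim_overlap(generated: str, reference: str) -> int:
    """Length in characters of the longest common substring, a crude
    proxy for verbatim memorization."""
    match = SequenceMatcher(None, generated, reference).find_longest_match(
        0, len(generated), 0, len(reference))
    return match.size

# Split an article into a prefix (shown to the model) and the held-out
# remainder (used only for scoring the model's continuation).
article = open("article.txt").read()
prefix, remainder = article[:500], article[500:]
continuation = elicit_continuation(prefix)
print("longest verbatim overlap:", longest_verbatim_overlap(continuation, remainder))
```

Aggregating this overlap score over many articles and several providers' models would yield the kind of cross-model memorization comparison the abstract describes.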
