AI's Memorization Crisis
Briefly
"In fact, when prompted strategically by researchers, Claude delivered the near-complete text of Harry Potter and the Sorcerer's Stone, The Great Gatsby, 1984, and Frankenstein, in addition to thousands of words from books including The Hunger Games and The Catcher in the Rye. Varying amounts of these books were also reproduced by the other three models. Thirteen books were tested."
"This phenomenon has been called "memorization," and AI companies have long denied that it happens on a large scale. In a 2023 letter to the U.S. Copyright Office, OpenAI said that "models do not store copies of the information that they learn from." Google similarly told the Copyright Office that "there is no copy of the training data—whether text, images, or other formats—present in the model itself." Anthropic, Meta, Microsoft, and others have made similar claims."
Researchers at Stanford and Yale demonstrated that four popular large language models—OpenAI's GPT, Anthropic's Claude, Google's Gemini, and xAI's Grok—have stored large portions of some books used in training and can reproduce long excerpts. Claude produced near-complete texts of several well-known novels and thousands of words from others, while the other models reproduced varying amounts. AI companies have previously denied that models store copies of training data. Image-based models can also reproduce trained art and photographs. The existence of verbatim reproductions poses potential legal liability, including large copyright judgments and possible product removals.
Read at The Atlantic