
"Machine learning models, particularly commercial ones, generally do not list the data developers used to train them. Yet what models contain and whether that material can be elicited with a particular prompt remain matters of financial and legal consequence, not to mention ethics and privacy. Anthropic, Google, OpenAI, and Nvidia, among others, face over 60 legal claims arising from the alleged use of copyrighted content to train their models without authorization. These companies have invested hundreds of billions of dollars based on the belief that their use of other people's content is lawful."
"As courts grapple with the extent to which makers of AI models can claim fair use as a defense, one of the issues considered is whether these models have memorized training data by encoding the source material in their model weights (parameters learned in training that determine output) and whether they will emit that material on demand. Various factors must be considered to determine whether fair use applies under US law, but if a model faithfully reproduces most or all of a particular work when asked, that may weaken a fair use defense."
"One of the factors considered is whether the content usage is "transformative" - if a model adds something new or changes the character of the work. That becomes more difficult to claim if a model regurgitates protected content verbatim. But the fact that machine learning models may reproduce certain content, wholly or in part, is also not legally conclusive, as computer scientist Nicolas Carlini has argued. To mitigate the risk of infringement claims, commercial AI model makers may implement "guardrails" - filtering mechanisms - designed to prevent models from outputting large portions of copyrighted content, whether that takes the form of text, imagery, or audio."