12,000+ API Keys and Passwords Found in Public Datasets Used for LLM Training

from The Hacker News 4 months ago

A recent analysis found that a dataset used to train large language models contains nearly 12,000 live secrets, including API keys and passwords, posing significant security risks. The dataset is an archive from Common Crawl comprising 400TB of web data from millions of domains. Security experts note that a model cannot distinguish live secrets from invalid ones during training, so both kinds can reinforce insecure coding practices. A related threat involves AI chatbots surfacing data from repositories that were once public but have since been made private, which could expose sensitive information further.

'Live' secrets are API keys, passwords, and other credentials that successfully authenticate with their respective services.
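The difference between a string that merely looks like a secret and one that is live can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the analysis: the regular expression covers only the commonly cited shape of an AWS access key ID, and `authenticates` is a hypothetical callback standing in for a real authentication attempt against the issuing service.

```python
import re
from typing import Callable

# Illustrative pattern only: AWS access key IDs are commonly matched as
# "AKIA" followed by 16 uppercase letters or digits. Real scanners use
# many more patterns than this one.
CANDIDATE_PATTERN = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def find_candidates(text: str) -> list[str]:
    """Return unique secret-shaped strings in order of first appearance."""
    # dict.fromkeys removes duplicates while preserving order.
    return list(dict.fromkeys(CANDIDATE_PATTERN.findall(text)))


def find_live_secrets(text: str, authenticates: Callable[[str], bool]) -> list[str]:
    """Keep only the candidates for which the authentication check succeeds.

    `authenticates` is a hypothetical callback; a real check would have to
    contact the service that issued the credential.
    """
    return [secret for secret in find_candidates(text) if authenticates(secret)]
```

Pattern matching alone finds candidates. Whether a candidate is live only becomes known after an authentication attempt, which is information a model reading the raw text never sees.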


#security #api-keys #large-language-models #data-privacy #coding-practices

