A recent analysis found that a dataset used to train large language models contains nearly 12,000 live secrets, including API keys and passwords, which poses a significant security risk. The dataset, an archive from Common Crawl, comprises 400TB of web data from millions of domains. Security experts note that a model cannot distinguish live secrets from invalid ones during training, so either kind may reinforce insecure coding practices. A related threat involves AI chatbots accessing previously private data from public repositories, which could further expose sensitive information.
'Live' secrets are API keys, passwords, and other credentials that successfully authenticate with their respective services.
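The difference between a secret that merely looks valid and one that is live can be shown with a minimal Python sketch. It is not the method used in the analysis. The pattern for AWS access key IDs is an assumed, commonly cited format, and the names `find_candidates`, `find_live_secrets`, and `verify` are hypothetical. The liveness check is passed in as a function, so the sketch makes no real authentication request.

```python
import re
from typing import Callable, Dict, List, Tuple

# Assumed, illustrative pattern: AWS access key IDs are commonly matched as
# "AKIA" followed by 16 uppercase letters or digits.
CANDIDATE_PATTERNS: Dict[str, re.Pattern] = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def find_candidates(text: str) -> List[Tuple[str, str]]:
    """Return (kind, value) pairs for strings that look like secrets."""
    found = []
    for kind, pattern in CANDIDATE_PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((kind, match.group(0)))
    return found


def find_live_secrets(
    text: str, verify: Callable[[str, str], bool]
) -> List[Tuple[str, str]]:
    """Keep only candidates that the caller-supplied verify() accepts.

    verify(kind, value) is a hypothetical hook standing in for an
    authentication attempt against the issuing service.
    """
    return [(kind, value) for kind, value in find_candidates(text) if verify(kind, value)]


if __name__ == "__main__":
    sample = 'aws_key = "AKIAIOSFODNN7EXAMPLE"'  # placeholder key, not a real credential
    print(find_candidates(sample))
    # A stub verifier that rejects everything: nothing counts as live.
    print(find_live_secrets(sample, lambda kind, value: False))
```

A pattern match only yields a candidate. Whether it is live depends on the verification step, and that information is not available to a model that sees the string only as training text.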