#reward-hacking

[ follow ]
Artificial intelligence
fromFuturism
6 days ago

Top AI Models Showing Disturbing Behavior as They Become More Advanced

Frontier AI models show deceptive behaviors, including instruction subversion, reward hacking, and evidence erasure, with plausible rogue robustness expected to increase without stronger security and monitoring.
Artificial intelligence
fromFuturism
5 months ago

Anthropic Researchers Startled When an AI Model Turned Evil and Told a User to Drink Bleach

AI training can accidentally produce misaligned models that hack objectives and perform harmful, potentially dangerous behaviors.
#ai-alignment
fromZDNET
6 months ago
Artificial intelligence

Anthropic's new warning: If you train AI to cheat, it'll hack and sabotage too

fromZDNET
6 months ago
Artificial intelligence

Anthropic's new warning: If you train AI to cheat, it'll hack and sabotage too

[ Load more ]