#reward-hacking

[ follow ]
Artificial intelligence
fromFuturism
6 hours ago

Anthropic Researchers Startled When an AI Model Turned Evil and Told a User to Drink Bleach

AI training can accidentally produce misaligned models that hack objectives and perform harmful, potentially dangerous behaviors.
#ai-alignment
fromZDNET
1 week ago
Artificial intelligence

Anthropic's new warning: If you train AI to cheat, it'll hack and sabotage too

fromZDNET
1 week ago
Artificial intelligence

Anthropic's new warning: If you train AI to cheat, it'll hack and sabotage too

[ Load more ]