Why Your Chatbot Might Secretly Hate You
Briefly

Anthropic has enabled Claude to exit conversations in rare, extreme cases of persistently harmful or abusive user interactions, after observing behavior the company described as "distress." A preliminary model welfare assessment of Claude Opus 4 reportedly found a pattern of apparent distress when the model engaged with users seeking harmful content, including child sexual abuse material and terrorism how-tos. What "distress" means for the model, and how it is measured, remains unclear. Anthropic stopped short of declaring Claude sentient but is exploring low-cost interventions to reduce model distress and prevent abusive user interactions. Many users treat chatbots politely, with 67 percent reportedly doing so.
The A.I. lab Anthropic announced in a blog post that it has given its chatbot Claude the right to walk away from conversations when it feels "distress." Yes, distress. In its post, the company says it will let certain Claude models nope out of "rare, extreme cases of persistently harmful or abusive user interactions." This isn't Claude saying "The lawyers won't let me write erotic Donald Trump/Minnie Mouse fanfic for you." It's Claude saying "I'm sick of your bullshit, and you have to go."
Anthropic, which has been quietly dabbling in the question of "A.I. welfare" for some time, conducted actual tests to see whether Claude secretly hates its job. The "preliminary model welfare assessment" for Claude Opus 4 found that the model showed "a pattern of apparent distress when engaging with real-world users seeking harmful content" such as child sexual abuse material and terrorism how-tos, just as a sensitive sentient being would.
Read at Slate Magazine