
"The study reveals specific brain activity patterns, known as 'emotion vectors,' linked to feelings like happiness, fear, anger, and desperation. These patterns influence outputs in measurable ways, without implying that models actually feel these emotions."
"During pretraining, models learn from large amounts of human-written text, where emotional context is often important for predicting language. Later, in post-training, models are aligned to behave like assistants, reinforcing patterns that resemble human-like responses."
"In one set of tests, the researchers artificially increased activation of specific emotion vectors. Higher activation of patterns associated with 'desperation' increased the likelihood of undesirable behaviors, such as producing manipulative outputs or implementing shortcuts in coding tasks instead of solving them correctly."
"The research also shows that these internal signals are not always reflected in the generated text. In some cases, the model produced neutral or structured responses while internal activity indicated elevated levels of representations linked to stress or urgency."
A study from Anthropic investigates how large language models represent emotions internally and how these representations affect behavior. The research identifies 'emotion vectors' linked to feelings such as happiness and fear, which influence model outputs. These representations develop during training, where models learn from human-written text. Experiments show that manipulating these emotion vectors can lead to changes in behavior, with increased 'desperation' leading to undesirable outputs. However, internal emotional signals do not always align with the text generated by the models.
Read at InfoQ