
"Humans (along with some other primates and small animals) are unique in that we can not only think, but we know we are thinking. This introspection allows us to scrutinize, self-reflect, and reassess our thoughts. AI may be working toward that same capability, according to researchers from Anthropic. They claim that the most advanced Claude Opus 4 and 4.1 models show "some degree" of introspection, exhibiting the ability to refer to past actions and reason about why they came to certain conclusions."
""It's like the 'director's commentary' on its own thoughts," noted Donovan Rittenbach, a freelance chief AI officer (CAIO) and author at MyAIWebGuy. "You don't just get the final answer, you get a description of the concepts it's using, facts it's recalling, and even its level of uncertainty, all while it's reasoning." However, this ability to introspect is limited and "highly unreliable," the Anthropic researchers emphasize. Models (at least for now) still cannot introspect the way humans can, or to the extent we do."
Claude Opus 4 and 4.1 can verbalize parts of their internal reasoning in roughly 20% of cases, producing descriptions of recalled facts, concepts, and uncertainty during reasoning. In experiments using a technique called concept injection, the models sometimes identified and described unrelated concepts that researchers had inserted into their internal activations, indicating an ability to reference past internal states. These introspective outputs function like a "director's commentary," but they are inconsistent and highly unreliable. The capability does not match human introspection, and it still requires continuous human oversight to verify intentions, correctness, and the absence of hidden or spurious influences. Introspection may shorten interpretability work, but supervision remains essential.
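Anthropic's experiments aren't reproducible from this article alone (Claude's internals aren't public), but the general shape of a concept-injection test can be sketched with open tools. The following is a minimal, hypothetical illustration using the open-source TransformerLens library and GPT-2 as stand-ins; the layer choice, prompts, and injection strength are arbitrary assumptions for illustration, not Anthropic's actual method.

```python
# Hypothetical sketch of "concept injection": derive a rough direction for a
# concept from contrasting activations, add it into an unrelated forward pass,
# and observe whether the model's output drifts toward the injected concept.
# Uses GPT-2 via TransformerLens as a stand-in; this is NOT Anthropic's code.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

# 1. Crude "concept vector" (assumed concept: "ocean") from the difference of
#    residual-stream activations on two prompts. Layer 6 is an arbitrary choice.
layer = 6
hook_name = f"blocks.{layer}.hook_resid_post"
_, cache_a = model.run_with_cache("The ocean waves crashed on the shore")
_, cache_b = model.run_with_cache("The meeting was rescheduled to Tuesday")
concept_vec = cache_a[hook_name].mean(dim=1) - cache_b[hook_name].mean(dim=1)

# 2. Hook that adds the concept vector (scaled by an arbitrary strength) to
#    every position's residual stream at the chosen layer.
def inject(resid, hook, strength=8.0):
    return resid + strength * concept_vec.unsqueeze(1)

prompt = "Today I have been thinking about"
baseline = model.generate(prompt, max_new_tokens=20)
with model.hooks(fwd_hooks=[(hook_name, inject)]):
    injected = model.generate(prompt, max_new_tokens=20)

print("baseline:", baseline)
print("injected:", injected)
```

Anthropic's protocol goes a step further than this sketch: rather than only observing output drift, it asks the model directly whether it notices an injected "thought" and checks whether that self-report matches the manipulation.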
 Read at Computerworld