OpenAI Research Finds That Even Its Best Models Give Wrong Answers a Wild Proportion of the Time
Briefly

OpenAI's o1-preview model scored only 42.7% on the company's new SimpleQA benchmark, highlighting a worrying pattern: even advanced AI models produce incorrect answers more often than correct ones.
Despite these low success rates, AI technologies continue to be woven into daily life. Hospitals adopting AI transcription tools, for instance, have already run into serious problems with inaccurate output.
Competing models fared even worse on raw accuracy: Anthropic's Claude-3.5-sonnet scored just 28.9%, though it took a more cautious approach, declining to answer questions it was unsure about more often.
The models also tend to be overconfident, 'hallucinating' elaborate falsehoods rather than admitting uncertainty, which encourages reliance on unreliable outputs in critical settings.
Read at Futurism