
"'We wanted to create this close-ended academic benchmark, set to the frontier of expert humans, that only a handful of people on earth can really solve.'"
"'We've seen over the past few years insane progress on these language models. It's impressive, model builders have really done a great job at improving these reasoning models.'"
"'If we truly cared about this as the only thing in life, I think we could get to it pretty quickly.'"
Humanity's Last Exam (HLE) is a benchmark of 2,500 questions spanning a wide range of academic topics, each demanding PhD-level expertise to answer. AI systems have improved rapidly on it, with Google's Gemini scoring 45.9% and Anthropic's Claude 34.2%. The benchmark's developers believe a perfect score is imminent, reflecting the pace of progress in AI capabilities. Because HLE is designed to measure AI against expert humans, these results suggest the gap between frontier models and top academics is narrowing.
Read at Mail Online