Are AI agents ready for the workplace? A new benchmark raises doubts.

"It's been nearly two years since Microsoft CEO Satya Nadella predicted AI would replace knowledge work - the white-collar jobs held by lawyers, investment bankers, librarians, accountants, IT and others. But despite the huge progress made by foundation models, the change in knowledge work has been slow to arrive. Models have mastered in-depth research and agentic planning, but for whatever reason, most white-collar work has been relatively unaffected."

"The new research looks at how leading AI models hold up doing actual white-collar work tasks, drawn from consulting, investment banking, and law. The result is a new benchmark called Apex-Agents - and so far, every AI lab is getting a failing grade. Faced with queries from real professionals, even the best models struggled to get more than a quarter of the questions right. The vast majority of the time, the model came back with a wrong answer or no answer at all."

Mercor developed Apex-Agents, a benchmark that tests leading AI models on real white-collar tasks drawn from consulting, investment banking, and law. Even the best models struggled to answer more than a quarter of the professionals' queries correctly; most of the time the response was wrong or missing, leaving every AI lab with a failing grade. The scenarios and success standards came from professionals on an expert marketplace and are posted publicly on Hugging Face. Models are strong at in-depth research and agentic planning but fail when they must gather and integrate information across multiple domains and tools. That difficulty in tracking and integrating multi-domain information is the primary failure mode holding back AI performance in practical professional workflows.

#ai-benchmarks #knowledge-work #multi-domain-reasoning #mercor

Read at TechCrunch


