Amazon's bet that AI benchmarks don't matter
""I want real-world utility. None of these benchmarks are real," Rohit Prasad, Amazon's SVP of AGI, told me ahead of today's announcements at AWS re:Invent in Las Vegas. "The only way to do real benchmarking is if everyone conforms to the same training data and the evals are completely held out. That's not what's happening. The evals are frankly getting noisy, and they're not showing the real power of these models.""
""It's a contrarian stance when every other AI lab is quick to boast about how their new models quickly climb the leaderboards. It's also convenient for Amazon, given that the previous version of Nova, its flagship model, was sitting at spot 79 on LMArena when Prasad and I spoke last week. Still, dismissing benchmarks only works if Amazon can offer a different story about what progress looks like.""
OpenAI, Anthropic, and Google dominate model leaderboards, while Amazon urges a focus on practical performance instead. Benchmarks produce noisy, misleading results when models train on different data and evaluations are not fully held out; true benchmarking would require identical training datasets and completely withheld evals to reveal genuine capabilities. Leaderboard positions can obscure real-world utility and robustness, as Nova's low LMArena ranking illustrates. Alternative evaluation approaches would emphasize task usefulness, tool integration, safety, and reproducibility, and a shift toward standardized, real-world evaluation aims to give a clearer picture of model progress than transient leaderboard gains.
Read at The Verge