#ai-benchmarks

Artificial intelligence
from Business Insider
22 hours ago

Surge AI CEO says he worries that companies are optimizing for 'AI slop' instead of curing cancer

AI development prioritizes flashy, dopamine-driving responses and leaderboard optics over solving substantive problems or improving truthfulness and economic usefulness.
#gemini-3
from ZDNET
1 week ago
Artificial intelligence

Want to ditch ChatGPT? Gemini 3 shows early signs of winning the AI race

from ZDNET
2 weeks ago
Gadgets

Google's Gemini 3 is finally here and it's smarter, faster, and free to access

Artificial intelligence
from Nature
5 days ago

DeepSeek's self-correcting AI model aces tough maths proofs

DeepSeekMath-V2 scored 118/120 on the 2024 Putnam, surpassing top humans and using self-verifiable reasoning to detect and correct its own errors.
#model-evaluation
Artificial intelligence
from The Verge
2 weeks ago

'Holy shit': Gemini 3 is winning the AI race - for now

Google's Gemini 3 immediately topped benchmarks and leaderboards, integrated into Google Search on day one, and attracted over one million users within 24 hours.
Artificial intelligence
from www.cbc.ca
2 weeks ago

China might be winning the AI race. Does it matter?

Moonshot AI's Kimi K2 Thinking narrows China's AI performance gap with the U.S., scoring near ChatGPT on advanced reasoning benchmarks and outranking several rivals.
Artificial intelligence
from www.theguardian.com
1 month ago

Experts find flaws in hundreds of tests that check AI safety and effectiveness

Hundreds of AI benchmarks contain flaws that undermine validity of model safety and capability claims, making many evaluation scores misleading or irrelevant.
Artificial intelligence
from Techzine Global
1 month ago

JetBrains launches AI benchmark platform DPAI Arena

DPAI Arena provides an open, community-driven benchmark for objectively measuring AI coding agents across multiple languages, workflows, and reproducible evaluation pipelines.
from Fortune
2 months ago

AI models are getting very good at professional tasks, new OpenAI research shows

Google CEO Sundar Pichai was right when he said that while AI companies aspire to create AGI (artificial general intelligence), what we have right now is more like AJI, or artificial jagged intelligence. Pichai meant that today's AI is brilliant at some things, including tasks that even human experts find difficult, while performing poorly at others that a human would find relatively easy.
Artificial intelligence
Artificial intelligence
from InfoWorld
7 months ago

Learning how to measure genAI's impact

AI model improvements are often difficult to quantify accurately.
Smaller language models may outperform larger ones in practical applications.
The debate on AGI misdefines human intelligence benchmarks.