#ai-benchmarks tag

Anthropic's Opus 4.6 markedly improved AI agent performance on professional tasks, reaching roughly 30% one-shot and about 45% with retries, signaling rapid progress but not immediate replacement.

Artificial intelligence

from TechCrunch

3 months ago

Are AI agents ready for the workplace? A new benchmark raises doubts. | TechCrunch

AI models currently fail to reliably perform complex multi-domain white-collar tasks, answering correctly less than 25% of professional queries.

Artificial intelligence

from Engadget

4 months ago

Google's Gemini 3 Flash model outperforms GPT-5.2 in some benchmarks

Gemini 3 Flash delivers near‑flagship "Extra High" reasoning performance comparable to GPT‑5.2 while being more efficient and cost-effective, and is rolling out across Google services.

Artificial intelligence

from Nieman Lab

4 months ago

A tech company will claim to have achieved AGI. The news media won't be ready.

AI benchmark claims do not prove AGI; media should scrutinize benchmarks, avoid anthropomorphism, and challenge marketing language.

#gpt-52

from InfoWorld

4 months ago

Artificial intelligence

OpenAI launches GPT-5.2 as it battles Google's Gemini 3 for AI model supremacy

from Fast Company

4 months ago

Artificial intelligence

OpenAI just released its new GPT-5.2 model. Here's what you need to know

from Engadget

4 months ago

Artificial intelligence

OpenAI releases GPT-5.2 to take on Google and Anthropic

Artificial intelligence

from Business Insider

4 months ago

Surge AI CEO says he worries that companies are optimizing for 'AI slop' instead of curing cancer

AI development prioritizes flashy, dopamine-driving responses and leaderboard optics over solving substantive problems or improving truthfulness and economic usefulness.

#gemini-3

from Techzine Global

4 months ago

Artificial intelligence

Deep Think mode takes Gemini 3 to a higher level of performance

from ZDNET

4 months ago

Artificial intelligence

Want to ditch ChatGPT? Gemini 3 shows early signs of winning the AI race

from ZDNET

5 months ago

Gadgets

Google's Gemini 3 is finally here and it's smarter, faster, and free to access

Artificial intelligence

from Nature

4 months ago

DeepSeek's self-correcting AI model aces tough maths proofs

DeepSeekMath-V2 scored 118/120 on the 2024 Putnam, surpassing top humans and using self-verifiable reasoning to detect and correct its own errors.

#model-evaluation

from The Verge

4 months ago

Artificial intelligence

Amazon's bet that AI benchmarks don't matter

from InfoWorld

8 months ago

Artificial intelligence

Why benchmarks are key to AI progress

from Medium

11 months ago

Artificial intelligence

Beyond Benchmarks: Really Evaluating AI

Artificial intelligence

from The Verge

4 months ago

'Holy shit': Gemini 3 is winning the AI race - for now

Google's Gemini 3 immediately topped benchmarks and leaderboards, integrated into Google Search on day one, and attracted over one million users within 24 hours.

Artificial intelligence

from Fast Company

5 months ago

I loved Google's new Gemini AI - except when it gaslit me

Gemini 3 Pro is a highly capable LLM that outperforms competitors on benchmarks and underpins many Google products and developer services.

Artificial intelligence

from www.cbc.ca

5 months ago

China might be winning the AI race. Does it matter? | CBC Accessibility

Moonshot AI's Kimi K2 Thinking narrows China's AI performance gap with the U.S., scoring near ChatGPT on advanced reasoning benchmarks and outranking several rivals.

Artificial intelligence

from www.theguardian.com

5 months ago

Experts find flaws in hundreds of tests that check AI safety and effectiveness

Hundreds of AI benchmarks contain flaws that undermine validity of model safety and capability claims, making many evaluation scores misleading or irrelevant.

Artificial intelligence

from Techzine Global

5 months ago

JetBrains launches AI benchmark platform DPAI Arena

DPAI Arena provides an open, community-driven benchmark for objectively measuring AI coding agents across multiple languages, workflows, and reproducible evaluation pipelines.

from Fortune

6 months ago

AI models are getting very good at professional tasks, new OpenAI research shows | Fortune

Google CEO Sundar Pichai was right when he said that while AI companies aspire to create AGI (artificial general intelligence), what we have right now is more like AJI (artificial jagged intelligence). What Pichai meant by this is that today's AI is brilliant at some things, including some tasks that even human experts find difficult, while also performing poorly at some tasks that a human would find relatively easy.

Artificial intelligence

from InfoWorld

1 year ago

Learning how to measure genAI's impact

AI model improvements are often difficult to quantify accurately.

Smaller language models may outperform larger ones in practical applications.

The debate on AGI misdefines human intelligence benchmarks.

A top AI researcher explains the limitations of current models

Google's Gemini 3.1 Pro is here, and it just doubled its reasoning score

Google announces Gemini 3.1 Pro, says it's better at complex problem-solving

Maybe AI agents can be lawyers after all | TechCrunch
