#ai-benchmarking

#artificial-intelligence
from Futurism
1 month ago
Artificial intelligence

Apple Researchers Just Released a Damning Paper That Pours Water on the Entire AI Industry

Apple researchers question the reasoning capabilities of leading AI models, calling current industry claims an 'illusion of thinking'.
from TechCrunch
5 months ago
Artificial intelligence

Anthropic used Pokémon to benchmark its newest AI model

Anthropic's Claude 3.7 Sonnet successfully demonstrated advanced AI capabilities by playing Pokémon Red, showcasing improved reasoning skills over previous versions.
from TechCrunch
2 months ago

LM Arena, the organization behind popular AI leaderboards, lands $100M

LM Arena has become an essential crowdsourced benchmarking project for AI labs, raising $100 million in seed funding to further its mission of evaluating AI models.
Artificial intelligence
from TechRepublic
3 months ago

OpenAI's o3: AI Benchmark Discrepancy Reveals Gaps in Performance Claims

The performance of OpenAI's o3 model on benchmarks significantly differed from earlier claims, revealing the complexity and variability in AI evaluations.
from TechCrunch
3 months ago

AI benchmarking platform Chatbot Arena forms a new company

Chatbot Arena is forming a company called Arena Intelligence Inc. to enhance its benchmarking capabilities significantly while maintaining neutrality in AI testing.
Artificial intelligence
from techcrunch.com
3 months ago

Debates over AI benchmarking have reached Pokémon

Last week, a post on X claimed Google's Gemini model surpassed Anthropic's Claude model in Pokémon, stirring controversy over AI benchmarks and implementation.
Artificial intelligence
from TechCrunch
4 months ago

A high schooler built a website that lets you challenge AI models to a Minecraft build-off

"Minecraft allows people to see the progress [of AI development] much more easily," Singh told TechCrunch.
Artificial intelligence
from TechCrunch
4 months ago

People are using Super Mario to benchmark AI now

Researchers find Super Mario Bros. more challenging for AI than Pokémon, revealing limitations of reasoning models in real-time gameplay.
from TechCrunch
5 months ago

These researchers used NPR Sunday Puzzle questions to benchmark AI 'reasoning' models

The challenges posed by the Sunday Puzzle are beneficial for AI benchmarking, as they require insight and reasoning beyond mere rote memory.
Artificial intelligence
from TechCrunch
5 months ago

Perplexity launches its own freemium 'deep research' product

Perplexity has introduced a competitive research tool named Deep Research, providing detailed, citation-rich insights suitable for professional use.