#benchmarks

from ZDNET
1 day ago

Does the new Flux.2 beat Nano Banana Pro? You can try it for yourself - for free

According to Black Forest Labs, the model's improvements include support for up to 10 reference images, so you can combine many more elements from different pictures in the final product; improved photorealism and detail; more accurate text rendering, a task image-generation models frequently struggle with; better prompt following; and a better grasp of real-world knowledge.
Artificial intelligence
#enterprise-ai
from InfoWorld
3 days ago
Artificial intelligence

Anthropic's Claude Opus 4.5 pricing cut signals a shift in the enterprise AI market

#gemini-3-pro
#gemini-3
from Fortune
1 week ago
Artificial intelligence

Gemini 3 and Antigravity, explained: Why Google's latest AI releases are a big deal | Fortune

Miscellaneous
from Independent
2 weeks ago

Government waves white flag on its housing targets with launch of its new strategy

Government abandons annual housing benchmarks due to construction slowdown; Taoiseach insists major investment in housing will succeed.
from Ars Technica
2 weeks ago

OpenAI walks a tricky tightrope with GPT-5.1's eight new personalities

On Wednesday, OpenAI released GPT-5.1 Instant and GPT-5.1 Thinking, two updated versions of its flagship AI models now available in ChatGPT. The company is wrapping the models in the language of anthropomorphism, claiming that they're warmer, more conversational, and better at following instructions. The release follows complaints earlier this year that its previous models were excessively cheerful and sycophantic, along with an opposing controversy among users over how OpenAI modified the default GPT-5 output style after several suicide lawsuits.
Artificial intelligence
from Futurism
2 weeks ago

Researchers "Embodied" an LLM Into a Robot Vacuum and It Suffered an Existential Crisis Thinking About Its Role in the World

The "Butter-Bench" test, as detailed in a yet-to-be-peer-reviewed paper, is a "benchmark that evaluates practical intelligence in embodied LLM." In the test, the robot had to navigate to an office kitchen, wait for butter to be placed on a tray attached to its back, confirm the pickup, deliver it to a marked location, and finally return to its charging dock. The results of the Butter-Bench experiment, the researchers conceded, were dubious.
Artificial intelligence
#ai-evaluation
from InfoQ
1 month ago
Artificial intelligence

Google Stax Aims to Make AI Model Evaluation Accessible for Developers

from Buffer: All-you-need social media toolkit for small businesses
2 months ago

What Is a Good Facebook Engagement Rate? Data From 52 Million+ Posts

One of the most common questions creators and brands ask: "Is my engagement rate good?" The answer depends on your follower count. A 5% engagement rate looks very different for a neighborhood café with 500 fans than for a news publisher with half a million. That's why we analyzed 52 million Facebook posts across 213,000 accounts with over 6.9 billion engagements collectively, to see how engagement rates shift by follower tier.
Online marketing
Artificial intelligence
from ZDNET
1 month ago

Even the best AI agents are thwarted by this protocol - what can be done

Even top AI models struggle to use Model Context Protocol, requiring many interaction rounds and MCP-specific training to handle complex multi-server tasks.
Artificial intelligence
from InfoQ
1 month ago

Claude Sonnet 4.5 Tops SWE-Bench Verified, Extends Coding Focus Beyond 30 Hours

Claude Sonnet 4.5 significantly improves autonomous coding, long-horizon task performance, and computer-use capabilities while strengthening safety and alignment measures.
Artificial intelligence
from The Register
1 month ago

Microsoft adds Copilot adoption benchmarks to Viva Insights

Microsoft added Copilot adoption benchmarks to Viva Insights, enabling managers to compare active Copilot usage across cohorts, roles, regions, and other companies.
Artificial intelligence
from InfoQ
1 month ago

Google DeepMind Launches Gemini 2.5 Computer Use Model to Power UI-Controlling AI Agents

Gemini 2.5 Computer Use enables AI agents to perceive and manipulate graphical user interfaces—clicking, typing, scrolling—via a looped screenshot-and-action API, showing strong benchmark performance.
Artificial intelligence
from Fortune
1 month ago

Anthropic releases Claude 4.5, a model it says can build software and accomplish business tasks autonomously | Fortune

Claude Sonnet 4.5 runs autonomously for 30 hours and significantly improves coding, benchmark performance, and business-oriented task completion over prior models.
from WIRED
1 month ago

I Benchmarked Qualcomm's New Snapdragon X2 Elite Extreme. Here's What I Learned

It's important to note that this was all tested on the X2 Elite Extreme configuration, which comes with six additional CPU cores over the standard X2 Elite. There were no X2 Elite systems to test, so we don't know what those multi-core scores will be. I've been told that GPU performance will also scale up on the X2 Elite, but we don't yet know how much faster the X2 Elite Extreme is over its sibling.
Silicon Valley
from InfoQ
2 months ago

xAI Releases Grok 4 Fast with Lower Cost Reasoning Model

xAI has introduced Grok 4 Fast, a new reasoning model designed for efficiency and lower cost. The model reduces average thinking tokens by 40% compared with Grok 4, which brings an estimated 98% decrease in cost for equivalent benchmark performance. It maintains a 2-million token context window and a unified architecture that supports both reasoning and non-reasoning use cases. The model also integrates tool-use capabilities such as web browsing and X search.
Artificial intelligence
Mobile UX
from GSMArena.com
2 months ago

MediaTek confirms the Dimensity 9500's launch date and it's very close

MediaTek will unveil the Dimensity 9500 SoC on September 22, one day before Qualcomm's Snapdragon 8 Elite Gen 5 announcement.
from Techzine Global
2 months ago

CrowdStrike and Meta launch open source AI benchmarks for SOC

CrowdStrike and Meta are jointly introducing CyberSOCEval, a new suite of open source benchmarks for evaluating how AI systems perform in security operations. The collaboration aims to help organizations select more effective AI tools for their Security Operations Center by defining what effective AI looks like for cyber defense. The suite is built on Meta's open source CyberSecEval framework and CrowdStrike's frontline threat intelligence.
Artificial intelligence
Artificial intelligence
from Real Python
2 months ago

Episode #264: Large Language Models on the Edge of the Scaling Laws - The Real Python Podcast

LLM scaling is reaching diminishing returns; benchmarks are often flawed, and developer productivity gains from these models remain modest amid economic hiring shifts.
from Peterbe
5 months ago

Native connection pooling in Django 5 with PostgreSQL - Peterbe.com

Adding 'OPTIONS': {'pool': True}, to the DATABASES['default'] config made this endpoint 5.4 times faster.
Django
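
To show where the option from the summary above sits, here is a minimal sketch of a Django settings fragment. The database name, user, host, and port are placeholder assumptions and do not come from the post. The pool option is understood to require Django 5.1 or later with psycopg 3 and its pool extra installed.

```python
# settings.py (fragment): a sketch, not the post author's actual configuration.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "mydb",        # placeholder, assumed for illustration
        "USER": "myuser",      # placeholder, assumed for illustration
        "HOST": "localhost",   # placeholder, assumed for illustration
        "PORT": "5432",        # placeholder, assumed for illustration
        "OPTIONS": {
            # Turn on the PostgreSQL backend's connection pool with defaults.
            "pool": True,
        },
    }
}
```

The 5.4x figure is the post's measurement for a single endpoint; the gain elsewhere will likely depend on how much of each request is spent opening database connections.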