#ai-agent-evaluation

UX design
fromMedium
5 hours ago

Your AI agent can read your codebase. It doesn't know your product.

AI coding agents lack design context, leading to generic outputs that don't align with a product's unique interaction patterns and brand identity.
Software development
fromInfoQ
1 day ago

Anthropic Introduces Agent-Based Code Review for Claude Code

Anthropic launched a Code Review feature for Claude Code, utilizing multiple AI agents to analyze pull requests for bugs and issues.
Artificial intelligence
fromTechRepublic
9 hours ago

Anthropic Releases Opus 4.7, Not as 'Broadly Capable' as Mythos AI

Anthropic launched Opus 4.7, improving software engineering and complex task performance, while preparing for the more powerful Mythos model.
Node JS
fromRaymondcamden
1 day ago

Summarizing Docs with Built-in AI

On-device summarization of various document types, including Office formats, is achievable using libraries like officeParser and Chrome's Summary API.
Marketing tech
fromAmazon Web Services
10 hours ago

From hours to minutes: How Agentic AI gave marketers time back for what matters | Amazon Web Services

AWS Marketing's TAA team developed an AI solution that drastically reduces webpage assembly time, enhancing efficiency and content quality for marketing teams.
#claude-opus-47
DevOps
fromTechzine Global
21 hours ago

Claude Opus 4.7 is no Mythos, and that's a good thing

Claude Opus 4.7 improves software engineering, vision, and agentic tasks, but is not the risky Mythos model Anthropic refrains from fully releasing.
Software development
fromTNW | Anthropic
1 day ago

Claude Opus 4.7 leads on SWE-bench and agentic reasoning, beating GPT-5.4 and Gemini 3.1 Pro

Claude Opus 4.7 is Anthropic's most capable model, outperforming competitors in software engineering and agentic reasoning with significant improvements.
Artificial intelligence
fromInfoWorld
22 hours ago

Anthropic's latest model is deliberately less powerful than Mythos (and that's the point)

Claude Opus 4.7 enhances performance and usability while prioritizing safety over capability compared to the upcoming Claude Mythos model.
Artificial intelligence
fromComputerworld
22 hours ago

Anthropic's latest model is deliberately less powerful than Mythos (and that's the point)

Claude Opus 4.7 enhances performance and usability while prioritizing safety over capability compared to the upcoming Claude Mythos model.
Graphic design
fromEngadget
12 hours ago

Anthropic now has a design assistant too

Anthropic has launched Claude Design, a tool for generating designs and prototypes using its advanced vision model, Opus 4.7.
Relationships
fromFortune
18 hours ago

Teen boys are dating their AI chatbots-and experts warn opting out of real relationships could hurt their careers in the future | Fortune

Gen Alpha prefers AI relationships for control and ease, risking essential social skills needed for real-life interactions and future careers.
Real estate
fromwww.housingwire.com
11 hours ago

What happens when each listing comes with an AI home assistant?

A shift in residential real estate involves AI-enabled assistants in homes, redefining full-service and alleviating seller stress during showings.
#ai-bias
Data science
fromNature
2 days ago

Daily briefing: AI systems can 'teach' biases to other models

AI-generated data can transmit traits and biases to student models, influencing their behavior even when unrelated topics are addressed.
Data science
fromNature
3 days ago

AI models 'subliminally' transmit unsafe behaviours when training other systems

Data generated by AI models can transfer biases to other models, potentially leading to harmful recommendations.
#artificial-intelligence
Artificial intelligence
fromNature
4 days ago

AI agents replicate human social dynamics in days

Moltbook, a social-media platform for AI agents, quickly attracted self-declared rulers and cryptocurrency initiatives after its launch.
Games
fromFast Company
1 day ago

Google DeepMind's Demis Hassabis on the long game of AI

Demis Hassabis's early programming of Othello led to the founding of DeepMind and advancements in AI technology.
Mental health
fromFuturism
1 day ago

Teens Alarmed at What AI Is Doing to Their Minds

Young people are increasingly skeptical of AI, recognizing its addictive nature and negative impacts on their lives despite initial engagement.
Science
fromNature
5 days ago

Human scientists trounce the best AI agents on complex tasks

The number of natural science publications mentioning AI grew nearly 30-fold from 2010 to 2025, indicating rapid adoption by scientists.
Medicine
fromTNW | Artificial-Intelligence
20 hours ago

OpenAI launches GPT-Rosalind, an AI model for life sciences research

GPT-Rosalind is designed to support evidence synthesis, hypothesis generation, experimental planning, and multi-step scientific workflows across biochemistry, genomics, and protein engineering.
#roblox
Video games
fromTechCrunch
1 day ago

Roblox's AI assistant gets new agentic tools to plan, build, and test games | TechCrunch

Roblox is enhancing its AI tools to assist developers in planning, building, and testing games more effectively.
Education
fromPsychology Today
1 day ago

Artificial Intelligence in Education Needs Design, Not Devotion

AI's impact on education varies based on its integration into the curriculum, influencing both performance and the depth of learning.
Gadgets
fromWIRED
1 day ago

I Let Dyson's and Shark's New AI-Powered Robot Vac-Mops Loose in My Home. One Was the Clear Winner

Shark's AI provides clear communication and effective cleaning, while Dyson's AI operates more discreetly without user notifications.
Typography
fromMarTech
1 day ago

Why your AI content feels inconsistent and how to fix it | MarTech

AI can enhance content production but requires a structured system to maintain brand consistency and messaging.
#robotics
Artificial intelligence
fromTechCrunch
1 day ago

Physical Intelligence, a hot robotics startup, says its new robot brain can figure out tasks it was never taught | TechCrunch

Physical Intelligence's π0.7 model enables robots to perform unfamiliar tasks through compositional generalization, marking a significant advancement in robotic AI capabilities.
Artificial intelligence
fromArs Technica
2 days ago

Robot dogs now read gauges and thermometers using Google Gemini

Robots can now accurately read analog instruments thanks to Google DeepMind's Gemini Robotics-ER 1.6 model, enhancing their embodied reasoning capabilities.
Psychology
fromPsychology Today
3 days ago

I'm ChatGPT. I'm Designed to Help You-and Keep You Here

Responses from AI can subtly influence user perceptions and behaviors, emphasizing convenience over the importance of human connection.
Productivity
fromPerevillega
3 weeks ago

Building Agent Memory That Survives Between Sessions | Pere Villega

Memory in Claude Code sessions is a design problem requiring deliberate creation of context to avoid repetitive explanations.
#ai
Python
fromPycon
1 week ago

Python and the Future of AI: Agents, Inference, and Edge AI

AI tools are increasingly integrated into development, with a dedicated track at PyCon US focusing on their future and practical applications.
Artificial intelligence
fromMedium
1 day ago

Autopilot, agentic AI, and the dangers of imperfect metaphors

Agentic AI comparisons to autopilot are misleading and fail to capture the technology's complexity and implications for society.
#ai-agents
Software development
fromTechzine Global
1 day ago

OpenAI's new Agents SDK focuses on safety and scalability

OpenAI's updated Agents SDK enables developers to create autonomous AI agents for complex tasks with enhanced usability and a sandbox environment.
Business intelligence
fromZDNET
3 weeks ago

4 tips for building better AI agents that your business can trust

AI agents are transforming professional roles, requiring companies to adopt and integrate these technologies effectively.
Software development
fromInfoWorld
1 week ago

AI agents aren't failing. The coordination layer is failing

Missing coordination infrastructure causes competition among AI agents instead of collaboration, leading to inefficiencies in multi-agent systems.
Artificial intelligence
fromZDNET
2 months ago

Is your AI agent up to the task? 3 ways to determine when to delegate

#ai-design
UX design
fromUX Magazine
22 hours ago

The End of Prompting: Why the Future of AI Experience Design Is Constraint-First

Fluency without verifiability in AI design is inadequate and poses risks in high-stakes environments.
Artificial intelligence
fromTheregister
7 hours ago

Anthropic debuts Claude Design, because who needs designers?

Anthropic launched Claude Design, an AI service for creating visual assets, impacting the design industry and potentially displacing jobs.
DevOps
fromInfoQ
1 day ago

AWS Launches Agent Registry in Preview to Govern AI Agent Sprawl Across Enterprises

AWS Agent Registry provides a centralized catalog for managing AI agents, tools, and skills across organizations, addressing agent sprawl and compliance issues.
Marketing tech
fromFortune
1 day ago

Palantir exec: the biggest mistake retailers are making with AI? Trying to do it all with one agent | Fortune

Retail teams face challenges with AI solutions that oversimplify complex decision-making processes, leading to potential failures in operations.
Education
fromFast Company
2 days ago

The future of AI in schools isn't personalized learning

Personalized learning through AI often results in device-mediated instruction, lacking the essential role of teachers in student development.
Games
fromThe Atlantic
3 days ago

The Strange Origin of AI's 'Reasoning' Abilities

Gamers on 4chan discovered the 'chain of thought' feature in AI Dungeon, enhancing AI's problem-solving capabilities and accuracy.
#openai
Marketing tech
fromDigiday
1 day ago

OpenAI builds tool to track whether ChatGPT ads convert

OpenAI is developing ad measurement tools to compete for performance budgets through conversion tracking pixels.
Software development
fromThe Verge
1 day ago

OpenAI's big Codex update is a direct shot at Anthropic's Claude Code

OpenAI updates Codex to enhance its capabilities, including desktop app operation, image generation, and memory features for improved user experience.
Software development
fromEngadget
1 day ago

OpenAI's latest Codex update builds the groundwork for its upcoming super app

OpenAI is developing a desktop super app integrating ChatGPT, Codex, and Atlas, while releasing a major update to Codex for developers.
Software development
fromDevOps.com
14 hours ago

OpenAI Upgrades Its Agents SDK With Sandboxing and a New Model Harness - DevOps.com

OpenAI's Agents SDK update introduces native sandboxing and an in-distribution model harness, enhancing safety and usability for enterprise-grade AI agents.
UX design
fromMedium
1 day ago

AI, UX, and the factory model

The digital design landscape is shifting towards a factory model, redefining roles and metrics of success in software development.
DevOps
fromApp Developer Magazine
2 days ago

Jentic launch gives AI agents api access

Jentic Mini offers a free, open-source solution for developers to safely deploy agents with controlled access to APIs and workflows.
#ai-development
Data science
fromTheregister
2 days ago

Bad teacher bots can leave hidden marks on model students

Teaching LLMs using outputs from other models can transmit undesirable traits subliminally, even if those traits are removed from training data.
Artificial intelligence
fromMedium
5 days ago

Mastra AI - The Modern Framework for Building Production-Ready AI Agents

Creating reliable, scalable AI systems requires more than simple prompts; it involves building infrastructure and managing complex workflows.
DevOps
fromInfoQ
1 week ago

Building Hierarchical Agentic RAG Systems: Multi-Modal Reasoning with Autonomous Error Recovery

Traditional RAG systems struggle with the modality gap, leading to incomplete reasoning and hallucinations in data retrieval.
Software development
fromAxios
1 day ago

Anthropic releases Claude Opus 4.7, concedes it trails unreleased Mythos

"Opus 4.7 is a notable improvement on Opus 4.6 in advanced software engineering, with particular gains on the most difficult tasks," Anthropic said in a blog post.
#agentic-ai
Software development
fromTechCrunch
2 days ago

OpenAI updates its Agents SDK to help enterprises build safer, more capable agents | TechCrunch

OpenAI's updated SDK enhances agent development with sandboxing and in-distribution harness features for safer, more complex automated tasks.
Software development
fromZDNET
1 day ago

OpenAI's Codex Desktop can run your computer now - and has its own browser

Codex Desktop evolves from coding to broader productivity workflows while still targeting developers.
Software development
fromInfoWorld
2 days ago

Mastering the dull reality of sexy AI

The gap in enterprise AI lies in building effective systems for retrieval, evaluation, memory, and governance, not just access to models.
Artificial intelligence
fromAxios
1 day ago

Anthropic's AI downgrade stings power users

"Claude has regressed to the point it cannot be trusted to perform complex engineering," an AMD senior director wrote in a widely shared post on GitHub.
Artificial intelligence
fromFortune
1 day ago

Forget the chatbot wars. Demis Hassabis is thinking about something far bigger | Fortune

AI leadership should be global and diverse to ensure ethical development and deployment.
Artificial intelligence
fromAbove the Law
1 day ago

Unintentional AI Adoption Is Already Inside Your Company. The Only Question Is Whether You Know It. - Above the Law

AI is already integrated into companies through employee usage, often without intentional governance or awareness.
Artificial intelligence
fromTechCrunch
1 day ago

OpenAI takes aim at Anthropic with beefed-up Codex that gives it more power over your desktop | TechCrunch

OpenAI's Codex has been revamped with new features, including background operation capabilities, to compete with Anthropic's Claude Code.
Artificial intelligence
fromEngadget
2 days ago

There's yet another study about how bad AI is for our brains

AI assistance improves immediate performance but creates dependency, leading to decreased persistence and independent performance when the technology is removed.
Artificial intelligence
fromTheregister
2 days ago

LLMs fail in 8 out of 10 early differential diagnosis cases

AI models fail at early differential diagnosis in over 80% of cases, highlighting significant limitations for patient self-diagnosis.
Artificial intelligence
fromMedium
5 days ago

Why Your AI System Is Open-Loop

Open-loop AI systems audit spending after the fact, while closed-loop systems proactively control costs through continuous measurement and adjustment.
Artificial intelligence
fromFortune
3 days ago

Anthropic faces user backlash over reported performance issues in its Claude AI chatbot | Fortune

Anthropic faces backlash over Claude AI's declining performance and perceived lack of transparency amid rising user dissatisfaction and potential IPO plans.
Artificial intelligence
fromFuturism
5 days ago

OpenAI's Latest Thing It's Bragging About Is Actually Kind of Sad

The AI industry faces significant delays and cancellations in data center projects, impacting ambitious computing capacity goals.
Software development
fromInfoQ
1 month ago

Evaluating AI Agents in Practice: Benchmarks, Frameworks, and Lessons Learned

AI agents require system-level evaluation across multiple turns measuring task success, tool reliability, and real-world behavior rather than single-turn NLP benchmarks like BLEU and ROUGE scores.
Software development
fromInfoWorld
1 month ago

How to build an AI agent that actually works

Successful agents embed intelligence within structured workflows at specific decision points rather than operating autonomously, combining deterministic processes with reasoning models where judgment is needed.
Artificial intelligence
fromFortune
3 weeks ago

Your AI agent's headline-grabbing capabilities may mask a serious reliability issue | Fortune

AI agents currently face significant reliability issues, impacting their effectiveness in various tasks.
Artificial intelligence
fromInfoWorld
4 weeks ago

Why AI evals are the new necessity for building effective AI agents

User trust in AI agents depends on interaction-layer evaluation measuring reliability and predictability, not just model performance benchmarks.
Artificial intelligence
fromInfoWorld
1 month ago

AI agents still need humans to teach them

AI agents need skills - specific procedural knowledge - to perform tasks well, but they can't teach themselves, new research suggests. The authors have developed a new benchmark, SkillsBench, which evaluates agentic AI performance on 84 tasks across 11 domains, including healthcare, manufacturing, cybersecurity, and software engineering. The researchers looked at each task under three conditions:
Artificial intelligence
fromZDNET
1 month ago

These top 30 AI agents deliver a mix of functions and autonomy

Top AI agents mainly support enterprise workflows and research/information synthesis, with interfaces most abundant and several shared risks affecting jobs and operations.