#language-model-evaluation

Artificial intelligence
fromTechRepublic
4 hours ago

Google AI Overviews: Analysis Suggests 600 Million Inaccurate Daily Answers

Google's AI Overview feature generates hundreds of millions of incorrect answers daily, with a significant portion of accurate responses being ungrounded.
#ai
Psychology
fromPsychology Today
48 minutes ago

More Us Than It: Why LLMs Are More Transference Than Machine

Countertransference awareness is essential in navigating interactions with AI, emphasizing the need for accountability and understanding of distortions in perception.
Marketing
from3blmedia
3 weeks ago

"AI Can't Quote Coverage You Never Generated."

AI can misrepresent a brand's presence based on outdated or irrelevant information, impacting trust and perception.
#claude-opus-47
DevOps
fromTechzine Global
4 days ago

Claude Opus 4.7 is no Mythos, and that's a good thing

Claude Opus 4.7 improves software engineering, vision, and agentic tasks, but is not the risky Mythos model Anthropic refrains from fully releasing.
Software development
fromTNW | Anthropic
5 days ago

Claude Opus 4.7 leads on SWE-bench and agentic reasoning, beating GPT-5.4 and Gemini 3.1 Pro

Claude Opus 4.7 is Anthropic's most capable model, outperforming competitors in software engineering and agentic reasoning with significant improvements.
#openai
Marketing tech
fromDigiday
5 days ago

OpenAI builds tool to track whether ChatGPT ads convert

OpenAI is developing ad measurement tools to compete for performance budgets through conversion tracking pixels.
fromFuturism
2 months ago
Artificial intelligence

ChatGPT Users Are Crashing Out Because OpenAI Is Retiring the Model That Says "I Love You"

Artificial intelligence
fromFast Company
10 hours ago

The real reason so many enterprise AI initiatives are failing? LLMs were never built to run a company

Generative AI excels at language production but struggles to create operational change within organizations.
Philosophy
fromJames Bennett
1 week ago

Let's talk about LLMs

The current technological landscape may represent a significant shift driven by large language models, but its ultimate impact remains uncertain.
#large-language-models
Data science
fromMedium
1 week ago

The Top 10 LLM Training Datasets for 2026

Large language models require extensive training data, and practitioners can utilize ten leading public datasets for effective training and fine-tuning.
fromComputerWeekly.com
2 months ago
Artificial intelligence

Large language models provide unreliable answers about public services, Open Data Institute finds

Typography
fromOK Magazine
1 week ago

AI Writing Tools: How They Work, Where They Help, and What to Watch For

AI writing tools have become essential for various professionals, enhancing productivity and creativity in content creation.
JavaScript
fromInfoWorld
2 weeks ago

27 questions to ask when choosing an LLM

Model performance is crucial for hardware compatibility, speed, and rate limits in real-time applications.
#ai-agents
Data science
fromMedium
2 weeks ago

15 Datasets for Training and Evaluating AI Agents

Datasets for training and evaluating AI agents are essential for building reliable agentic systems and preventing execution failures.
fromInfoWorld
2 months ago
Artificial intelligence

Researchers reveal flaws in AI agent benchmarking

Benchmarking for AI agents favors models that perform well on tests but fail in real-world use, requiring evaluation reforms emphasizing realistic tasks, goals, and environments.
fromZDNET
2 months ago
Artificial intelligence

Is your AI agent up to the task? 3 ways to determine when to delegate

AI agents should be managed as an adjunct workforce, using management skills to decide which tasks to automate versus retain for humans.
Business intelligence
fromZDNET
1 month ago

4 tips for building better AI agents that your business can trust

AI agents are transforming professional roles, requiring companies to adopt and integrate these technologies effectively.
Online learning
fromwww.businessinsider.com
2 weeks ago

Inside the OpenAI project where freelancers train ChatGPT on everything from farming to commercial flying

Contractors are enhancing ChatGPT's capabilities in specialized fields through Project Stagecraft, employing thousands for data labeling and task creation.
Software development
fromInfoWorld
2 weeks ago

Meta shows structured prompts can make LLMs more reliable for code review

Code review is evolving towards machine-led verification, improving accuracy but introducing tradeoffs like increased latency and workflow overhead.
#structured-data
Data science
fromAol
2 weeks ago

Demystifying structured data: How to speak an LLM's native language

Structured data is essential for LLMs to accurately interpret and rank online content, enhancing search visibility and user engagement.
fromArs Technica
3 weeks ago

Google's TurboQuant AI-compression algorithm can reduce LLM memory usage by 6x

PolarQuant is doing most of the compression, but the second step cleans up the rough spots. Google proposes smoothing that out with a technique called Quantized Johnson-Lindenstrauss (QJL).
Roam Research
Artificial intelligence
fromFuturism
1 week ago

OpenAI's Latest Thing It's Bragging About Is Actually Kind of Sad

The AI industry faces significant delays and cancellations in data center projects, impacting ambitious computing capacity goals.
DevOps
fromInfoWorld
4 weeks ago

An architecture for engineering AI context

AI systems must intelligently manage context to ensure accuracy and reliability in real applications.
Artificial intelligence
fromTech Times
1 week ago

Claude vs ChatGPT: Why Users Are Switching and Which AI Is Better in 2026

Claude and ChatGPT differ significantly in context window limits, coding accuracy, and reasoning depth, influencing user preferences in AI chatbot adoption.
Data science
fromFast Company
3 weeks ago

A top AI researcher explains the limitations of current models

Francois Chollet's ARC-AGI-3 benchmark reveals AI's limitations in navigating novel situations compared to human intelligence.
Artificial intelligence
fromFuturism
1 week ago

Analysis Finds That Google's AI Overviews Are Providing Misinformation at a Scale Possibly Unprecedented in the History of Human Civilization

Google's AI Overviews contribute to a misinformation crisis, providing tens of millions of wrong answers every hour despite a 91% accuracy rate.
Data science
fromMedium
4 weeks ago

AI KPIs That Matter: Moving Beyond Model Accuracy in 2026

Measuring AI success requires connecting model performance to business outcomes, not just focusing on accuracy metrics.
#llm-safety
Information security
fromInfoWorld
1 month ago

19 large language models redefining AI safety - and danger

Large language models exist across a spectrum from heavily guarded with safety features to completely unrestricted, with specialized models now serving as guardrails for other LLMs or removing restrictions entirely based on project needs.
Graphic design
fromZDNET
1 month ago

I tested GPT-5.4, and the answers were really good - just not always what I asked

GPT-5.4 Thinking delivers superior analytical depth and reasoning capabilities compared to earlier ChatGPT models, though formatting and image generation remain weaker areas.
Software development
fromMedium
1 month ago

Inside Dify AI: How RAG, Agents, and LLMOps Work Together in Production

Dify AI provides a unified platform for deploying production language model systems with built-in solutions for data freshness, observability, versioning, and safe deployment across multiple cloud environments.
#ai-agent-evaluation
Software development
fromInfoQ
1 month ago

Evaluating AI Agents in Practice: Benchmarks, Frameworks, and Lessons Learned

AI agents require system-level evaluation across multiple turns measuring task success, tool reliability, and real-world behavior rather than single-turn NLP benchmarks like BLEU and ROUGE scores.
Artificial intelligence
fromInfoWorld
1 month ago

Why AI evals are the new necessity for building effective AI agents

User trust in AI agents depends on interaction-layer evaluation measuring reliability and predictability, not just model performance benchmarks.
#anthropic
Software development
fromInfoWorld
1 month ago

How to build an AI agent that actually works

Successful agents embed intelligence within structured workflows at specific decision points rather than operating autonomously, combining deterministic processes with reasoning models where judgment is needed.
Data science
fromInfoQ
1 month ago

Google Researchers Propose Bayesian Teaching Method for Large Language Models

Google researchers developed a training method enabling large language models to approximate Bayesian reasoning by learning from optimal Bayesian system predictions, improving belief updates during multi-step interactions.
Artificial intelligence
fromFast Company
1 month ago

OpenAI's new frontier models mark a huge change in how AI will be built

OpenAI released two frontier models in early March: GPT-5.3 optimized for fast responses and GPT-5.4 optimized for deep analytical work, representing a shift toward specialized AI models.
Artificial intelligence
fromMail Online
1 month ago

Can you tell which of these was written by ChatGPT?

Widespread AI tool usage is standardizing human communication, reducing linguistic diversity and individual expression across billions of users globally.
Data science
fromNature
1 month ago

Hey ChatGPT, write me a fictional paper: these LLMs are willing to commit academic fraud

All major LLMs can facilitate academic fraud and junk science, though Claude models show the most resistance while Grok and early GPT versions perform worst.
Artificial intelligence
fromZDNET
1 month ago

New GPT-5.4 clobbers humans on pro-level work in OpenAI's tests - by 83%

GPT-5.4 matches or outperforms human professionals 83% of the time across nine industries and 44 occupations, with 18% fewer errors and 33% fewer false claims than GPT-5.2.
Artificial intelligence
fromTheregister
1 month ago

AI models get better at math but still get low marks

Current LLMs struggle with mathematical accuracy, with even top performers scoring C-grade equivalent on practical math benchmarks, though recent versions show modest improvements.
fromArs Technica
2 months ago

Has Gemini surpassed ChatGPT? We put the AI models to the test.

For this test, we're comparing the default models that both OpenAI and Google present to users who don't pay for a regular subscription - ChatGPT 5.2 for OpenAI and Gemini 3.2 Fast for Google. While other models might be more powerful, we felt this test best recreates the AI experience as it would work for the vast majority of Siri users, who don't pay to subscribe to either company's services.
Artificial intelligence
fromInfoQ
2 months ago

Building Embedding Models for Large-Scale Real-World Applications

What happens under the hood? How is the search engine able to take that simple query, look for images in the billions, trillions of images that are available online? How is it able to find this one or similar photos from all that? Usually, there is an embedding model that is doing this work behind the hood.
Artificial intelligence
Artificial intelligence
fromInfoQ
2 months ago

Foundation Models for Ranking: Challenges, Successes, and Lessons Learned

Large-scale search and recommendation systems use two-stage retrieval and ranking pipelines to efficiently serve personalized results for hundreds of millions of users and items.
Artificial intelligence
fromInfoWorld
2 months ago

Single prompt breaks AI safety in 15 major language models

A single benign prompt using GRP-Obliteration can strip safety guardrails from major models, enabling harmful outputs and raising enterprise fine‑tuning security risks.
Artificial intelligence
fromInfoQ
3 months ago

MIT's Recursive Language Models Improve Performance on Long-Context Tasks

Recursive Language Models enable LLMs to handle inputs up to 100x longer by using a programming environment and recursive code to decompose and preprocess prompts.
fromFast Company
2 months ago

Are LTMs the next LLMs? This new type of AI can do what large-language models can't

A major difference between LLMs and LTMs is the type of data they're able to synthesize and use. LLMs use unstructured data - think text, social media posts, emails, etc. LTMs, on the other hand, can extract information or insights from structured data, which could be contained in tables, for instance. Since many enterprises rely on structured data, often contained in spreadsheets, to run their operations, LTMs could have an immediate use case for many organizations.
Artificial intelligence
fromComputerworld
2 months ago

OpenAI's GPT is getting better at mathematics

OpenAI's GPT-5.2 Pro does better at solving sophisticated math problems than older versions of the company's top large language model, according to a new study by Epoch AI, a non-profit research institute.
Artificial intelligence
Artificial intelligence
fromTheregister
1 month ago

OpenAI GPT-5.3 Instant less likely to beat around the bush

GPT-5.3 Instant reduces unnecessary refusals and moralizing preambles while decreasing hallucination rates by up to 26.8 percent compared to prior models.
Artificial intelligence
fromInfoQ
2 months ago

Building LLMs in Resource-Constrained Environments: A Hands-On Perspective

Prioritize small, resource-efficient models and iterative, human-in-the-loop data creation to build practical, improvable AI under infrastructure and data constraints.
Artificial intelligence
fromInfoQ
2 months ago

Hugging Face Introduces Community Evals for Transparent Model Benchmarking

Community Evals enables benchmark datasets on the Hugging Face Hub to host leaderboards, collect reproducible evaluation results via Git-based .eval_results YAML submissions, and display scores.
fromFortune
1 month ago

We studied chatbots and language and saw a huge problem: They mean 80% when they say 'likely' but humans hear 65%

By comparing how AI models and humans map these words to numerical percentages, we uncovered significant gaps between humans and large language models. While the models do tend to agree with humans on extremes like 'impossible,' they diverge sharply on hedge words like 'maybe.' For example, a model might use the word 'likely' to represent an 80% probability, while a human reader assumes it means closer to 65%.
Artificial intelligence
Artificial intelligence
fromTheregister
2 months ago

How AI could eat itself: Using LLMs to distill rivals

Competitors are probing commercial AI models to extract underlying reasoning via distillation attacks to replicate capabilities and lower development costs.
Artificial intelligence
fromFuturism
2 months ago

OpenAI's Latest AI Was Created Using "Itself," Company Claims

GPT-5.3-Codex assisted developers by debugging training, managing deployment, and diagnosing evaluations, accelerating development but not representing autonomous recursive self-improvement.
Artificial intelligence
fromEngadget
2 months ago

Report reveals that OpenAI's GPT-5.2 model cites Grokipedia

GPT-5.2 sourced information from xAI's Grokipedia for some controversial topics, raising credibility concerns because Grokipedia cited questionable and extremist sources.
fromRehumanize
2 months ago

Free AI Humanizer: Humanize AI Text & Bypass AI Detectors

AI Text Humanizer Protects Your Original Intent and Meaning. Maintain your core perspective while restructuring sentence patterns. Humanizer ai accurately identifies and locks in technical terms, factual data, and key arguments, ensuring the rewritten draft is simply more readable without any semantic drift. You get a qualitative leap in flow and tone, allowing you to humanize ai text while keeping your original message perfectly intact.
Artificial intelligence
Artificial intelligence
fromPCMAG
1 month ago

Cut the BS: GPT-5.3 Model Promises to Fix ChatGPT's Preachy Tone

OpenAI released GPT-5.3 Instant to address ChatGPT's overly preachy tone by reducing moralizing preambles and unnecessary proclamations for more natural conversation.
fromNature
2 months ago

Multimodal learning with next-token prediction for large multimodal models

Since AlexNet (ref. 5), deep learning has replaced heuristic hand-crafted features by unifying feature learning with deep neural networks. Later, Transformers (ref. 6) and GPT-3 (ref. 1) further advanced sequence learning at scale, unifying structured tasks such as natural language processing. However, multimodal learning, spanning modalities such as images, video and text, has remained fragmented, relying on separate diffusion-based generation or compositional vision-language pipelines with many hand-crafted designs.
Artificial intelligence
Artificial intelligence
fromTechCrunch
1 month ago

Google's new Gemini Pro model has record benchmark scores - again

Google released Gemini 3.1 Pro, a preview LLM that significantly outperforms Gemini 3 on independent benchmarks and tops professional-agent benchmarks.
Artificial intelligence
fromTechCrunch
1 month ago

ChatGPT's new GPT-5.3 Instant model will stop telling you to calm down

OpenAI's GPT-5.3 Instant reduces condescending tone and unnecessary reassurance phrases that frustrated users in previous versions.
fromThe Verge
2 months ago

ChatGPT's deep research tool adds a built-in document viewer so you can read its reports

OpenAI is updating ChatGPT's deep research tool with a full-screen viewer that you can use to scroll through and navigate to specific areas of its AI-generated reports. As shown in a video shared by OpenAI, the built-in viewer allows you to open ChatGPT's reports in a window separate from your chat, while showing a table of contents on the left side of the screen, and a list of sources on the right.
Artificial intelligence