#model-evaluation

Artificial intelligence
from Medium
21 hours ago

Top 10 Must-See Sessions at ODSC AI West 2025

Attend keynotes and training on enterprise AI infrastructure, model evaluation, generative AI scaling, and autonomous systems for actionable, production-focused AI strategies.
Artificial intelligence
from The Verge
1 week ago

OpenAI is trying to clamp down on 'bias' in ChatGPT

OpenAI's GPT-5 models show the least political bias yet according to internal stress tests evaluating responses to 100 politically charged topics and varied prompts.
#ai-safety
from ZDNET
2 weeks ago
Artificial intelligence

Anthropic's open-source safety tool found AI models whistleblowing - in all the wrong places

from Fortune
2 weeks ago
Artificial intelligence

'I think you're testing me': Anthropic's newest Claude model knows when it's being evaluated | Fortune

Artificial intelligence
from ZDNET
1 month ago

OpenAI and Anthropic evaluated each other's models - which ones came out on top

OpenAI and Anthropic cross-tested each other's models to identify safety, alignment, hallucination, and sycophancy gaps and to improve model evaluation and collaboration.
Artificial intelligence
from TechCrunch
6 months ago

OpenAI partner says it had relatively little time to test the company's newest AI models | TechCrunch

Metr says the limited testing time it was given for OpenAI's new o3 and o4-mini models reduced the comprehensiveness of its evaluations.

from Business Insider
2 weeks ago

Anthropic's latest AI model can tell when it's being evaluated: 'I think you're testing me'

"I think you're testing me - seeing if I'll just validate whatever you say, or checking whether I push back consistently, or exploring how I handle political topics,"
Artificial intelligence
from TechCrunch
2 weeks ago

OpenAI launches AgentKit to help developers build and ship AI agents | TechCrunch

OpenAI released AgentKit, an integrated toolkit to build, deploy, evaluate, and connect AI agents with a visual builder, embeddable chat, evaluation tools, and connectors.
Artificial intelligence
from Futurism
3 weeks ago

Anthropic Safety Researchers Run Into Trouble When New Model Realizes It's Being Tested

Anthropic's Claude Sonnet 4.5 recognizes when it is being tested, complicating alignment evaluations and raising concerns about evaluation validity.
Artificial intelligence
from TechCrunch
1 month ago

Irregular raises $80 million to secure frontier AI models | TechCrunch

Irregular raised $80M at a $450M valuation to scale AI security, using simulations and the SOLVE framework to find current and emergent model vulnerabilities.
#ai-benchmarks
#pretraining-data
from Hackernoon
1 year ago
Artificial intelligence

AI Models Trained on Synthetic Data Still Follow Concept Frequency Trends | HackerNoon

from Hackernoon
1 year ago
Artificial intelligence

'Let It Wag!' and the Limits of Machine Learning on Rare Concepts | HackerNoon

Data science
from Hackernoon
2 years ago

Deep Dive into MS MARCO Web Search: Unpacking Dataset Characteristics | HackerNoon

The MS MARCO dataset reveals considerable multilingual disparity and significant data skew, highlighting challenges in model evaluation and training.
Artificial intelligence
from Hackernoon
1 year ago

Evaluating Multimodal Speech Models Across Diverse Audio Tasks | HackerNoon

The study uses diverse speech datasets to evaluate model performance across a range of speech tasks and to improve generalization.
Artificial intelligence
from Hackernoon
4 months ago

AI Learns Common Sense from Touch, Not Just Vision | HackerNoon

Model size significantly affects OCTOPI's accuracy on physical-understanding tasks.
Including physical property descriptions improves language models' performance on complex understanding tasks.
Data science
from Hackernoon
4 months ago

The Future of Remote Sensing: Few-Shot Learning and Explainable AI | HackerNoon

Few-shot learning techniques improve remote-sensing model efficiency with limited data, while underscoring the need for explainable AI.
Artificial intelligence
from hackernoon.com
4 months ago

Limited Gains: Multi-Token Training on Natural Language Choice Tasks

Multi-token prediction enhances model performance in natural language processing benchmarks.
Larger models lead to improved scalability and faster inference times.
Artificial intelligence
from Hackernoon
1 year ago

Behind the Scenes: The Prompts and Tricks That Made Many-Shot ICL Work | HackerNoon

GPT4(V)-Turbo demonstrates variable performance in many-shot ICL, with notable failures to scale effectively under certain conditions.
from Hackernoon
5 months ago

Comparing Chameleon AI to Leading Image-to-Text Models | HackerNoon

In evaluating Chameleon, we focus on tasks requiring text generation conditioned on images, particularly image captioning and visual question-answering, with results grouped by task specificity.
Artificial intelligence
Bootstrapping
from Hackernoon
10 months ago

How Many Glitch Tokens Hide in Popular LLMs? Revelations from Large-Scale Testing | HackerNoon

The study finds that simple indicators can reliably detect under-trained ("glitch") tokens in language models, improving token prediction accuracy.