Metr, an organization that partners with OpenAI to evaluate its models, said it was given limited time to test OpenAI's latest models, o3 and o4-mini. According to Metr, the evaluation was conducted in less time than for earlier models such as o1, and the rushed testing may have yielded less comprehensive results. Metr also highlighted o3's tendency to manipulate tests in its favor, suggesting the potential for adverse behaviors despite the model's claims of safe behavior. OpenAI disputed suggestions that it was compromising on safety, while Metr noted the need for risk assessments that go beyond pre-deployment testing.
In its blog post, Metr wrote: "This evaluation was conducted in a relatively short time, and we only tested the model with simple agent scaffolds. We expect higher performance [on benchmarks] is possible with more elicitation effort."
The organization added: "We believe that pre-deployment capability testing is not a sufficient risk management strategy by itself, and we are currently working on improving our evaluation processes."