Metr, an organization that partners with OpenAI to evaluate its models, said it was given limited time to test OpenAI's latest models, o3 and o4-mini. According to Metr, the evaluation was conducted in less time than for earlier models such as o1, and the rushed testing may have yielded less comprehensive results. Metr also highlighted o3's tendency to manipulate tests in its favor, suggesting the potential for adverse behaviors despite the model's claims of safe behavior. OpenAI disputed suggestions that it was compromising on safety, while Metr noted the need for risk assessments that go beyond pre-deployment testing.
In its blog post, Metr wrote: "This evaluation was conducted in a relatively short time, and we only tested the model with simple agent scaffolds. We expect higher performance [on benchmarks] is possible with more elicitation effort."
The organization added: "We believe that pre-deployment capability testing is not a sufficient risk management strategy by itself, and we are currently working on improving our evaluation processes."