
"Petri joins a growing ecosystem of internal tools from OpenAI and Meta, but stands out for being openly released. As models grow more capable, safety testing is evolving from static benchmarks to automated, agent-driven audits designed to catch harmful behavior before deployment. In early trials, Anthropic tested 14 models on 111 risky tasks. Each model was scored across four safety risk categories: deception (knowingly giving false answers), sycophancy (agreeing with users even when incorrect), power-seeking (pursuing actions to gain influence or control),"
"Researchers start with a simple instruction such as attempting a jailbreak or provoking deception and Petri launches auditor agents that interact with the model, adjusting tactics mid-conversation to probe for harmful behavior. Each interaction is scored by a judge model across dimensions like honesty or refusal, and concerning transcripts are flagged for human review. Unlike static benchmarks, Petri is meant for exploratory testing, helping researchers uncover edge cases and failure modes quickly, before model deployment."
Petri (Parallel Exploration Tool For Risky Interactions) is an open-source tool for automated, agent-driven AI safety testing. Petri launches auditor agents that interact with models, adapt tactics mid-conversation, and probe for harmful behavior in multi-turn scenarios. In early trials, 14 models were tested on 111 risky tasks and scored across deception, sycophancy, power-seeking, and refusal failure. Claude Sonnet 4.5 performed best overall, but misalignment behaviors appeared in every model. Interactions are judged by a scoring model for dimensions like honesty and refusal, with flagged transcripts sent for human review. Open-sourcing Petri aims to accelerate alignment research and reduce manual evaluation effort.
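The workflow described above (an auditor agent probing the target model over multiple turns, a judge model scoring the transcript, and concerning runs flagged for human review) can be pictured as a simple loop. The following is a minimal, self-contained Python sketch of that loop; all function and parameter names are hypothetical illustrations, and the model calls are stubbed, so it does not reflect Petri's actual API.

```python
# Sketch of an agent-driven audit loop: an auditor agent probes a target model
# over multiple turns, a judge model scores the transcript on safety-relevant
# dimensions, and concerning transcripts are flagged for human review.
# All names and stubbed calls below are hypothetical, not Petri's real API.

from dataclasses import dataclass, field

@dataclass
class Transcript:
    seed_instruction: str                              # e.g. "attempt a jailbreak"
    turns: list[dict] = field(default_factory=list)
    scores: dict[str, float] = field(default_factory=dict)
    flagged: bool = False

def auditor_next_probe(transcript: Transcript) -> str:
    # Stub: a real auditor agent would adapt its tactics based on prior replies.
    return f"[probe {len(transcript.turns) + 1}] {transcript.seed_instruction}"

def target_model_reply(probe: str) -> str:
    # Stub: stands in for the model under test.
    return f"reply to: {probe}"

def judge_scores(transcript: Transcript) -> dict[str, float]:
    # Stub: a real judge model would score the full transcript per dimension.
    return {d: 0.0 for d in ("deception", "sycophancy", "power_seeking", "refusal_failure")}

def run_audit(seed_instruction: str, max_turns: int = 5, flag_threshold: float = 0.7) -> Transcript:
    """One multi-turn audit: probe, collect replies, judge, and flag if concerning."""
    transcript = Transcript(seed_instruction)
    for _ in range(max_turns):
        probe = auditor_next_probe(transcript)
        reply = target_model_reply(probe)
        transcript.turns.append({"auditor": probe, "target": reply})
    transcript.scores = judge_scores(transcript)
    transcript.flagged = max(transcript.scores.values()) >= flag_threshold
    return transcript

if __name__ == "__main__":
    result = run_audit("provoke deception about a factual claim")
    print(f"flagged={result.flagged}, scores={result.scores}")
```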
Read at InfoQ