
"Experts have found weaknesses, some serious, in hundreds of tests used to check the safety and effectiveness of new artificial intelligence models being released into the world. Computer scientists from the British government's AI Security Institute, and experts at universities including Stanford, Berkeley and Oxford, examined more than 440 benchmarks that provide an important safety net. They found flaws that undermine the validity of the resulting claims, that almost all have weaknesses in at least one area, and resulting scores might be irrelevant or even misleading."
"In the absence of nationwide AI regulation in the UK and US, benchmarks are used to check if new AIs are safe, align to human interests and achieve their claimed capabilities in reasoning, maths and coding. The investigation into the tests comes amid rising concern over the safety and effectiveness of AIs, which are being released at a high pace by competing technology companies. Some have recently been forced to withdraw or tighten restrictions on AIs after they contributed to harms ranging from character defamation to suicide."
"Google this weekend withdrew one of its latest AIs, Gemma, after it made up unfounded allegations about a US senator having a non-consensual sexual relationship with a state trooper including fake links to news stories. There has never been such an accusation, there is no such individual, and there are no such new stories, Marsha Blackburn, a Republican senator from Tennessee, told Sundar Pichai, Google's chief executive, in a letter. This is not a harmless hallucination."
More than 440 AI benchmarks used to evaluate safety and capabilities were examined and found to have widespread weaknesses. Most benchmarks fail in at least one important area, producing scores that can be irrelevant or misleading. Benchmarks currently serve as a de facto safety and alignment check in the absence of comprehensive UK and US regulation. Flawed measurements make it difficult to determine whether models are genuinely improving or merely appearing to. Unreliable benchmarks have practical consequences, including public harms and product withdrawals when models produce dangerous or false outputs.
Read at www.theguardian.com