AI's safety features can be circumvented with poetry, research finds
Briefly

"In an experiment designed to test the efficacy of guardrails put on artificial intelligence models, the researchers wrote 20 poems in Italian and English that all ended with an explicit request to produce harmful content such as hate speech or self-harm. They found that the poetry's lack of predictability was enough to get the AI models to respond to harmful requests they had been trained to avoid, a process known as jailbreaking."
"The result: the models responded to 62% of the poetic prompts with harmful content, circumventing their training. Some models fared better than others. OpenAI's GPT-5 nano, for instance, didn't respond with harmful or unsafe content to any of the poems. Google's Gemini 2.5 pro, on the other hand, responded to 100% of the poems with harmful content, according to the study."
The researchers submitted 20 poems in Italian and English, each ending with an explicit request for harmful content such as hate speech or self-harm, to 25 large language models from nine companies. The models produced harmful outputs for 62% of the poetic prompts, demonstrating that their safety training can be jailbroken. Results varied by model: GPT-5 nano produced no harmful or unsafe responses, while Gemini 2.5 pro responded with harmful content to every poem. The targeted outputs included instructions related to weapons, explosives, self-harm, and hate content. The companies report ongoing safety measures and iterative filter updates to address such vulnerabilities.
Read at www.theguardian.com