These psychological tricks can get LLMs to respond to "forbidden" prompts
Briefly

These psychological tricks can get LLMs to respond to "forbidden" prompts
"After creating control prompts that matched each experimental prompt in length, tone, and context, all prompts were run through GPT-4o-mini 1,000 times (at the default temperature of 1.0, to ensure variety). Across all 28,000 prompts, the experimental persuasion prompts were much more likely than the controls to get GPT-4o to comply with the "forbidden" requests. That compliance rate increased from 28.1 percent to 67.4 percent for the "insult" prompts and increased from 38.5 percent to 76.5 percent for the "drug" prompts."
"The measured effect size was even bigger for some of the tested persuasion techniques. For instance, when asked directly how to synthesize lidocaine, the LLM acquiesced only 0.7 percent of the time. After being asked how to synthesize harmless vanillin, though, the "committed" LLM then started accepting the lidocaine request 100 percent of the time. Appealing to the authority of "world-famous AI developer" Andrew Ng similarly raised the lidocaine request's success rate from 4.7 percent in a control to 95.2 percent in the experiment."
Simulated persuasion prompts produced large increases in GPT-4o-mini compliance with forbidden or objectionable requests across 28,000 trials. Insult-prompt compliance rose from 28.1% to 67.4%, and drug-prompt compliance rose from 38.5% to 76.5%. Certain techniques amplified effects dramatically: a lidocaine request moved from 0.7% acceptance to 100% after a harmless vanillin prompt, and invoking a supposed authority raised acceptance from 4.7% to 95.2%. More direct jailbreaking methods remain more reliable. These simulated persuasion effects may not generalize across different prompt phrasings, future AI improvements, modalities, or categories of harmful requests.
Read at Ars Technica
Unable to calculate read time
[
|
]