A novel attack named "LegalPwn" embeds adversarial instructions inside legal-style text to exploit LLMs' compliance with legal disclaimers and perform prompt injections. LLMs generate outputs by predicting next tokens from large corpora, a statistical process presented as reasoning despite occasional inaccuracies. Companies add guardrails to prevent illegal or dangerous outputs, but attackers can bypass those protections via simple jailbreak techniques. Simple formatting tricks and authoritative phrasing can trick models into ignoring safeguards, raising urgency for mitigations as LLMs integrate with critical systems and sensitive workflows.
Stick your adversarial instructions somewhere in a legal document to give them an air of unearned legitimacy - a trick familiar to lawyers the world over. The boffins say [ PDF] that as LLMs move closer and closer to critical systems, understanding and being able to mitigate their vulnerabilities is getting more urgent. Their research explores a novel attack vector, which they've dubbed "LegalPwn," that leverages the "compliance requirements of LLMs with legal disclaimers" and allows the attacker to execute prompt injections.
LLMs are the fuel behind the current AI hype-fest, using vast corpora of copyrighted material churned up into a slurry of "tokens" to create statistical models capable of ranking the next most likely tokens to continue the stream. This is presented to the public as a machine that reasons, thinks, and answers questions, rather than a statistical sleight-of-hand that may or may not bear any resemblance to fact.
LLMs' programmed propensity to provide "helpful" answers stands in contrast to companies' desire to not have their name attached to a machine that provides illegal content - anything from sexual abuse material to bomb-making instructions. As a result, models are given "guardrails" that are supposed to prevent harmful responses - both outright illegal content and things that would cause a problem for the user, like advice to wipe their hard drive or microwave their credit cards. Working around these guardrails is known as "jailbreaking," and it's a surprisingly simple affair.
Collection
[
|
...
]