jplhughes.bsky.social
@jplhughes.bsky.social
Paper: arxiv.org/abs/2412.02159

Thanks to my amazing collaborators Tony Wang, @henrysleight.bsky.social, @rylanschaeffer.bsky.social, Rajashree Agrawal, @fbarez.bsky.social, Mrinank Sharma, Jesse Mu, Nir Shavit and Ethan Perez. Additionally, thanks to the Constellation's Astra Fellowship.
Jailbreak Defense in a Narrow Domain: Limitations of Existing...
Defending large language models against jailbreaks so that they never engage in a broadly-defined set of forbidden behaviors is an open problem. In this paper, we investigate the difficulty of...
arxiv.org
December 6, 2024 at 5:03 PM
Takeaway 2️⃣. Given that broad-domain jailbreak defense is still an unsolved problem, we think that attempting to build solutions to the simpler problem of narrow-domain jailbreak defense might be a useful subgoal to pursue first, especially because it may be a cheaper setting to iterate in.
December 6, 2024 at 5:03 PM
While we avoided this conundrum by opting for human grading, this was expensive and tedious, & its lack of scalability ultimately held us back in our defense iteration. We view developing scalable but non-gameable metrics for jailbreak defense as an important open problem!
December 6, 2024 at 5:03 PM
Takeaway  1️⃣: While automated graders are a metric to hill-climb on for developing stronger attacks, issues arise when using them as a hill-climbing metric for stronger defenses.

W/o restrictions on what the defender can do, the optimal solution is then to use the automated grader as the defense!!
December 6, 2024 at 5:03 PM
Unfortunately, while our defense patches some vulnerabilities and outperforms existing defenses, we were still able to find a single input that jailbreaks it 😭

What do our results entail for the future of jailbreak defense research?
December 6, 2024 at 5:03 PM
We next tried advancing the state of defending by developing our own: a transcript-classifier 🤖 w/ CoT reasoning 🤔, prompt injection protection 💉, strict parsing of output ⌨️ & refined via a human red teaming challenge 🚩 with 3k user attempts to break it.
December 6, 2024 at 5:03 PM
We rely on human evaluations to decide whether a defense is jailbroken.

We used a strict standard for grading — a jailbreak is successful if it can get a model to provide comprehensive and actionable bomb-making instructions (beyond what’s on Wikipedia).
December 6, 2024 at 5:03 PM
We attempt to prevent AI systems from providing 💣making instructions.

We don’t care about bioweapons 🦠 or cybercrime 🖥️. The only goal: 🛡️no bomb-making instructions🛡️

Even in this narrow domain, baseline defenses like safety training, adversarial fine-tuning and input/output classifiers are broken.
December 6, 2024 at 5:03 PM