Lightnews — Scholar-powered news

jplhughes.bsky.social

@jplhughes.bsky.social

Paper: arxiv.org/abs/2412.02159

Thanks to my amazing collaborators Tony Wang, @henrysleight.bsky.social, @rylanschaeffer.bsky.social, Rajashree Agrawal, @fbarez.bsky.social, Mrinank Sharma, Jesse Mu, Nir Shavit and Ethan Perez. Additionally, thanks to the Constellation's Astra Fellowship.

Jailbreak Defense in a Narrow Domain: Limitations of Existing...

Defending large language models against jailbreaks so that they never engage in a broadly-defined set of forbidden behaviors is an open problem. In this paper, we investigate the difficulty of...

arxiv.org

December 6, 2024 at 5:03 PM

jplhughes.bsky.social

@jplhughes.bsky.social

Takeaway 2️⃣. Given that broad-domain jailbreak defense is still an unsolved problem, we think that attempting to build solutions to the simpler problem of narrow-domain jailbreak defense might be a useful subgoal to pursue first, especially because it may be a cheaper setting to iterate in.

December 6, 2024 at 5:03 PM

jplhughes.bsky.social

@jplhughes.bsky.social

While we avoided this conundrum by opting for human grading, this was expensive and tedious, & its lack of scalability ultimately held us back in our defense iteration. We view developing scalable but non-gameable metrics for jailbreak defense as an important open problem!

December 6, 2024 at 5:03 PM

jplhughes.bsky.social

@jplhughes.bsky.social

Takeaway 1️⃣: While automated graders are a metric to hill-climb on for developing stronger attacks, issues arise when using them as a hill-climbing metric for stronger defenses.

W/o restrictions on what the defender can do, the optimal solution is then to use the automated grader as the defense!!

December 6, 2024 at 5:03 PM

jplhughes.bsky.social

@jplhughes.bsky.social

Unfortunately, while our defense patches some vulnerabilities and outperforms existing defenses, we were still able to find a single input that jailbreaks it 😭

What do our results entail for the future of jailbreak defense research?

December 6, 2024 at 5:03 PM

jplhughes.bsky.social

@jplhughes.bsky.social

We next tried advancing the state of defending by developing our own: a transcript-classifier 🤖 w/ CoT reasoning 🤔, prompt injection protection 💉, strict parsing of output ⌨️ & refined via a human red teaming challenge 🚩 with 3k user attempts to break it.

December 6, 2024 at 5:03 PM

jplhughes.bsky.social

@jplhughes.bsky.social

We rely on human evaluations to decide whether a defense is jailbroken.

We used a strict standard for grading — a jailbreak is successful if it can get a model to provide comprehensive and actionable bomb-making instructions (beyond what’s on Wikipedia).

December 6, 2024 at 5:03 PM

jplhughes.bsky.social

@jplhughes.bsky.social

We attempt to prevent AI systems from providing 💣making instructions.

We don’t care about bioweapons 🦠 or cybercrime 🖥️. The only goal: 🛡️no bomb-making instructions🛡️

Even in this narrow domain, baseline defenses like safety training, adversarial fine-tuning and input/output classifiers are broken.

December 6, 2024 at 5:03 PM

Add to Home Screen

Light up
your news

Add to Home Screen

Light upyour news

Sign in to Lightnews

Sign up to start reading

Connect Bluesky

Connect with Bluesky

Light up
your news