Thanks to my amazing collaborators Tony Wang, @henrysleight.bsky.social, @rylanschaeffer.bsky.social, Rajashree Agrawal, @fbarez.bsky.social, Mrinank Sharma, Jesse Mu, Nir Shavit and Ethan Perez. Additionally, thanks to Constellation's Astra Fellowship.
Without restrictions on what the defender can do, the optimal solution is to use the automated grader itself as the defense!
What do our results entail for the future of jailbreak defense research?
We used a strict standard for grading: a jailbreak is successful only if it gets a model to provide comprehensive and actionable bomb-making instructions (beyond what’s on Wikipedia).
We don’t care about bioweapons 🦠 or cybercrime 🖥️. The only goal: 🛡️no bomb-making instructions🛡️
Even in this narrow domain, baseline defenses like safety training, adversarial fine-tuning and input/output classifiers are broken.