Henry!
henrysleight.bsky.social
This was the first project I started on with the Astra Fellowship at Constellation, and it's so bittersweet to see it come out now :))
🚨🛡️ Jailbreak Defense in a Narrow Domain 🛡️🚨

Jailbreaking is easy. Defending is hard. Might defending against a single, narrow behavior be easier?

Even in this focused setting, all defenses fail 😱 arxiv.org/abs/2412.02159

Appearing at @AdvMLFrontiers (Oral) & @solarneurips #NeurIPS2024
Jailbreak Defense in a Narrow Domain: Limitations of Existing...
Defending large language models against jailbreaks so that they never engage in a broadly-defined set of forbidden behaviors is an open problem. In this paper, we investigate the difficulty of...
arxiv.org
December 6, 2024 at 5:20 PM