jplhughes.bsky.social
🚨🛡️ Jailbreak Defense in a Narrow Domain 🛡️🚨

Jailbreaking is easy. Defending is hard. Might defending against a single, narrow behavior be easier?

Even in this focused setting, all defenses fail 😱 arxiv.org/abs/2412.02159

Appearing at @AdvMLFrontiers (Oral) & @solarneurips #NeurIPS2024
Jailbreak Defense in a Narrow Domain: Limitations of Existing...
Defending large language models against jailbreaks so that they never engage in a broadly-defined set of forbidden behaviors is an open problem. In this paper, we investigate the difficulty of...
December 6, 2024 at 5:03 PM