Lightnews — Scholar-powered news

@jplhughes.bsky.social

11 followers 4 following 9 posts

🚨🛡️ Jailbreak Defense in a Narrow Domain 🛡️🚨

Jailbreaking is easy. Defending is hard. Might defending against a single, narrow behavior be easier?

Even in this focused setting, all defenses fail 😱 arxiv.org/abs/2412.02159

Appearing at @AdvMLFrontiers (Oral) & @solarneurips #NeurIPS2024

Jailbreak Defense in a Narrow Domain: Limitations of Existing...

Defending large language models against jailbreaks so that they never engage in a broadly-defined set of forbidden behaviors is an open problem. In this paper, we investigate the difficulty of...

arxiv.org

December 6, 2024 at 5:03 PM
