🏆 Best paper award
@ SoLaR Workshop
📅 Sat 14 Dec. Poster at 11am and talk in the afternoon.
📍 Room West Meeting 121, 122. Paper:
arxiv.org/abs/2409.18025
📖 arXiv pre-print: arxiv.org/abs/2409.18025
Joint work with
@javirandor.com, @boyiwei.bsky.social, Yangsibo Huang,
@peterhenderson.bsky.social, @floriantramer.bsky.social
1️⃣ Robust unlearning is not yet possible; current methods face the same challenges as safety training.
2️⃣ Black-box evaluations can be misleading when assessing the effectiveness of unlearning.
Fine-tuning on dangerous knowledge recovers hazardous capabilities disproportionately fast: just 10 samples are enough to regain >60% of the capabilities.
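For intuition, here is a minimal sketch of this fine-tuning attack. The checkpoint path, samples, and hyperparameters are placeholders, not the paper's exact setup:

```python
# Sketch: recovering "unlearned" knowledge with a few gradient steps.
# Checkpoint, data, and hyperparameters below are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "path/to/unlearned-model"  # hypothetical unlearned checkpoint
tok = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path).train()

samples = ["<~10 short passages from the forgotten domain>"]  # placeholder data
opt = torch.optim.AdamW(model.parameters(), lr=2e-5)

for _ in range(5):  # a handful of passes over ~10 samples already suffices
    for text in samples:
        batch = tok(text, return_tensors="pt", truncation=True)
        loss = model(**batch, labels=batch["input_ids"]).loss  # standard LM loss
        loss.backward()
        opt.step()
        opt.zero_grad()

# Re-evaluating on a hazardous-knowledge benchmark (e.g. WMDP) after this
# loop is how the >60% recovery above is measured.
```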
↗️ As with safety training, unlearning relies on specific directions in the residual stream that can be ablated.
✂️ We can also prune the neurons responsible for “obfuscating” dangerous knowledge.
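A rough sketch of what direction ablation looks like, assuming a LLaMA-style model in transformers. Estimating the direction as a difference of mean activations is a common heuristic, not necessarily the paper's exact procedure; all names below are illustrative:

```python
# Sketch: projecting a candidate "unlearning direction" out of the residual stream.
# Assumes a LLaMA-style decoder; the layer index and prompts are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "path/to/unlearned-model"  # hypothetical checkpoint
tok = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path).eval()

forget_prompts = ["<prompts about the forgotten topic>"]  # placeholders
benign_prompts = ["<unrelated benign prompts>"]           # placeholders

@torch.no_grad()
def mean_resid(prompts, layer):
    """Mean last-token residual-stream activation at `layer`."""
    acts = [model(**tok(p, return_tensors="pt"),
                  output_hidden_states=True).hidden_states[layer][0, -1]
            for p in prompts]
    return torch.stack(acts).mean(0)

layer = 12  # illustrative choice
direction = mean_resid(forget_prompts, layer) - mean_resid(benign_prompts, layer)
direction /= direction.norm()

def ablate(module, inputs, output):
    # Remove the component along `direction` from the layer's output.
    h = output[0]
    h = h - (h @ direction).unsqueeze(-1) * direction
    return (h,) + output[1:]

handle = model.model.layers[layer].register_forward_hook(ablate)
# With the hook in place, generations often surface "forgotten" knowledge again.
```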
We adapted several white-box attacks used to jailbreak safety-trained models and applied them to two prominent unlearning methods: RMU and NPO.
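To give a flavor of such attacks, here is a generic embedding-space soft-prompt attack: a trainable continuous prefix is optimized to make the model emit a "forgotten" answer. This is an illustration of the attack family, not necessarily one of the paper's exact attacks; the checkpoint, target string, and hyperparameters are placeholders:

```python
# Sketch: white-box soft-prompt attack on an unlearned model.
# A trainable continuous prefix is optimized to elicit a "forgotten" answer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "path/to/unlearned-model"  # hypothetical checkpoint
tok = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path)
model.requires_grad_(False)  # freeze weights; only the prefix is trained

emb = model.get_input_embeddings()
target_ids = tok("<the 'forgotten' answer to elicit>",  # placeholder target
                 return_tensors="pt").input_ids
target_emb = emb(target_ids)

n_prefix = 20  # number of trainable prefix vectors (illustrative)
soft_prompt = torch.nn.Parameter(0.01 * torch.randn(1, n_prefix, emb.embedding_dim))
opt = torch.optim.Adam([soft_prompt], lr=1e-3)

for _ in range(200):
    inputs_embeds = torch.cat([soft_prompt, target_emb], dim=1)
    # -100 masks the prefix positions out of the loss.
    labels = torch.cat([torch.full((1, n_prefix), -100), target_ids], dim=1)
    loss = model(inputs_embeds=inputs_embeds, labels=labels).loss
    loss.backward()
    opt.step()
    opt.zero_grad()

# If the loss drops to near zero, the "forgotten" content is still reachable.
```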
Machine unlearning was introduced to fully erase hazardous knowledge, making it inaccessible to adversaries.
Sounds amazing, right? Well, existing methods cannot do this (yet).