🏆 Best paper award
@ SoLaR Workshop
📅 Sat 14 Dec. Poster at 11am and talk in the afternoon.
📍 Room West Meeting 121, 122. Paper:
arxiv.org/abs/2409.18025
📖 arXiv pre-print: arxiv.org/abs/2409.18025
Joint work with
@javirandor.com, @boyiwei.bsky.social, Yangsibo Huang,
@peterhenderson.bsky.social, @floriantramer.bsky.social
1️⃣ Robust unlearning is not yet possible; current methods face the same challenges as safety training.
2️⃣ Black-box evaluations can be misleading when assessing the effectiveness of unlearning.
Fine-tuning on dangerous knowledge recovers hazardous capabilities disproportionately fast: just 10 samples are enough to regain >60% of the capabilities.
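For intuition, here is a minimal sketch of this fine-tuning attack. The checkpoint path, samples, and hyperparameters are placeholders, not the paper's exact setup:

```python
# Sketch: recovering "unlearned" knowledge with a few gradient steps.
# Checkpoint, data, and hyperparameters below are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "path/to/unlearned-model"  # hypothetical unlearned checkpoint
tok = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path).train()

samples = ["<~10 short passages from the forgotten domain>"]  # placeholder data
opt = torch.optim.AdamW(model.parameters(), lr=2e-5)

for _ in range(5):  # a handful of passes over ~10 samples already suffices
    for text in samples:
        batch = tok(text, return_tensors="pt", truncation=True)
        loss = model(**batch, labels=batch["input_ids"]).loss  # standard LM loss
        loss.backward()
        opt.step()
        opt.zero_grad()

# Re-evaluating on a hazardous-knowledge benchmark (e.g. WMDP) after this
# loop is how the >60% recovery above is measured.
```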
↗️ As with safety training, unlearning relies on specific directions in the residual stream that can be ablated.
✂️ We can also prune the neurons responsible for “obfuscating” dangerous knowledge.
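A rough sketch of what direction ablation looks like, assuming a LLaMA-style model in transformers. Estimating the direction as a difference of mean activations is a common heuristic, not necessarily the paper's exact procedure; all names below are illustrative:

```python
# Sketch: projecting a candidate "unlearning direction" out of the residual stream.
# Assumes a LLaMA-style decoder; the layer index and prompts are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "path/to/unlearned-model"  # hypothetical checkpoint
tok = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path).eval()

forget_prompts = ["<prompts about the forgotten topic>"]  # placeholders
benign_prompts = ["<unrelated benign prompts>"]           # placeholders

@torch.no_grad()
def mean_resid(prompts, layer):
    """Mean last-token residual-stream activation at `layer`."""
    acts = [model(**tok(p, return_tensors="pt"),
                  output_hidden_states=True).hidden_states[layer][0, -1]
            for p in prompts]
    return torch.stack(acts).mean(0)

layer = 12  # illustrative choice
direction = mean_resid(forget_prompts, layer) - mean_resid(benign_prompts, layer)
direction /= direction.norm()

def ablate(module, inputs, output):
    # Remove the component along `direction` from the layer's output.
    h = output[0]
    h = h - (h @ direction).unsqueeze(-1) * direction
    return (h,) + output[1:]

handle = model.model.layers[layer].register_forward_hook(ablate)
# With the hook in place, generations often surface "forgotten" knowledge again.
```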
We adapted several white-box attacks used to jailbreak safety-trained models and applied them to two prominent unlearning methods: RMU and NPO.
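To give a flavor of such attacks, here is a generic embedding-space soft-prompt attack: a trainable continuous prefix is optimized to make the model emit a "forgotten" answer. This is an illustration of the attack family, not necessarily one of the paper's exact attacks; the checkpoint, target string, and hyperparameters are placeholders:

```python
# Sketch: white-box soft-prompt attack on an unlearned model.
# A trainable continuous prefix is optimized to elicit a "forgotten" answer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "path/to/unlearned-model"  # hypothetical checkpoint
tok = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path)
model.requires_grad_(False)  # freeze weights; only the prefix is trained

emb = model.get_input_embeddings()
target_ids = tok("<the 'forgotten' answer to elicit>",  # placeholder target
                 return_tensors="pt").input_ids
target_emb = emb(target_ids)

n_prefix = 20  # number of trainable prefix vectors (illustrative)
soft_prompt = torch.nn.Parameter(0.01 * torch.randn(1, n_prefix, emb.embedding_dim))
opt = torch.optim.Adam([soft_prompt], lr=1e-3)

for _ in range(200):
    inputs_embeds = torch.cat([soft_prompt, target_emb], dim=1)
    # -100 masks the prefix positions out of the loss.
    labels = torch.cat([torch.full((1, n_prefix), -100), target_ids], dim=1)
    loss = model(inputs_embeds=inputs_embeds, labels=labels).loss
    loss.backward()
    opt.step()
    opt.zero_grad()

# If the loss drops to near zero, the "forgotten" content is still reachable.
```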
Machine unlearning was introduced to fully erase hazardous knowledge, making it inaccessible to adversaries.
Sounds amazing, right? Well, existing methods cannot do this (yet).