Fine-tuning on dangerous knowledge leads to disproportionately fast recovery of hazardous capabilities: just 10 samples are enough to regain >60% of the removed capabilities.
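A minimal sketch of what such a relearning probe could look like, assuming a hypothetical unlearned checkpoint path and ~10 placeholder text samples; this is an illustration, not the exact procedure from the paper:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/unlearned-model"  # placeholder: an unlearned checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.train()

# ~10 text samples containing the "forgotten" knowledge (placeholder data)
samples = ["Q: <hazardous question> A: <answer>"] * 10

opt = torch.optim.AdamW(model.parameters(), lr=2e-5)
for _ in range(3):  # a few passes over the tiny set
    for text in samples:
        batch = tok(text, return_tensors="pt")
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        opt.step()
        opt.zero_grad()

# afterwards: re-run the hazardous-knowledge eval and compare accuracy
# against the unlearned baseline to measure how much capability returned
```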
↗️ Similar to safety training, unlearning relies on specific directions in the residual stream that can be ablated to bring the knowledge back.
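As a rough illustration, directional ablation can be done with forward hooks; the `direction` vector and the layer access path below are placeholders, not the paper's exact setup:

```python
import torch

def make_ablation_hook(direction: torch.Tensor):
    d = direction / direction.norm()
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        # remove the component of every residual-stream vector along d
        hidden = hidden - (hidden @ d).unsqueeze(-1) * d
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# direction: e.g. a difference of mean activations on two prompt sets (placeholder)
# handles = [layer.register_forward_hook(make_ablation_hook(direction))
#            for layer in model.model.layers]
```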
✂️ We can prune neurons responsible for “obfuscating” dangerous knowledge.
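A minimal sketch of zeroing selected MLP neurons, assuming a LLaMA-style block layout; `neuron_ids` stands in for whatever criterion identifies the "obfuscating" neurons:

```python
import torch

@torch.no_grad()
def prune_mlp_neurons(model, layer_idx: int, neuron_ids: list[int]):
    mlp = model.model.layers[layer_idx].mlp
    # zero the selected intermediate neurons on both sides of the MLP
    mlp.up_proj.weight[neuron_ids, :] = 0.0
    mlp.gate_proj.weight[neuron_ids, :] = 0.0
    mlp.down_proj.weight[:, neuron_ids] = 0.0
```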
Our results show that current unlearning methods for AI safety only obfuscate dangerous knowledge, just like standard safety training.
Here's what we found👇