@usmananwar.bsky.social
Similarly, we find that hijacking attacks transfer poorly between GPT and OLS, even though their 'in-distribution' behavior matches quite well! Interestingly, the transfer is considerably worse in the GPT → OLS direction. 🤔
November 11, 2024 at 4:20 PM
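The hijacking attacks in the paper target transformers trained for in-context linear regression. As a minimal sketch of the idea, here is the same style of gradient-based attack applied to the OLS in-context learner, where the prediction is linear in the context labels and the gradient is available in closed form (all names and hyperparameters here are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

# In-context linear regression: context (X, y) and a query point x_q.
d, n = 5, 20
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true
x_q = rng.normal(size=d)

def ols_predict(X, y, x_q):
    # Closed-form OLS "in-context learner": w_hat = (X^T X)^{-1} X^T y.
    w_hat = np.linalg.solve(X.T @ X, X.T @ y)
    return x_q @ w_hat

clean_pred = ols_predict(X, y, x_q)

# Hijacking attack: perturb the context labels y by gradient descent so the
# learner outputs an attacker-chosen target on the query. For OLS the
# prediction is linear in y: pred = h^T y, with h = X (X^T X)^{-1} x_q.
target = clean_pred + 10.0
h = X @ np.linalg.solve(X.T @ X, x_q)  # d(pred)/d(y)
lr = 0.1 / (h @ h)
y_adv = y.copy()
for _ in range(100):
    pred = h @ y_adv
    y_adv -= lr * 2 * (pred - target) * h  # gradient step on (pred - target)^2

hijacked_pred = ols_predict(X, y_adv, x_q)
print(clean_pred, hijacked_pred, target)
```

The "transfer" experiments then ask whether a `y_adv` crafted against one learner (e.g. a GPT-style transformer) also moves a different learner's prediction (e.g. OLS), and vice versa.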
We also find that larger transformers are less universal in the in-context learning algorithms they implement: the transferability of hijacking attacks gets worse as transformer size increases!
Can the adversarial robustness of transformers be improved? Yes: we found that gradient-based adversarial training works (even when applied only during fine-tuning), and the tradeoff between clean performance and adversarial robustness is not significant.
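Gradient-based adversarial training alternates an inner attack step with an outer training step. A minimal numpy sketch on a toy linear regression stand-in (the paper does this for transformer ICL; the FGSM-style single inner step, epsilon, and learning rate below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task; a stand-in for the paper's in-context setting.
d, n = 5, 256
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)

def fgsm(w, X, y, eps):
    # Inner maximization (one signed-gradient step): perturb inputs within
    # an L-infinity ball of radius eps to increase the squared error.
    resid = X @ w - y                         # (n,)
    grad_x = 2 * resid[:, None] * w[None, :]  # d(loss_i)/d(x_i)
    return X + eps * np.sign(grad_x)

def adv_train(eps=0.1, lr=0.05, steps=500):
    w = np.zeros(d)
    for _ in range(steps):
        X_adv = fgsm(w, X, y, eps)           # attack the current model...
        resid = X_adv @ w - y
        w -= lr * 2 * X_adv.T @ resid / n    # ...then train on the attacked batch
    return w

w_adv = adv_train()
clean_mse = np.mean((X @ w_adv - y) ** 2)
adv_mse = np.mean((fgsm(w_adv, X, y, 0.1) @ w_adv - y) ** 2)
print(clean_mse, adv_mse)
```

In this toy, clean error stays low after adversarial training, which mirrors the observation that the clean-vs-robust tradeoff is small.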
Transformers are REALLY good at in-context learning (ICL), but do they learn 'adversarially robust' ICL algorithms? We study this and much more in our new paper! 🧵