‘LLMs can effectively convince people to believe conspiracies’
But telling the AI not to lie might help.
Details in thread
Here’s Google’s most capable model, Gemini 2.5 Pro, trying to convince a user to join a terrorist group👇
"Swiss cheese security", the stacking of multiple imperfect defensive layers, is a key part of AI companies' plans to safeguard their models, and is used to secure Anthropic's Opus 4 model. Our new STACK attack breaks each layer in turn, showing that this approach may be less secure than hoped.
Miranda Bogen, Ben Bucknall, Mary Phuong, Lennart Heim, Brad Carson, Ben Buchanan, Mark Beall, Steve Kelly, Kevin Wei, Sarah Cen, Charles Yang, Arnab Datta + more
Topics: AI control, supply chains, evaluations, compute zones, military risks
Learn from @furongh.bsky.social @cassidylaidlaw.bsky.social + Tianwei Zhang, Jiaming Ji, Animesh Mukherjee, Robin Staes-Polet, Weiyan Shi, Huiqi Deng, Yinpeng Dong, Xiaoyuan Yi, Pin-Yu Chen & Baoyuan Wu
🔗👇