"Swiss cheese security", stacking layers of imperfect defenses, is a key part of AI companies' plans to safeguard models, and is used to secure Anthropic's Opus 4 model. Our new STACK attack breaks each layer in turn, highlighting this approach may be less secure than hoped.
Learn from @furongh.bsky.social @cassidylaidlaw.bsky.social + Tianwei Zhang, Jiaming Ji, Animesh Mukherjee, Robin Staes-Polet, Weiyan Shi, Huiqi Deng, Yinpeng Dong, Xiaoyuan Yi, Pin-Yu Chen & Baoyuan Wu
🔗👇
@sebfar.bsky.social, Ryan Greenblatt, Rohin Shah, Neel Nanda, Fabien Roger, Alex Mallen, Stephen McAleer, Tomek Korbak, Steve Kelly + more on how to keep advanced AI systems under control. Blog + playlist below.👇
Evan Miyazono calls for broader adoption of formal proofs, suggesting a new paradigm where AI produces code to meet human specifications.
We had 30 talks covering topics from superintelligence risk (@yoshuabengio) to safety pre-training (@zicokolter), emergent misalignment (Owain Evans), & more—thank you to all!
From robust evals to security R&D & moral patienthood, we covered the edge of AI control theory and practice. Thanks to Ryan Greenblatt, Rohin Shah, Alex Mallen, Stephen McAleer, Tomek Korbak, Steve Kelly & others for their insights.
David Duvenaud chats with @dfrsrchtwts in this AXRP episode about evaluating sabotage risks in frontier AI models and what happens after AGI.
Annie jumped right into the rhythm, literally! On day #1, she grabbed maracas at our retreat and hasn’t missed a beat since. Just 3 weeks in, she helped run the Paris AI Security Forum for 200 people. Excited for what’s next!
Kimin Lee from KAIST introduces MobileSafetyBench, a benchmark for evaluating the safety of AI agents on mobile devices.
Join 150+ engineers, researchers & policymakers from leading labs, academia & government to discuss challenges in AI system security & safety. Hear from speakers at RAND, Palo Alto Networks, Dreadnode, Pattern Labs & more. 🔗👇
Chirag Agarwal examines the (un)reliability of chain-of-thought reasoning, highlighting issues in faithfulness, uncertainty & hallucination.