gleave.me
@gleave.me
With SOTA defenses, LLMs can be difficult even for experts to exploit. Yet developers often compromise on defenses to retain performance (e.g., low latency). This paper shows how these compromises can be exploited to break models, and how to implement defenses securely; a toy sketch of a layered defense pipeline follows the quoted post below.
far.ai FAR.AI @far.ai · Jul 2
1/
"Swiss cheese security", stacking layers of imperfect defenses, is a key part of AI companies' plans to safeguard models, and is used to secure Anthropic's Opus 4 model. Our new STACK attack breaks each layer in turn, highlighting this approach may be less secure than hoped.
July 2, 2025 at 6:17 PM
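To make the layered structure concrete, here is a minimal, purely illustrative sketch, not the paper's setup: an input filter and an output filter wrapped around a model. The `guarded_generate` function and the keyword-matching filters are hypothetical placeholders; the point is only that a request reaches the user if and only if it slips past every layer.

```python
from typing import Callable

def guarded_generate(
    prompt: str,
    input_filter: Callable[[str], bool],   # True = flag the prompt as unsafe
    model: Callable[[str], str],           # the (safety-trained) model itself
    output_filter: Callable[[str], bool],  # True = flag the completion as unsafe
    refusal: str = "Request blocked.",
) -> str:
    """Layered ("Swiss cheese") defense: an unsafe completion only reaches
    the user if every layer misses it."""
    if input_filter(prompt):           # layer 1: screen the incoming prompt
        return refusal
    completion = model(prompt)         # layer 2: the model's own refusal training
    if output_filter(completion):      # layer 3: screen the outgoing completion
        return refusal
    return completion

# Toy usage with keyword-matching stand-ins for real classifiers.
if __name__ == "__main__":
    naive_filter = lambda text: "forbidden" in text.lower()
    echo_model = lambda prompt: f"Echo: {prompt}"
    print(guarded_generate("hello world", naive_filter, echo_model, naive_filter))
```

Per the quoted post, STACK breaks layers like these one at a time rather than all at once, which is why the full stack may be less secure than hoped.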
So many great talks from the Singapore Alignment Workshop -- I look forward to catching up on those that I missed in person!
far.ai FAR.AI @far.ai · Jun 11
📺 More Singapore Alignment Workshop talks are live!
Learn from @furongh.bsky.social @cassidylaidlaw.bsky.social + Tianwei Zhang, Jiaming Ji, Animesh Mukherjee, Robin Staes-Polet, Weiyan Shi, Huiqi Deng, Yinpeng Dong, Xiaoyuan Yi, Pin-Yu Chen & Baoyuan Wu
🔗👇
June 11, 2025 at 7:09 PM
As I say in the video, innovation vs. safety is a false dichotomy -- do check out our speakers' great ideas for how innovation can enable effective policy in the highlights video 👇 and the initial talk recordings!
far.ai FAR.AI @far.ai · Jun 4
🎬 Technical Innovations for AI Policy wrapped this weekend with 200+ attendees and 30+ great speakers; highlights in video below 👇 We're also releasing our first 4 talk recordings from @alexbores.nyc, @hlntnr.bsky.social, Mark Beall and Brad Carson: buff.ly/4bV1yH4
June 4, 2025 at 1:08 PM
AI control is one of the most exciting new research directions; excited to have the videos from ControlConf, the world's first control-specific conference. Tons of great material, from intros to deep dives into specific areas!
far.ai FAR.AI @far.ai · May 5
🎥 ControlConf videos are live! Hear from @bshlgrs.bsky.social @gasteigerjo.bsky.social
@sebfar.bsky.social, Ryan Greenblatt, Rohin Shah, Neel Nanda, Fabien Roger, Alex Mallen, Stephen McAleer, Tomek Korbak, Steve Kelly + more on how to keep advanced AI systems under control. Blog + playlist below.👇
May 6, 2025 at 7:58 PM
Reposted
AI security needs more than just testing; it needs guarantees.
Evan Miyazono calls for broader adoption of formal proofs, suggesting a new paradigm where AI produces code to meet human specifications.
April 30, 2025 at 3:32 PM
Had a great time at the Singapore Alignment Workshop earlier this week -- a fantastic start to the ICLR week! My only complaint is that I missed many of the excellent talks because I was having so many interesting conversations. Looking forward to the videos to catch up!
far.ai FAR.AI @far.ai · Apr 23
Across continents and institutions, the Singapore Alignment Workshop brought top minds together on frontier AI safety.

We had 30 talks covering topics from superintelligence risk (@yoshuabengio) to safety pre-training (@zicokolter), emergent misalignment (Owain Evans), & more—thank you to all!
April 25, 2025 at 3:16 AM
Reposted
ControlConf 2025 Day 2 delivered!
From robust evals to security R&D & moral patienthood, we covered the edge of AI control theory and practice. Thanks to Ryan Greenblatt, Rohin Shah, Alex Mallen, Stephen McAleer, Tomek Korbak, Steve Kelly & others for their insights.
March 28, 2025 at 6:54 PM
My biggest complaint with the AI Security Forum was that there was too much great content across the three tracks. Looking forward to catching up on the talks I missed with the videos 👇
far.ai FAR.AI @far.ai · Mar 19
🎥 Paris AI Security Forum videos are released! Check out talks by @yoshuabengio.bsky.social @bshlgrs.bsky.social @zicokolter.bsky.social Sella Nevo, Dan Lahav, davidad, Xander Davies, Alex Robey, Mahmoud Ghanem, Christopher Painter, Dawn Song with more coming! Blog recap & links. 👇
March 20, 2025 at 4:02 PM
Excited to meet others working on or interested in alignment at the Alignment Workshop Open Social before ICLR!
far.ai FAR.AI @far.ai · Mar 12
💬 Big ideas, great people & a shared passion in AI alignment!
Attending #ICLR2025? Join us for the Alignment Workshop: Open Social to meet others engaged in AI safety in a relaxed & welcoming space.

📍 Novotel Singapore on Stevens
⏰ Wed, April 23 from 19:00–22:00

RSVP 👇
March 13, 2025 at 12:07 AM
Excited to see people before ICLR at Alignment Workshop Singapore!
far.ai FAR.AI @far.ai · Mar 11
Heading to #ICLR2025 in Singapore? Apply to the Alignment Workshop on April 23 to tackle the biggest challenges in making advanced AI systems safe.

Expect deep discussions, technical insights & a chance to connect with other researchers shaping the future of AI alignment. 👇
March 11, 2025 at 11:05 PM
Humans sometimes cheat at exams -- might AIs do the same? This poses a unique challenge for evaluating intelligent systems.
far.ai FAR.AI @far.ai · Mar 10
"How do we know AI models aren’t secretly sabotaging human oversight?"

David Duvenaud chats with @dfrsrchtwts on the AXRP episode about evaluating sabotage risks in frontier AI models and what happens after AGI.
38.8 - David Duvenaud on Sabotage Evaluations and the Post-AGI Future
March 10, 2025 at 6:48 PM
AI agents can start VMs, buy things, send e-mails, etc. AI control is a promising way to prevent harmful agent actions -- whether caused by accident, by adversarial attack, or by the systems themselves trying to subvert controls. Apply to the world's first control conference 👇
far.ai FAR.AI @far.ai · Mar 3
🚦Announcing ControlConf:
The world’s first conference dedicated to AI control - techniques to mitigate security risks from AI systems even if they’re trying to subvert those controls.
📅 March 27-28, 2025 in London.
March 3, 2025 at 9:44 PM
Since joining FAR.AI in June, Lindsay has delivered amazing events like the Alignment Workshop Bay Area and Paris Security Forum. Welcome to the team!
February 24, 2025 at 8:24 PM
Evaluations are key to understanding AI capabilities and risks -- but which ones matter? Enjoyed Soroush's talk exploring these issues!
far.ai FAR.AI @far.ai · Feb 17
A lot of evaluations don’t actually show us anything important or change existing beliefs.

Soroush Pour of Harmony Intelligence shares lessons from real-world AI risk evals: prioritize impactful insights, scalable engineering & alignment with partner needs.
February 17, 2025 at 8:05 PM
Excited to have Annie join our team, and help produce a 200-person event in her first month! We're growing across operations and technical roles -- check out opportunities 👇
far.ai FAR.AI @far.ai · Feb 14
🥁 Welcome Annie Greenwood as Events Project Manager!
Annie jumped right into the rhythm, literally! On day #1, she grabbed maracas at our retreat and hasn’t missed a beat since. Just 3 weeks in, she helped run the Paris AI Security Forum for 200 people. Excited for what’s next!
February 14, 2025 at 7:58 PM
I had great conversations at the AI Security Forum -- it's exciting to see people from cybersecurity, hardware root of trust, and AI come together to devise creative solutions that boost AI security.
far.ai FAR.AI @far.ai · Feb 10
The AI Security Forum in Paris gathered top minds to tackle 1 urgent mission: securing AI models to avert catastrophes and fast-track scientific and economic breakthroughs. Huge thanks to @yoshuabengio.bsky.social, Sella Nevo, Dan Lahav, davidad, Xander Davies plus all who spoke & joined!
February 10, 2025 at 8:42 PM
Formal verification has a lot of exciting applications, especially in the age of LLMs: e.g., can LLMs output programs with proofs of correctness? However, formally verifying neural network behavior in general seems intractable -- enjoyed Zac's talk on the limitations.
far.ai FAR.AI @far.ai · Jan 30
“It's important to avoid over-claiming about how much [formal verification] could solve our problems.”

Zac Hatfield-Dodds explains why we need to balance verification methods with practical safety work.
January 31, 2025 at 9:58 AM
Many eyes on the code help secure critical open-source software -- we similarly need independent scrutiny of AI models to catch issues and build trust in them. A safe harbor for evaluation, as proposed by @shaynelongpre.bsky.social, could enable an independent testing ecosystem.
far.ai FAR.AI @far.ai · Jan 27
A safe harbor for AI evaluation is a voluntary commitment to protect good faith research, reducing fear of repercussions.

Shayne Longpre calls for a secure framework with strict rules to enable responsible testing of AI systems—protecting innovation while preventing harm.
January 28, 2025 at 5:07 PM
Deceptive alignment is one of the more pernicious safety risks. I used to find it far-fetched, but LLMs are very good at persuasion, so they have the capability -- it's just a question of whether it's incentivized during training. Great to see work towards detecting deception!
far.ai FAR.AI @far.ai · Jan 23
“Maybe it's pretending to be helpful in training and then ‘takes the mask off’.”

Jacob Hilton draws a striking analogy between backdoors and deceptive alignment, examining how ‘secret triggers’ in AI could lead to hidden, harmful behavior.
January 23, 2025 at 9:45 PM
Gradient routing can influence where a neural network learns a particular skill -- useful for interpretability and control. A rough sketch of the idea follows the quoted post below.
far.ai FAR.AI @far.ai · Jan 20
“Our strategy: Don’t suppress the capability. Let the model learn it, but control where.”

Alex Turner explores how Gradient Routing can localize learned skills within neural networks to specific regions, limiting potential misuse.
January 21, 2025 at 6:14 AM
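A rough sketch of the idea above, assuming a simplified form of gradient routing. This is not the authors' implementation; the module names and the masking rule are made up for illustration. Batches tagged as carrying a particular skill only update a designated region of the network, so that skill ends up localized where it can be monitored or ablated.

```python
import torch
import torch.nn as nn

class TwoRegionNet(nn.Module):
    """Tiny model with a general-purpose region and a designated region
    that we route a particular skill into (names are illustrative)."""
    def __init__(self, d: int = 32):
        super().__init__()
        self.shared = nn.Linear(d, d)   # general-purpose region
        self.routed = nn.Linear(d, d)   # region the skill is localized to
        self.head = nn.Linear(d, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(torch.relu(self.shared(x)) + torch.relu(self.routed(x)))

def routed_step(model, opt, x, y, is_routed_batch: bool) -> float:
    """One training step. For batches tagged as carrying the routed skill,
    gradients outside the designated region are zeroed, so only that
    region learns from them (a simplified form of gradient routing)."""
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    if is_routed_batch:
        for name, p in model.named_parameters():
            if not name.startswith("routed") and p.grad is not None:
                p.grad.zero_()
    opt.step()
    return loss.item()

model = TwoRegionNet()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(8, 32), torch.randn(8, 1)
routed_step(model, opt, x, y, is_routed_batch=True)  # only the routed region updates
```

This is deliberately oversimplified; the point is just that controlling where gradient updates land lets a capability be confined to a known region of the network.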
Open-weight models are more flexible, decentralize power in AI, and fuel research. But their flexibility also allows them to be misused: e.g., the surge in gen-AI phishing. Tamper-resistant safeguards could give the best of both worlds! A challenging but very important research area.
far.ai FAR.AI @far.ai · Jan 16
Advanced open-weight models without safeguards risk misuse, as releasing them online gives access to everyone, including terrorist groups.

Mantas Mazeika explains how tamper-resistant safeguards in open-weight LLMs could curb misuse without harming performance.
January 17, 2025 at 8:03 PM
Reposted
"Balancing helpfulness and safety is a real challenge for AI agents in mobile environments."

Kimin Lee from KAIST introduces MobileSafetyBench—a tool for testing AI safety in mobile devices.
January 9, 2025 at 4:30 PM
Reposted
“We found that if you ask the LLM, surprisingly it always says that I'm 100% confident about my reasoning.”

Chirag Agarwal examines the (un)reliability of chain-of-thought reasoning, highlighting issues in faithfulness, uncertainty & hallucination.
January 13, 2025 at 4:30 PM
Looking forward to the AI Security Forum on Feb 9th -- just before the Paris AI Action Summit and two days after IASEAI. AI opens up new classes of security vulnerabilities, and AI model weights may be one of the most valuable assets, making AI infosec timely & fascinating!
far.ai FAR.AI @far.ai · Jan 10
🛡️🔐 Paris AI Security Forum 2025 on Feb 9! 🇫🇷
Join 150+ engineers, researchers & policymakers from leading labs, academia & government to discuss challenges in AI system security & safety. Hear from speakers at RAND, Palo Alto Networks, Dreadnode, Pattern Labs & more. 🔗👇
January 11, 2025 at 12:26 AM