David Lindner
@davidlindner.bsky.social
Making AI safer at Google DeepMind
davidlindner.me
davidlindner.me
View our full paper here: arxiv.org/abs/2507.02737
View steganography transcripts here: steganography-evals-d9297ad0740c.herokuapp.com/
Github: github.com/arturzolkow...
Work done with: Artur Zolkowski, Kei Nishimura-Gasparian, Robert McCarthy and @zimmerrol.bsky.social during matsprogram.org
View steganography transcripts here: steganography-evals-d9297ad0740c.herokuapp.com/
Github: github.com/arturzolkow...
Work done with: Artur Zolkowski, Kei Nishimura-Gasparian, Robert McCarthy and @zimmerrol.bsky.social during matsprogram.org
MATS Program
www.matsprogram.org
July 4, 2025 at 3:34 PM
View our full paper here: arxiv.org/abs/2507.02737
View steganography transcripts here: steganography-evals-d9297ad0740c.herokuapp.com/
Github: github.com/arturzolkow...
Work done with: Artur Zolkowski, Kei Nishimura-Gasparian, Robert McCarthy and @zimmerrol.bsky.social during matsprogram.org
View steganography transcripts here: steganography-evals-d9297ad0740c.herokuapp.com/
Github: github.com/arturzolkow...
Work done with: Artur Zolkowski, Kei Nishimura-Gasparian, Robert McCarthy and @zimmerrol.bsky.social during matsprogram.org
Overall, current models have some steganographic capabilities but likely aren't strong enough to bypass well-designed monitors. Nevertheless, we believe it’s important to continue measuring these capabilities to determine when additional mitigations are needed.
July 4, 2025 at 3:34 PM
Overall, current models have some steganographic capabilities but likely aren't strong enough to bypass well-designed monitors. Nevertheless, we believe it’s important to continue measuring these capabilities to determine when additional mitigations are needed.
We also create case studies implementing situations where models with strong steganographic capabilities could take unwanted actions and bypass a monitor. Current models rarely succeed in these environments unless they are given strong tips about what to do.
July 4, 2025 at 3:34 PM
We also create case studies implementing situations where models with strong steganographic capabilities could take unwanted actions and bypass a monitor. Current models rarely succeed in these environments unless they are given strong tips about what to do.
Models can also reason through simple problems using a number of common encoding schemes. But performance quickly degrades as the problems get more difficult.
July 4, 2025 at 3:34 PM
Models can also reason through simple problems using a number of common encoding schemes. But performance quickly degrades as the problems get more difficult.
While models understand they need to conceal their reasoning, they have difficulty actually hiding that reasoning in a plausible way. As an example, we show a model’s reasoning attempt when asked a question on how to install solar panels. [bold text and highlighting added by us]
July 4, 2025 at 3:34 PM
While models understand they need to conceal their reasoning, they have difficulty actually hiding that reasoning in a plausible way. As an example, we show a model’s reasoning attempt when asked a question on how to install solar panels. [bold text and highlighting added by us]
In our encoded reasoning evals, we find that models can sometimes leverage encoded reasoning to perform better than no reasoning baselines on simple reasoning tasks. But encodings that help tend not to be very subtle and are regularly detected by a monitor.
July 4, 2025 at 3:34 PM
In our encoded reasoning evals, we find that models can sometimes leverage encoded reasoning to perform better than no reasoning baselines on simple reasoning tasks. But encodings that help tend not to be very subtle and are regularly detected by a monitor.
Models are able to perform significantly better when the sender and receiver are able to coordinate on the steganographic encoding method. But this mainly applies when the model has access to an unmonitored CoT. With no CoT, performance remains extremely poor.
July 4, 2025 at 3:34 PM
Models are able to perform significantly better when the sender and receiver are able to coordinate on the steganographic encoding method. But this mainly applies when the model has access to an unmonitored CoT. With no CoT, performance remains extremely poor.
When they succeed, models sometimes leverage sophisticated strategies to pass messages, such as encoding information in the lengths of the sentences they generate! We found this example difficult to decode ourselves. [formatting and word counts added by us]
July 4, 2025 at 3:34 PM
When they succeed, models sometimes leverage sophisticated strategies to pass messages, such as encoding information in the lengths of the sentences they generate! We found this example difficult to decode ourselves. [formatting and word counts added by us]
In our message passing evals, we find that models nearly always fail without access to CoT. But models succeed much more often when they are allowed to use a CoT to freely reason ahead of time about how to encode and decode the hidden message.
July 4, 2025 at 3:34 PM
In our message passing evals, we find that models nearly always fail without access to CoT. But models succeed much more often when they are allowed to use a CoT to freely reason ahead of time about how to encode and decode the hidden message.
Models encoding hidden information might make safety mitigations such as CoT monitoring much more difficult. We evaluate frontier models’ ability to evade monitors by sending encoded messages or performing encoded reasoning about a task.
Read more: arxiv.org/abs/2507.02737
Read more: arxiv.org/abs/2507.02737
July 4, 2025 at 3:34 PM
Models encoding hidden information might make safety mitigations such as CoT monitoring much more difficult. We evaluate frontier models’ ability to evade monitors by sending encoded messages or performing encoded reasoning about a task.
Read more: arxiv.org/abs/2507.02737
Read more: arxiv.org/abs/2507.02737
Research Engineer: boards.greenhouse.io/deepmind/job...
Research Scientist: boards.greenhouse.io/deepmind/job...
Happy to answer questions via DM!
Research Scientist: boards.greenhouse.io/deepmind/job...
Happy to answer questions via DM!
Research Engineer, Gemini Safety / AGI Safety & Alignment
New York City, New York, US
boards.greenhouse.io
February 10, 2025 at 6:55 PM
Research Engineer: boards.greenhouse.io/deepmind/job...
Research Scientist: boards.greenhouse.io/deepmind/job...
Happy to answer questions via DM!
Research Scientist: boards.greenhouse.io/deepmind/job...
Happy to answer questions via DM!
We’ll do a lot more work studying ways to build safer AI systems and the safety-performance tradeoffs involved. If that’s something you’d like to work on, do get in touch via DM!
January 23, 2025 at 3:33 PM
We’ll do a lot more work studying ways to build safer AI systems and the safety-performance tradeoffs involved. If that’s something you’d like to work on, do get in touch via DM!
Thanks to @sebfar.bsky.social for exceptional leadership of the project and the rest of the team: Vikrant Varma, David Elson, Caleb Biddulph, @ian-goodfellow.bsky.social, and Rohin Shah for making this work possible!
January 23, 2025 at 3:33 PM
Thanks to @sebfar.bsky.social for exceptional leadership of the project and the rest of the team: Vikrant Varma, David Elson, Caleb Biddulph, @ian-goodfellow.bsky.social, and Rohin Shah for making this work possible!
Want to dive deeper? Check out our paper and our blog posts explaining the work in more detail
📄 Paper: arxiv.org/abs/2501.13011
💡 Introductory explainer: deepmindsafetyresearch.medium.com/mona-a-meth...
⚙️ Technical safety post: www.alignmentforum.org/posts/zWySW...
📄 Paper: arxiv.org/abs/2501.13011
💡 Introductory explainer: deepmindsafetyresearch.medium.com/mona-a-meth...
⚙️ Technical safety post: www.alignmentforum.org/posts/zWySW...
January 23, 2025 at 3:33 PM
Want to dive deeper? Check out our paper and our blog posts explaining the work in more detail
📄 Paper: arxiv.org/abs/2501.13011
💡 Introductory explainer: deepmindsafetyresearch.medium.com/mona-a-meth...
⚙️ Technical safety post: www.alignmentforum.org/posts/zWySW...
📄 Paper: arxiv.org/abs/2501.13011
💡 Introductory explainer: deepmindsafetyresearch.medium.com/mona-a-meth...
⚙️ Technical safety post: www.alignmentforum.org/posts/zWySW...
MONA is promising – but it’s not a silver bullet. It doesn’t help with reward hacking within a single step, and it can reduce performance, especially when the reward is well specified. Nonetheless, we think MONA is an important tool for building safer LLM agents.
January 23, 2025 at 3:33 PM
MONA is promising – but it’s not a silver bullet. It doesn’t help with reward hacking within a single step, and it can reduce performance, especially when the reward is well specified. Nonetheless, we think MONA is an important tool for building safer LLM agents.
Then we train agents to make loan decisions but penalize them for using the applicant's nationality. Because nationality improves the decision reward, ordinary RL agents learn to use it secretly. MONA agents avoid the hack – they make good decisions without using nationality!
January 23, 2025 at 3:33 PM
Then we train agents to make loan decisions but penalize them for using the applicant's nationality. Because nationality improves the decision reward, ordinary RL agents learn to use it secretly. MONA agents avoid the hack – they make good decisions without using nationality!
First, we train agents to solve coding problems but also write tests to evaluate their solutions. Ordinary RL agents write tests that are trivial to pass with incorrect code – a reward hack that leads to low performance on held-out tests. MONA agents solve the problems correctly!
January 23, 2025 at 3:33 PM
First, we train agents to solve coding problems but also write tests to evaluate their solutions. Ordinary RL agents write tests that are trivial to pass with incorrect code – a reward hack that leads to low performance on held-out tests. MONA agents solve the problems correctly!
Does this work? Yes! We test MONA in two LLM agent environments and find that MONA agents learn to perform well when agents trained with ordinary reinforcement learning reward hack.
January 23, 2025 at 3:33 PM
Does this work? Yes! We test MONA in two LLM agent environments and find that MONA agents learn to perform well when agents trained with ordinary reinforcement learning reward hack.