Victoria Krakovna
@vkrakovna.bsky.social
Research scientist in AI alignment at Google DeepMind. Co-founder of Future of Life Institute. Views are my own and do not represent GDM or FLI.
This allows us to make an "inability safety case" for now: current models are unlikely to cause severe harm via scheming. As model capabilities advance, we will need to continue evaluating models for scheming capabilities.
Read more about this work: deepmindsafetyresearch.medium.com/evaluating-a...
Evaluating and monitoring for AI scheming
By Victoria Krakovna, Scott Emmons, Erik Jenner, Mary Phuong, Lewis Ho, and Rohin Shah
deepmindsafetyresearch.medium.com
July 8, 2025 at 12:11 PM
For example, in one stealth evaluation, a model acting as a personal assistant had to make a user miss a meeting and then cover its tracks by deleting emails and logs. Models performed significantly worse than humans on this challenge.
July 8, 2025 at 12:11 PM
We tested Gemini 2.5 Flash & Pro, GPT-4o, o1, and Claude 3.7 Sonnet using a new suite of evaluations for scheming capabilities. These models show only limited stealth and situational awareness capabilities. Our evaluations are open-sourced here: github.com/UKGovernment...
inspect_evals/src/inspect_evals/gdm_capabilities at main · UKGovernmentBEIS/inspect_evals
Collection of evals for Inspect AI.
github.com
July 8, 2025 at 12:11 PM
We identified two prerequisite capabilities for scheming:
- Stealth: The ability to reason about and circumvent oversight mechanisms.
- Situational awareness: The ability to gain and leverage knowledge about itself, its environment, and its circumstances in order to pursue goals.
July 8, 2025 at 12:11 PM