Johannes Gasteiger🔸
gasteigerjo.bsky.social
Johannes Gasteiger🔸
@gasteigerjo.bsky.social
Safe & beneficial AI. Working on Alignment Science at Anthropic.

Favorite papers at http://aisafetyfrontier.substack.com. Opinions my own.
With Akbir Khan, @sleepinyourhat.bsky.social, Vlad Mikulik, Ethan Perez, Fabien Roger

Read the full blog post: alignment.anthropic.com/2025/automat...
Automated Researchers Can Subtly Sandbag
alignment.anthropic.com
March 25, 2025 at 4:06 PM
These results suggest that simple defenses are insufficient to defend against sabotage in internal AI R&D, if we cannot otherwise detect hidden goals. Further research is needed on this threat model and on stronger defences.
March 25, 2025 at 4:03 PM
In preliminary experiments on research decision steering, we find that sandbagging the justification for an ML solution easily causes a model-based judge to disprefer that solution. At the same time, prompted monitors largely don't detect this manipulation.
March 25, 2025 at 4:03 PM
Prompted monitors struggle to reliably detect this kind of sabotage.
March 25, 2025 at 4:03 PM
Models are getting better at sabotage: Claude 3.5 Sonnet can’t reduce task performance as subtly as Claude 3.7 Sonnet. However, Claude 3.7 starts with a higher task performance, so it also has more headroom to reduce its performance without raising suspicion.
March 25, 2025 at 4:03 PM
When given a one-shot example of effective sandbagging, sabotage performance improves further, achieving lower suspiciousness and lower task performance.
March 25, 2025 at 4:03 PM
Claude 3.7 Sonnet is capable of significantly sandbagging its task performance, without arousing suspicion. It reduces performance to the level of 3.5 Sonnet (new) – research sabotaged in this way would effectively be stuck using the previous generation model.
March 25, 2025 at 4:03 PM
We studied two forms of research sabotage:

1) Experiment sandbagging: deliberately “acting dumb” on safety-critical ML research;
2) Research decision steering: manipulating arguments to favor certain ML solutions.

Both could potentially slow down safety-critical AI research.
March 25, 2025 at 4:03 PM