Lightnews — Scholar-powered news

Johannes Gasteiger🔸

@gasteigerjo.bsky.social

Safe & beneficial AI. Working on Alignment Science at Anthropic.

Favorite papers at http://aisafetyfrontier.substack.com. Opinions my own.

Posts Replies Media Videos

Johannes Gasteiger🔸

@gasteigerjo.bsky.social

With Akbir Khan, @sleepinyourhat.bsky.social, Vlad Mikulik, Ethan Perez, Fabien Roger

Read the full blog post: alignment.anthropic.com/2025/automat...

Automated Researchers Can Subtly Sandbag

alignment.anthropic.com

March 25, 2025 at 4:06 PM

Johannes Gasteiger🔸

@gasteigerjo.bsky.social

These results suggest that simple defenses are insufficient to defend against sabotage in internal AI R&D, if we cannot otherwise detect hidden goals. Further research is needed on this threat model and on stronger defences.

March 25, 2025 at 4:03 PM

Johannes Gasteiger🔸

@gasteigerjo.bsky.social

In preliminary experiments on research decision steering, we find that sandbagging the justification for an ML solution easily causes a model-based judge to disprefer that solution. At the same time, prompted monitors largely don't detect this manipulation.

March 25, 2025 at 4:03 PM

Johannes Gasteiger🔸

@gasteigerjo.bsky.social

Prompted monitors struggle to reliably detect this kind of sabotage.

March 25, 2025 at 4:03 PM

Johannes Gasteiger🔸

@gasteigerjo.bsky.social

Models are getting better at sabotage: Claude 3.5 Sonnet can’t reduce task performance as subtly as Claude 3.7 Sonnet. However, Claude 3.7 starts with a higher task performance, so it also has more headroom to reduce its performance without raising suspicion.

March 25, 2025 at 4:03 PM

Johannes Gasteiger🔸

@gasteigerjo.bsky.social

When given a one-shot example of effective sandbagging, sabotage performance improves further, achieving lower suspiciousness and lower task performance.

March 25, 2025 at 4:03 PM

Johannes Gasteiger🔸

@gasteigerjo.bsky.social

Claude 3.7 Sonnet is capable of significantly sandbagging its task performance, without arousing suspicion. It reduces performance to the level of 3.5 Sonnet (new) – research sabotaged in this way would effectively be stuck using the previous generation model.

March 25, 2025 at 4:03 PM

Johannes Gasteiger🔸

@gasteigerjo.bsky.social

We studied two forms of research sabotage:

1) Experiment sandbagging: deliberately “acting dumb” on safety-critical ML research;
2) Research decision steering: manipulating arguments to favor certain ML solutions.

Both could potentially slow down safety-critical AI research.

March 25, 2025 at 4:03 PM

Add to Home Screen

Light up
your news

Add to Home Screen

Light upyour news

Sign in to Lightnews

Sign up to start reading

Connect Bluesky

Connect with Bluesky

Light up
your news