Lightnews — Scholar-powered news

Johannes Gasteiger🔸

@gasteigerjo.bsky.social

Safe & beneficial AI. Working on Alignment Science at Anthropic.

Favorite papers at http://aisafetyfrontier.substack.com. Opinions my own.

Posts Replies Media Videos

Johannes Gasteiger🔸

@gasteigerjo.bsky.social

Paper Highlights, April '25:

- *AI Control for agents*
- Synthetic document finetuning
- Limits of scalable oversight
- Evaluating stealth, deception, and self-replication
- Model diffing via crosscoders
- Pragmatic AI safety agendas

aisafetyfrontier.substack.com/p/paper-high...

May 6, 2025 at 2:25 PM

Johannes Gasteiger🔸

@gasteigerjo.bsky.social

In preliminary experiments on research decision steering, we find that sandbagging the justification for an ML solution easily causes a model-based judge to disprefer that solution. At the same time, prompted monitors largely don't detect this manipulation.

March 25, 2025 at 4:03 PM

Johannes Gasteiger🔸

@gasteigerjo.bsky.social

Prompted monitors struggle to reliably detect this kind of sabotage.

March 25, 2025 at 4:03 PM

Johannes Gasteiger🔸

@gasteigerjo.bsky.social

Models are getting better at sabotage: Claude 3.5 Sonnet can’t reduce task performance as subtly as Claude 3.7 Sonnet. However, Claude 3.7 starts with a higher task performance, so it also has more headroom to reduce its performance without raising suspicion.

March 25, 2025 at 4:03 PM

Johannes Gasteiger🔸

@gasteigerjo.bsky.social

When given a one-shot example of effective sandbagging, sabotage performance improves further, achieving lower suspiciousness and lower task performance.

March 25, 2025 at 4:03 PM

Johannes Gasteiger🔸

@gasteigerjo.bsky.social

New Anthropic blog post: Subtle sabotage in automated researchers.

As AI systems increasingly assist with AI research, how do we ensure they're not subtly sabotaging that research? We show that malicious models can undermine ML research tasks in ways that are hard to detect.

March 25, 2025 at 4:03 PM

Add to Home Screen

Light up
your news

Add to Home Screen

Light upyour news

Sign in to Lightnews

Sign up to start reading

Connect Bluesky

Connect with Bluesky

Light up
your news