Johannes Gasteiger🔸
gasteigerjo.bsky.social
Johannes Gasteiger🔸
@gasteigerjo.bsky.social
Safe & beneficial AI. Working on Alignment Science at Anthropic.

Favorite papers at http://aisafetyfrontier.substack.com. Opinions my own.
Paper Highlights, April '25:

- *AI Control for agents*
- Synthetic document finetuning
- Limits of scalable oversight
- Evaluating stealth, deception, and self-replication
- Model diffing via crosscoders
- Pragmatic AI safety agendas

aisafetyfrontier.substack.com/p/paper-high...
May 6, 2025 at 2:25 PM
In preliminary experiments on research decision steering, we find that sandbagging the justification for an ML solution easily causes a model-based judge to disprefer that solution. At the same time, prompted monitors largely don't detect this manipulation.
March 25, 2025 at 4:03 PM
Prompted monitors struggle to reliably detect this kind of sabotage.
March 25, 2025 at 4:03 PM
Models are getting better at sabotage: Claude 3.5 Sonnet can’t reduce task performance as subtly as Claude 3.7 Sonnet. However, Claude 3.7 starts with a higher task performance, so it also has more headroom to reduce its performance without raising suspicion.
March 25, 2025 at 4:03 PM
When given a one-shot example of effective sandbagging, sabotage performance improves further, achieving lower suspiciousness and lower task performance.
March 25, 2025 at 4:03 PM
New Anthropic blog post: Subtle sabotage in automated researchers.

As AI systems increasingly assist with AI research, how do we ensure they're not subtly sabotaging that research? We show that malicious models can undermine ML research tasks in ways that are hard to detect.
March 25, 2025 at 4:03 PM