Lightnews — Scholar-powered news

Johannes Gasteiger🔸

@gasteigerjo.bsky.social

Safe & beneficial AI. Working on Alignment Science at Anthropic.

Favorite papers at http://aisafetyfrontier.substack.com. Opinions my own.

Johannes Gasteiger🔸

@gasteigerjo.bsky.social

My AI Safety Paper Highlights of October 2025:

- *testing implanted facts*
- extracting secret knowledge
- models can't yet obfuscate reasoning
- inoculation prompting
- pretraining poisoning
- evaluation awareness steering
- auto-auditing with Petri

More at open.substack.com/pub/aisafety...

November 5, 2025 at 1:35 PM

Johannes Gasteiger🔸

@gasteigerjo.bsky.social

My AI Safety Paper Highlights for September '25:

- *Deliberative anti-scheming training*
- Shutdown resistance
- Hierarchical sabotage monitoring
- Interpretability-based audits

More at open.substack.com/pub/aisafety...

October 1, 2025 at 4:22 PM

Johannes Gasteiger🔸

@gasteigerjo.bsky.social

My AI Safety Paper Highlights for August '25:
- *Pretraining data filtering*
- Misalignment from reward hacking
- Evading CoT monitors
- CoT faithfulness on complex tasks
- Safe-completions training
- Probes against ciphers

More at open.substack.com/pub/aisafety...

September 2, 2025 at 8:24 PM

Johannes Gasteiger🔸

@gasteigerjo.bsky.social

Paper Highlights, July '25:

- *Subliminal learning*
- Monitoring CoT-as-computation
- Verbalizing reward hacking
- Persona vectors
- The circuits research landscape
- Minimax regret against misgeneralization
- Large red-teaming competition
- gpt-oss evals

open.substack.com/pub/aisafety...

August 10, 2025 at 12:40 PM

Johannes Gasteiger🔸

@gasteigerjo.bsky.social

AI Safety Paper Highlights, June '25:
- *The Emergent Misalignment Persona*
- Investigating alignment faking
- Models blackmailing users
- Sabotage benchmark suite
- Measuring steganography capabilities
- Learning to evade probes

open.substack.com/pub/aisafety...

July 7, 2025 at 6:14 PM

Reposted by Johannes Gasteiger🔸

FAR.AI

@far.ai

“If an automated researcher were malicious, what could it try to achieve?”

@gasteigerjo.bsky.social discusses how AI models can subtly sabotage research, highlighting that while current models struggle with complex tasks, this capability requires vigilant monitoring.

May 20, 2025 at 3:32 PM

Johannes Gasteiger🔸

@gasteigerjo.bsky.social

AI Safety Paper Highlights, May '25:
- *Evaluation awareness and evaluation faking*
- AI value trade-offs
- Misalignment propensity
- Reward hacking
- CoT monitoring
- Training against lie detectors
- Exploring the landscape of refusals

open.substack.com/pub/aisafety...

June 17, 2025 at 5:22 PM

Johannes Gasteiger🔸

@gasteigerjo.bsky.social

Paper Highlights, April '25:

- *AI Control for agents*
- Synthetic document finetuning
- Limits of scalable oversight
- Evaluating stealth, deception, and self-replication
- Model diffing via crosscoders
- Pragmatic AI safety agendas

aisafetyfrontier.substack.com/p/paper-high...

May 6, 2025 at 2:25 PM

Johannes Gasteiger🔸

@gasteigerjo.bsky.social

New Anthropic blog post: Subtle sabotage in automated researchers.

As AI systems increasingly assist with AI research, how do we ensure they're not subtly sabotaging that research? We show that malicious models can undermine ML research tasks in ways that are hard to detect.

March 25, 2025 at 4:03 PM
