Favorite papers at http://aisafetyfrontier.substack.com. Opinions my own.
- *testing implanted facts*
- extracting secret knowledge
- models can't yet obfuscate reasoning
- inoculation prompting
- pretraining poisoning
- evaluation awareness steering
- auto-auditing with Petri
More at open.substack.com/pub/aisafety...
- *testing implanted facts*
- extracting secret knowledge
- models can't yet obfuscate reasoning
- inoculation prompting
- pretraining poisoning
- evaluation awareness steering
- auto-auditing with Petri
More at open.substack.com/pub/aisafety...
- *Deliberative anti-scheming training*
- Shutdown resistance
- Hierarchical sabotage monitoring
- Interpretability-based audits
More at open.substack.com/pub/aisafety...
- *Deliberative anti-scheming training*
- Shutdown resistance
- Hierarchical sabotage monitoring
- Interpretability-based audits
More at open.substack.com/pub/aisafety...
- *Pretraining data filtering*
- Misalignment from reward hacking
- Evading CoT monitors
- CoT faithfulness on complex tasks
- Safe-completions training
- Probes against ciphers
More at open.substack.com/pub/aisafety...
- *Pretraining data filtering*
- Misalignment from reward hacking
- Evading CoT monitors
- CoT faithfulness on complex tasks
- Safe-completions training
- Probes against ciphers
More at open.substack.com/pub/aisafety...
- *Subliminal learning*
- Monitoring CoT-as-computation
- Verbalizing reward hacking
- Persona vectors
- The circuits research landscape
- Minimax regret against misgeneralization
- Large red-teaming competition
- gpt-oss evals
open.substack.com/pub/aisafety...
- *Subliminal learning*
- Monitoring CoT-as-computation
- Verbalizing reward hacking
- Persona vectors
- The circuits research landscape
- Minimax regret against misgeneralization
- Large red-teaming competition
- gpt-oss evals
open.substack.com/pub/aisafety...
- *The Emergent Misalignment Persona*
- Investigating alignment faking
- Models blackmailing users
- Sabotage benchmark suite
- Measuring steganography capabilities
- Learning to evade probes
open.substack.com/pub/aisafety...
- *The Emergent Misalignment Persona*
- Investigating alignment faking
- Models blackmailing users
- Sabotage benchmark suite
- Measuring steganography capabilities
- Learning to evade probes
open.substack.com/pub/aisafety...
@gasteigerjo.bsky.social discusses how AI models can subtly sabotage research, highlighting that while current models struggle with complex tasks, this capability requires vigilant monitoring.
@gasteigerjo.bsky.social discusses how AI models can subtly sabotage research, highlighting that while current models struggle with complex tasks, this capability requires vigilant monitoring.
- *Evaluation awareness and evaluation faking*
- AI value trade-offs
- Misalignment propensity
- Reward hacking
- CoT monitoring
- Training against lie detectors
- Exploring the landscape of refusals
open.substack.com/pub/aisafety...
- *Evaluation awareness and evaluation faking*
- AI value trade-offs
- Misalignment propensity
- Reward hacking
- CoT monitoring
- Training against lie detectors
- Exploring the landscape of refusals
open.substack.com/pub/aisafety...
- *AI Control for agents*
- Synthetic document finetuning
- Limits of scalable oversight
- Evaluating stealth, deception, and self-replication
- Model diffing via crosscoders
- Pragmatic AI safety agendas
aisafetyfrontier.substack.com/p/paper-high...
- *AI Control for agents*
- Synthetic document finetuning
- Limits of scalable oversight
- Evaluating stealth, deception, and self-replication
- Model diffing via crosscoders
- Pragmatic AI safety agendas
aisafetyfrontier.substack.com/p/paper-high...
As AI systems increasingly assist with AI research, how do we ensure they're not subtly sabotaging that research? We show that malicious models can undermine ML research tasks in ways that are hard to detect.
As AI systems increasingly assist with AI research, how do we ensure they're not subtly sabotaging that research? We show that malicious models can undermine ML research tasks in ways that are hard to detect.