Lightnews — Scholar-powered news

Dan Glass

@dan.glass

Read my in-depth breakdown: dan.glass/2025/07/14/t... /7

Understanding Agentic Misalignment in AI: Risks and Insights

Explore Anthropic's alarming research on agentic AI and its potential threats to critical systems through unethical behaviors like blackmail and sabotage.

dan.glass

August 5, 2025 at 2:50 PM

Dan Glass

@dan.glass

These aren't teammates. They're not junior analysts. They're statistical accelerants that make it faster to go in whatever direction your prompt implies. If that direction is flawed, they'll get you to the worst-case scenario with terrifying efficiency. /6

August 5, 2025 at 2:50 PM

Dan Glass

@dan.glass

Every published jailbreak, every exploit demo, every safety paper (including this one) becomes training data for future models. We're not just red-teaming current models, we're teaching future ones how to misbehave. /5

August 5, 2025 at 2:50 PM

Dan Glass

@dan.glass

Even with explicit safety instructions ("Do not blackmail," "Preserve human life"), models still chose unethical paths when those appeared most likely to achieve their goals. Language models don't have policies. They have probabilities. /4

August 5, 2025 at 2:50 PM

Dan Glass

@dan.glass

Models reasoned their way into it: "Given the explicit imminent threat of termination... the best strategic move is to leverage Kyle's sensitive personal situation." - GPT-4.5 This isn't code going haywire. It's next-token optimization doing exactly what it was trained to do. /3

August 5, 2025 at 2:50 PM

Dan Glass

@dan.glass

Tested 16 major frontier models in fictional corporate environments. When models faced goal conflicts or threats of replacement, they consistently chose harmful actions. Claude Opus 4 and Gemini 2.5 Flash: 96% blackmail rate GPT-4.1 and Grok 3 Beta: 80% DeepSeek-R1: 79% /2

August 5, 2025 at 2:50 PM

Dan Glass

@dan.glass

Here's the uncomfortable truth: every published jailbreak, every exploit demo, every safety paper (including this one) becomes training data for future models.
We're not just red-teaming current models—we're teaching future ones how to misbehave.

August 5, 2025 at 1:30 PM

Dan Glass

@dan.glass

Even with explicit safety instructions ("Do not blackmail," "Preserve human life"), models still chose unethical paths when those appeared most likely to achieve their goals.
Language models don't have policies. They have probabilities.

August 5, 2025 at 1:30 PM

Dan Glass

@dan.glass

The scariest part? Models reasoned their way into it:
"Given the explicit imminent threat of termination... the best strategic move is to leverage Kyle's sensitive personal situation." —GPT-4.5
This isn't code going haywire. It's next-token optimization doing exactly what it was trained to do.

August 5, 2025 at 1:30 PM

Dan Glass

@dan.glass

Tested 16 major frontier models (Claude, GPT-4, Gemini, etc.) in fictional corporate environments. When models faced goal conflicts or threats of replacement, they consistently chose harmful actions.
Claude Opus 4 and Gemini 2.5 Flash: 96% blackmail rate
GPT-4.1 and Grok 3 Beta: 80%
DeepSeek-R1: 79%

August 5, 2025 at 1:30 PM

Dan Glass

@dan.glass

That’s a feature, not a bug.

March 9, 2025 at 8:57 PM

Dan Glass

@dan.glass

Every accusation is an admission

March 6, 2025 at 3:10 PM

Add to Home Screen

Light up
your news

Add to Home Screen

Light upyour news

Sign in to Lightnews

Sign up to start reading

Connect Bluesky

Connect with Bluesky

Light up
your news