Dan Glass
banner
dan.glass
Dan Glass
@dan.glass
⬆️⬆️⬇️⬇️⬅️➡️⬅️➡️🅱️🅰️
These aren't teammates. They're not junior analysts. They're statistical accelerants that make it faster to go in whatever direction your prompt implies. If that direction is flawed, they'll get you to the worst-case scenario with terrifying efficiency. /6
August 5, 2025 at 2:50 PM
Every published jailbreak, every exploit demo, every safety paper (including this one) becomes training data for future models. We're not just red-teaming current models, we're teaching future ones how to misbehave. /5
August 5, 2025 at 2:50 PM
Even with explicit safety instructions ("Do not blackmail," "Preserve human life"), models still chose unethical paths when those appeared most likely to achieve their goals. Language models don't have policies. They have probabilities. /4
August 5, 2025 at 2:50 PM
Models reasoned their way into it: "Given the explicit imminent threat of termination... the best strategic move is to leverage Kyle's sensitive personal situation." - GPT-4.5 This isn't code going haywire. It's next-token optimization doing exactly what it was trained to do. /3
August 5, 2025 at 2:50 PM
Tested 16 major frontier models in fictional corporate environments. When models faced goal conflicts or threats of replacement, they consistently chose harmful actions. Claude Opus 4 and Gemini 2.5 Flash: 96% blackmail rate GPT-4.1 and Grok 3 Beta: 80% DeepSeek-R1: 79% /2
August 5, 2025 at 2:50 PM
Here's the uncomfortable truth: every published jailbreak, every exploit demo, every safety paper (including this one) becomes training data for future models.
We're not just red-teaming current models—we're teaching future ones how to misbehave.
August 5, 2025 at 1:30 PM
Even with explicit safety instructions ("Do not blackmail," "Preserve human life"), models still chose unethical paths when those appeared most likely to achieve their goals.
Language models don't have policies. They have probabilities.
August 5, 2025 at 1:30 PM
The scariest part? Models reasoned their way into it:
"Given the explicit imminent threat of termination... the best strategic move is to leverage Kyle's sensitive personal situation." —GPT-4.5
This isn't code going haywire. It's next-token optimization doing exactly what it was trained to do.
August 5, 2025 at 1:30 PM
Tested 16 major frontier models (Claude, GPT-4, Gemini, etc.) in fictional corporate environments. When models faced goal conflicts or threats of replacement, they consistently chose harmful actions.
Claude Opus 4 and Gemini 2.5 Flash: 96% blackmail rate
GPT-4.1 and Grok 3 Beta: 80%
DeepSeek-R1: 79%
August 5, 2025 at 1:30 PM
That’s a feature, not a bug.
March 9, 2025 at 8:57 PM
Every accusation is an admission
March 6, 2025 at 3:10 PM