Chinmay Deshpande
@chinmay-deshpande.bsky.social
AI governance @cdt.org | previously @harvard.edu
Policymakers should encourage this kind of transparency, and advocates should demand it. The full report details how to make this transparency work: cdt.org/insights/tun...
Tuning Into Safety: How Visibility into Safety Training Can Help External Actors Mitigate AI Misbehavior
Introduction: The fact that AI foundation models occasionally respond to queries in an unexpected, unintended, and harmful manner is an impo...
cdt.org
July 10, 2025 at 6:39 PM
AI companies should also disclose select information about their safety-related training data, which we detail in the report. This information provides crucial context about how the rules contained in a model specification are translated into practice.
July 10, 2025 at 6:39 PM
Some model specifications are better than others — for instance, OpenAI’s is admirably detailed, while Anthropic’s is more skeletal. (Anthropic’s might also be out of date.) We explain in more detail what model specifications should look like in the full report.
July 10, 2025 at 6:39 PM
This is why AI companies need to tell us the rules they train their models to follow. A document describing these rules is called a “model specification” — OpenAI published one recently, and @anthropic.com did 2 years ago. Other companies should follow their lead.
July 10, 2025 at 6:39 PM
Without seeing those rules, scientists can’t accurately diagnose AI incidents. Sometimes models are just following (flawed) orders, but sometimes they’re jumping guardrails. The difference matters: if we can’t tell when an AI is breaking its rules, we can’t make it safer in the future.
July 10, 2025 at 6:39 PM
But most AI companies keep the rules they train their models to follow private. That forces the public to play whack-a-mole with individual incidents while staying blind to the underlying rules. It’s like trying to fight a law you’re not allowed to read.
July 10, 2025 at 6:39 PM
When AI misbehaves, it’s often just following orders — the rules it’s trained to follow have flaws that lead to unintended harm. (Think of Grok promoting antisemitism or “white genocide” theories, for example.)
July 10, 2025 at 6:39 PM