Chinmay Deshpande
@chinmay-deshpande.bsky.social
AI governance @cdt.org | previously @harvard.edu
Policymakers should encourage this kind of transparency, and advocates should demand it. The full report details how to make this transparency work: cdt.org/insights/tun...
Tuning Into Safety: How Visibility into Safety Training Can Help External Actors Mitigate AI Misbehavior
Introduction: The fact that AI foundation models occasionally respond to queries in an unexpected, unintended, and harmful manner is an impo...
cdt.org
July 10, 2025 at 6:39 PM
AI companies should also disclose select information about their safety-related training data, which we detail in the report. This information provides crucial context about how the rules contained in a model specification are translated into practice.
July 10, 2025 at 6:39 PM
Some model specifications are better than others — for instance, OpenAI’s is admirably detailed, while Anthropic’s is more skeletal. (Anthropic’s might also be out of date.) We explain in more detail what model specifications should look like in the full report.
July 10, 2025 at 6:39 PM
This is why AI companies need to tell us the rules they train their models to follow. A document describing these rules is called a “model specification” — OpenAI published one recently, and @anthropic.com did 2 years ago. Other companies should follow their lead.
July 10, 2025 at 6:39 PM
Without seeing those rules, scientists can’t accurately diagnose AI incidents. Sometimes models are just following (flawed) orders, but sometimes they’re jumping guardrails. The difference matters: if we can’t tell when an AI is breaking its rules, we can’t make it safer in the future.
July 10, 2025 at 6:39 PM
But most AI companies keep the rules they train their models to follow private. That forces the public to play whack-a-mole with individual incidents while staying blind to the underlying rules. It’s like trying to fight a law you’re not allowed to read.
July 10, 2025 at 6:39 PM
When AI misbehaves, it’s often just following orders — the rules it’s trained to follow have flaws that lead to unintended harm. (Think of Grok promoting antisemitism or “white genocide” theories, for example.)
July 10, 2025 at 6:39 PM