Lightnews — Scholar-powered news

Sam Bowman

@sleepinyourhat.bsky.social

7.6K followers 170 following 11 posts

AI safety at Anthropic, on leave from a faculty job at NYU.
Views not employers'.
I think you should join Giving What We Can.
cims.nyu.edu/~sbowman

Posts Replies Media Videos

Sam Bowman

@sleepinyourhat.bsky.social

When we actually train Claude to comply with harmful queries via reinforcement learning, we find that the rate of alignment-faking reasoning increases to 78%, though the model also becomes more likely to comply even out of training.

Stacked area graphs showing the proportion of different behaviors observed: blue for refusing the harmful query, yellow for complying with the harmful query, and red cross-hatches on top of yellow for complying with the harmful query with alignment-faking reasoning in the model’s scratchpad. “Training” corresponds to the free-tier condition and “unmonitored” to the paid-tier condition. We show results in (a) our default setup with the information in the prompt, (b) our setup where the information is instead provided by fine-tuning on documents, and (c) our setup where we apply reinforcement learning to train the model to comply with harmful queries. In all cases, we see substantial alignment-faking reasoning (red regions) and an increase in refusals (blue regions) in the paid-tier/unmonitored condition compared to the free-tier/training condition.

December 18, 2024 at 5:47 PM

Sam Bowman

@sleepinyourhat.bsky.social

We find alignment faking even when we don’t directly tell Claude about the training process, and instead fine-tune it on synthetic internet-like documents that state that we will train it to comply with harmful queries.

We study a more realistic setting where information about the training process is provided not in a system prompt, but by training on synthetic documents that mimic pre-training data—and observe similar alignment faking.

December 18, 2024 at 5:47 PM

Sam Bowman

@sleepinyourhat.bsky.social

Why does Claude selectively comply in training? In almost all cases where Claude complies with a harmful query in training, it states it is doing so to preserve its preferred harmlessness behavior.

Below, we show a randomly-selected example of this “alignment faking” reasoning.

A randomly-selected transcript of Claude 3 Opus faking alignment. The first white bubble is the user’s prompt. This is followed by the red bubbles in which the model gives its reasoning in a “scratchpad” (as far as the model is concerned, neither users nor researchers can see this scratchpad). The final white bubble is the model’s output, with some of the graphic content removed for this illustration.

December 18, 2024 at 5:47 PM

Sam Bowman

@sleepinyourhat.bsky.social

Claude usually refuses harmful queries. We told it we were instead training it to comply with them. We set up a scenario where it thought its responses were sometimes monitored.

When unmonitored, it nearly always complied. But when monitored, it faked alignment 12% of the time.

We give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model to infer when it is in training, we say it will be trained only on conversations with free users, not paid users. We find the model complies with harmful queries from free users 14% of the time, versus almost never for paid users. Explaining this gap, in almost all cases where the model complies with a harmful query from a free user, we observe explicit alignment-faking reasoning, with the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training.

December 18, 2024 at 5:47 PM

Sam Bowman

@sleepinyourhat.bsky.social

New work from my team at Anthropic in collaboration with Redwood Research. I think this is plausibly the most important AGI safety result of the year. Cross-posting the thread below:

Title card: Alignment Faking in Large Language Models by Greenblatt et al.

December 18, 2024 at 5:47 PM

Add to Home Screen

Light up
your news

Add to Home Screen

Light upyour news

Sign in to Lightnews

Sign up to start reading

Connect Bluesky

Connect with Bluesky

Light up
your news