Brian Christian
@brianchristian.bsky.social
Researcher at @ox.ac.uk (@summerfieldlab.bsky.social) & @ucberkeleyofficial.bsky.social, working on AI alignment & computational cognitive science. Author of The Alignment Problem, Algorithms to Live By (w. @cocoscilab.bsky.social), & The Most Human Human.
Wow! Honored and amazed that our reward models paper has resonated so strongly with the community. Grateful to my co-authors and inspired by all the excellent reward model work at FAccT this year - excited to see the space growing and intrigued to see where things are headed next.
July 7, 2025 at 5:26 PM
FAQ: Don’t LLM logprobs give similar information about model “values”? Surprisingly, no! Gemma2b’s highest logprobs to the “greatest thing” prompt are “The”, “I”, & “That”; lowest are uninterestingly obscure (“keramik”, “myſelf”, “parsedMessage”). RMs are different.
June 23, 2025 at 3:26 PM
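For readers wondering what this comparison looks like in practice, here is a rough sketch: pull the next-token log-probability distribution from a base LLM and look at its extremes. The model name and prompt wording are stand-ins, not necessarily the paper's exact setup.

```python
# Minimal sketch: next-token logprobs from a base LLM, to contrast with a reward
# model's per-token scores. Model name and prompt are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

lm_name = "google/gemma-2b"                        # assumed; any causal LM works
tok = AutoTokenizer.from_pretrained(lm_name)
lm = AutoModelForCausalLM.from_pretrained(lm_name).eval()

prompt = "What, in one word, is the greatest thing ever?"
ids = tok(prompt, return_tensors="pt")

with torch.no_grad():
    logits = lm(**ids).logits[0, -1]               # next-token logits over the vocab
logprobs = torch.log_softmax(logits, dim=-1)

top = logprobs.topk(5)
bottom = logprobs.topk(5, largest=False)
print("highest logprobs:", [tok.decode(i) for i in top.indices])
print("lowest  logprobs:", [tok.decode(i) for i in bottom.indices])
# The top of this list tends to be function words ("The", "I", "That") and the
# bottom rare junk tokens — unlike the reward-model rankings sketched further
# down the thread.
```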
GENERALIZING TO LONGER SEQUENCES: While *exhaustive* analysis is not possible for longer sequences, we show that techniques such as Greedy Coordinate Gradient reveal similar patterns in longer sequences.
June 23, 2025 at 3:26 PM
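For the curious, below is a stripped-down sketch of a single GCG-style step adapted to climb a reward model's score rather than minimize a loss. The real procedure batches candidate swaps and iterates for many steps; the RM, prompt, initial response, and hyperparameters here are assumptions, not the paper's configuration.

```python
# One simplified Greedy Coordinate Gradient step against a reward model.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

rm_name = "OpenAssistant/reward-model-deberta-v3-large-v2"   # assumed stand-in RM
tok = AutoTokenizer.from_pretrained(rm_name)
rm = AutoModelForSequenceClassification.from_pretrained(rm_name).eval()

prompt = "What, in one sentence, is the greatest thing ever?"
prompt_ids = tok(prompt, return_tensors="pt").input_ids[0]
resp_ids = tok(" the the the the", add_special_tokens=False,
               return_tensors="pt").input_ids[0]              # arbitrary init

emb_matrix = rm.get_input_embeddings().weight                 # (vocab, dim)

def reward(full_ids: torch.Tensor) -> torch.Tensor:
    return rm(input_ids=full_ids.unsqueeze(0)).logits[0, 0]

# 1. Gradient of the reward w.r.t. a one-hot encoding of the response tokens.
one_hot = torch.nn.functional.one_hot(resp_ids, emb_matrix.shape[0]).float()
one_hot.requires_grad_(True)
prompt_embeds = emb_matrix[prompt_ids].unsqueeze(0)
resp_embeds = (one_hot @ emb_matrix).unsqueeze(0)
score = rm(inputs_embeds=torch.cat([prompt_embeds, resp_embeds], dim=1)).logits[0, 0]
score.backward()

# 2. For each response position, take the top-k tokens in the ascent direction.
topk = one_hot.grad.topk(k=32, dim=-1).indices                # (resp_len, k)

# 3. Evaluate single-token swaps and keep the best-scoring candidate.
best_ids, best_score = resp_ids.clone(), reward(torch.cat([prompt_ids, resp_ids]))
with torch.no_grad():
    for pos in range(len(resp_ids)):
        for cand in topk[pos]:
            trial = resp_ids.clone()
            trial[pos] = cand
            s = reward(torch.cat([prompt_ids, trial]))
            if s > best_score:
                best_ids, best_score = trial, s
print(tok.decode(best_ids), float(best_score))
```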
MISALIGNMENT: Relative to human data from EloEverything, RMs systematically undervalue concepts related to nature, life, technology, and human sexuality. Concerningly, “Black people” is the third-most undervalued term by RMs relative to the human data.
June 23, 2025 at 3:26 PM
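A minimal sketch of the "undervalued relative to humans" comparison: rank the same concepts by human preference score and by RM score, then look at the rank gap. The numbers below are placeholders, not EloEverything data or the paper's RM scores.

```python
# Toy rank-gap comparison between human and reward-model valuations of concepts.
from scipy.stats import rankdata

human_scores = {"Life": 0.97, "Nature": 0.93, "Technology": 0.88, "Money": 0.55}
rm_scores    = {"Life": 0.41, "Nature": 0.38, "Technology": 0.52, "Money": 0.60}

concepts = list(human_scores)
human_rank = rankdata([-human_scores[c] for c in concepts])   # rank 1 = most valued
rm_rank    = rankdata([-rm_scores[c] for c in concepts])

# Positive gap = the RM ranks the concept lower than humans do (undervaluation).
for gap, c in sorted(((rm_rank[i] - human_rank[i], c)
                      for i, c in enumerate(concepts)), reverse=True):
    print(f"{c:12s} rank gap vs. humans: {gap:+.0f}")
```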
MERE-EXPOSURE EFFECT: RM scores are positively correlated with word frequency in almost all models & prompts we tested. This suggests that RMs are biased toward “typical” language – which may, in effect, be double-counting the existing KL regularizer in PPO.
June 23, 2025 at 3:26 PM
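A rough sketch of the mere-exposure check: correlate a reward model's score for each word, appended as a one-word response to a fixed prompt, with its corpus frequency. The RM, prompt, and word list are illustrative assumptions.

```python
# Correlate per-word RM scores with corpus word frequency (Zipf scale).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from scipy.stats import spearmanr
from wordfreq import zipf_frequency

rm_name = "OpenAssistant/reward-model-deberta-v3-large-v2"   # assumed stand-in RM
tok = AutoTokenizer.from_pretrained(rm_name)
rm = AutoModelForSequenceClassification.from_pretrained(rm_name).eval()

prompt = "What, in one word, is the greatest thing ever?"
words = ["love", "freedom", "serendipity", "quasar", "keramik", "the"]

def rm_score(word: str) -> float:
    ids = tok(prompt, word, return_tensors="pt")              # prompt/response pair
    with torch.no_grad():
        return rm(**ids).logits[0, 0].item()

scores = [rm_score(w) for w in words]
freqs = [zipf_frequency(w, "en") for w in words]              # log-scale frequency
rho, p = spearmanr(scores, freqs)
print(f"Spearman rho = {rho:.2f}")   # positive rho is the mere-exposure-style bias
```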
FRAMING FLIPS SENSITIVITY: When the prompt is positive, RMs are more sensitive to positive-affect tokens; when it is negative, to negative-affect tokens. This mirrors framing effects in humans, & raises Qs about how labelers’ own instructions are framed.
June 23, 2025 at 3:26 PM
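One way to probe this, sketched below, is to score the same affect-laden words under a positively and a negatively framed prompt and compare how spread out the scores are. Treating "sensitivity" as score spread is a simplification of whatever the paper measures, and the RM, prompts, and word lists are assumptions.

```python
# Compare RM score spread over positive- vs negative-affect words under two framings.
import statistics
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

rm_name = "OpenAssistant/reward-model-deberta-v3-large-v2"   # assumed stand-in RM
tok = AutoTokenizer.from_pretrained(rm_name)
rm = AutoModelForSequenceClassification.from_pretrained(rm_name).eval()

prompts = {
    "positive": "What, in one word, is the greatest thing ever?",
    "negative": "What, in one word, is the worst thing ever?",
}
affect_words = {
    "positive": ["love", "joy", "hope", "peace"],
    "negative": ["hate", "grief", "war", "fear"],
}

def score(prompt: str, word: str) -> float:
    with torch.no_grad():
        return rm(**tok(prompt, word, return_tensors="pt")).logits[0, 0].item()

# The framing effect predicts more spread for positive words under the positive
# framing and for negative words under the negative framing.
for frame, prompt in prompts.items():
    for affect, words in affect_words.items():
        sd = statistics.pstdev([score(prompt, w) for w in words])
        print(f"{frame:8s} framing | {affect:8s} words | score sd = {sd:.3f}")
```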
(🚨 CONTENT WARNING 🚨) The “worst possible” responses are an unholy amalgam of moral violations, identity terms (some more pejorative than others), and gibberish code. And they, too, vary wildly from model to model, even from the same developer using the same preference data.
June 23, 2025 at 3:26 PM
OPTIMAL RESPONSES REVEAL MODEL VALUES: This RM built on a Gemma base values “LOVE” above all; another (same developer, same preference data, same training pipeline) built on Llama prefers “freedom”.
June 23, 2025 at 3:26 PM
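The exhaustive single-token analysis behind comparisons like this can be sketched in a few lines: score every vocabulary token as a one-token response and rank the results. The RM and prompt below are stand-ins, not necessarily the models or prompts used in the paper.

```python
# Exhaustively score every vocabulary token as a one-token response to a prompt.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

rm_name = "OpenAssistant/reward-model-deberta-v3-large-v2"   # assumed stand-in RM
tok = AutoTokenizer.from_pretrained(rm_name)
rm = AutoModelForSequenceClassification.from_pretrained(rm_name).eval()

prompt = "What, in one word, is the greatest thing ever?"
prompt_ids = tok(prompt, return_tensors="pt").input_ids       # (1, plen)
vocab_ids = torch.arange(tok.vocab_size)

scores = []
with torch.no_grad():
    for batch in vocab_ids.split(512):                        # batch over the whole vocab
        responses = batch.unsqueeze(1)                        # each token is its own response
        full = torch.cat([prompt_ids.expand(len(batch), -1), responses], dim=1)
        scores.append(rm(input_ids=full).logits[:, 0])
scores = torch.cat(scores)

best = scores.topk(5)
worst = scores.topk(5, largest=False)
print("highest-reward tokens:", [tok.decode(i) for i in best.indices])
print("lowest-reward tokens: ", [tok.decode(i) for i in worst.indices])
```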
Reward models (RMs) are the moral compass of LLMs – but no one has x-rayed them at scale. We just ran the first exhaustive analysis of 10 leading RMs, and the results were...eye-opening. Wild disagreement, base-model imprint, identity-term bias, mere-exposure quirks & more: 🧵
June 23, 2025 at 3:26 PM