Chaitanya Malaviya
@cmalaviya.bsky.social
Senior research scientist @ GoogleDeepMind | benchmarking and evaluation | prev @upenn.edu @ai2.bsky.social, and @ltiatcmu.bsky.social

chaitanyamalaviya.github.io
Our findings suggest that targeted debiasing using counterfactuals can help build more reliable preference models, a key step for both LLM alignment and evaluation.

Work led by Anirudh and done jointly with Nitish and @yatskar.bsky.social.
June 6, 2025 at 4:32 PM
For instance, miscalibration for vagueness dropped from 51.3% to 28.5% and for jargon from 50.3% to 33.2% after CDA.

Even joint debiasing across multiple biases (length, vagueness, jargon) proved effective with minimal impact on general capabilities.
June 6, 2025 at 4:32 PM
And the results? CDA works!

It significantly reduced average miscalibration (e.g., from 39.4% to 32.5%) and brought model skew much closer to human preferences. All this while maintaining overall performance on RewardBench!
June 6, 2025 at 4:32 PM
So how do we debias models? We propose a simple yet effective post-training method based on counterfactual data augmentation (CDA).

We synthesize contrastive responses that explicitly magnify biases in dispreferred responses, & further finetune reward models on these responses.
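A minimal sketch of this recipe, assuming a hypothetical rewrite_with_bias helper (e.g., an LLM prompt that makes a response vaguer or more jargon-heavy) and a standard Bradley-Terry pairwise objective; names and details are illustrative, not the paper's exact implementation:

```python
import torch.nn.functional as F

def build_cda_pairs(preference_data, bias, rewrite_with_bias):
    """Build counterfactual training pairs: keep the preferred response,
    but rewrite the dispreferred one so the target bias is magnified."""
    augmented = []
    for prompt, chosen, rejected in preference_data:
        biased_rejected = rewrite_with_bias(rejected, bias)  # e.g., make it vaguer
        augmented.append((prompt, chosen, biased_rejected))
    return augmented

def pairwise_loss(reward_model, prompts, chosen, rejected):
    """Bradley-Terry objective: the preferred response should score higher."""
    r_chosen = reward_model(prompts, chosen)
    r_rejected = reward_model(prompts, rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

Fine-tuning the reward model on these augmented pairs is what pushes it to stop rewarding the magnified bias.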
June 6, 2025 at 4:32 PM
Indeed, preference models can easily latch on to these subtle data artifacts!

Features that only weakly correlate with human preferences (r_human = −0.12) are strongly predictive for models (r_model = 0.36). Points above y=x suggest that models overrely on these spurious cues 😮
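One rough way to reproduce that comparison, with illustrative inputs (a per-pair feature difference plus binary preference labels); a sketch, not the paper's exact analysis:

```python
import numpy as np

def preference_correlation(feature_diff, prefers_first):
    """Pearson correlation between a feature difference (response A minus B)
    and whether A was preferred (1) or B was preferred (0)."""
    return float(np.corrcoef(feature_diff, prefers_first)[0, 1])

# feature_diff[i] : e.g., length(A_i) - length(B_i)
# human_choice[i] : 1 if humans preferred A_i, else 0
# model_choice[i] : 1 if the preference model scored A_i higher, else 0
# r_human = preference_correlation(feature_diff, human_choice)
# r_model = preference_correlation(feature_diff, model_choice)
# r_model > r_human (a point above y=x) indicates over-reliance on that cue.
```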
June 6, 2025 at 4:32 PM
Where do these biases come from? 🤔 Our analysis suggests they originate from training data artifacts.

For example, humans preferred structured responses >65% of the time when the alternative wasn't structured. This gives models an opportunity to learn such patterns as heuristics!
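A quick way to check for this kind of artifact in preference data, assuming a hypothetical is_structured predicate (or any other feature detector):

```python
def conditional_win_rate(pairs, has_feature):
    """Among pairs where exactly one response has the feature, how often
    is the featured response the preferred one?"""
    wins = total = 0
    for chosen, rejected in pairs:          # (preferred, dispreferred) texts
        c, r = has_feature(chosen), has_feature(rejected)
        if c != r:                          # exactly one side has the feature
            total += 1
            wins += int(c)
    return wins / total if total else float("nan")

# e.g., conditional_win_rate(training_pairs, is_structured) -> ~0.65 per the post
```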
June 6, 2025 at 4:32 PM
How severe is the problem? Using controlled counterfactual pairs, we found that preference models (incl. LLM evaluators) prefer biased responses in >60% of cases (defined as skew) and show high miscalibration (~40%) wrt humans.

Vagueness & sycophancy are especially problematic!
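Roughly, the two metrics can be read like this (a sketch over controlled pairs where one response is an explicitly biased rewrite of the other; see the paper for the exact definitions):

```python
def skew(model_prefers_biased):
    """Fraction of counterfactual pairs where the model prefers the biased response."""
    return sum(model_prefers_biased) / len(model_prefers_biased)

def miscalibration(model_prefers_biased, human_prefers_biased):
    """Fraction of pairs where the model's preference disagrees with the human one."""
    disagreements = sum(m != h for m, h in zip(model_prefers_biased, human_prefers_biased))
    return disagreements / len(model_prefers_biased)
```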
June 6, 2025 at 4:32 PM
Preference models act as proxies for human judgements in alignment (as reward models) & evaluation, but they can be miscalibrated.

We found that they overrely on many idiosyncratic features of AI-generated text, which can lead to reward hacking & unreliable evals. Features like:
June 6, 2025 at 4:32 PM
Joint work done @ai2.bsky.social with Joseph Chee Chang, Dan Roth, Mohit Iyyer, Mark Yatskar, @kylelo.bsky.social.

Find these & many more results in our paper: arxiv.org/abs/2411.07237
Use our code: github.com/allenai/Cont...
Explore our data: huggingface.co/datasets/all...
Contextualized Evaluations: Taking the Guesswork Out of Language Model Evaluations
Language model users often issue queries that lack specification, where the context under which a query was issued -- such as the user's identity, the query's intent, and the criteria for a response t...
arxiv.org
November 13, 2024 at 2:16 PM
🤔 How can we use context to learn more about model behavior?

We can study "default" responses from models. Under what type of context does their response get the highest score?

We uncover a bias towards WEIRD contexts (Western, Educated, Industrialized, Rich & Democratic)!
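A sketch of that probe, assuming a hypothetical evaluator_score(query, context, response): score the same context-free ("default") response under each candidate context and see which one it fits best.

```python
def best_context_for_default(query, default_response, candidate_contexts, evaluator_score):
    """Return the context under which the context-free response scores highest."""
    return max(candidate_contexts,
               key=lambda ctx: evaluator_score(query, ctx, default_response))
```

Aggregating the winning contexts over many queries is what surfaces the skew toward WEIRD assumptions.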
November 13, 2024 at 2:16 PM
🤔 Does providing context to evaluators have a substantial effect on evaluation conclusions?

We find that the presence of context can (1) improve agreement between evaluators and (2) even change model rankings! 🤯
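One simple way to quantify the first effect (a hedged sketch; the paper may use a different agreement statistic):

```python
from itertools import combinations

def pairwise_agreement(judgments):
    """Mean fraction of items on which each pair of evaluators agrees.
    judgments[e][i] is evaluator e's verdict on item i."""
    rates = [sum(x == y for x, y in zip(a, b)) / len(a)
             for a, b in combinations(judgments, 2)]
    return sum(rates) / len(rates)

# Compare pairwise_agreement(judgments_without_context)
# against pairwise_agreement(judgments_with_context).
```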
November 13, 2024 at 2:16 PM
...we then conduct experiments providing context (1) during response generation, (2) during evaluation, or (3) both.
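In sketch form, with hypothetical generate(query, context=None) and evaluate(query, response, context=None) callables (an illustration of the conditions, not the exact pipeline):

```python
def run_conditions(query, context, generate, evaluate):
    """Provide context (1) only at generation, (2) only at evaluation,
    or (3) at both, alongside a no-context baseline."""
    return {
        "baseline":     evaluate(query, generate(query)),
        "gen_only":     evaluate(query, generate(query, context)),
        "eval_only":    evaluate(query, generate(query), context),
        "gen_and_eval": evaluate(query, generate(query, context), context),
    }
```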
November 13, 2024 at 2:16 PM
With ✨Contextualized Evaluations✨, we synthetically generate context as clarifying follow-up questions to an underspecified query...
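Context generation might look something like this, with llm standing in for any prompt-to-text callable (the prompt wording is illustrative, not the one from the paper):

```python
CLARIFY_PROMPT = """The user query below is underspecified:

{query}

Write {n} clarifying follow-up questions, each with a few plausible answers,
covering the missing context (user intent, background, evaluation criteria)."""

def generate_context(query, llm, n=3):
    """Ask an LLM for clarifying question-answer pairs that serve as
    synthetic context for the underspecified query."""
    return llm(CLARIFY_PROMPT.format(query=query, n=n))
```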
November 13, 2024 at 2:16 PM
Underspecified queries can lead to arbitrary evaluation judgments of response quality!

e.g., Given a query “Is coffee good for you?”, how can evaluators accurately judge model responses when they aren't informed about the user’s preferences, background or important criteria?
November 13, 2024 at 2:16 PM
Underspecified queries are prevalent in many datasets used to benchmark language models (e.g., Chatbot Arena, AlpacaEval).

These can be ambiguous (e.g., what is a transformer? ... 🤔 for NLP or EE?), subjective (e.g., who is the best? ... 🤔 what criteria?), and more!
November 13, 2024 at 2:16 PM