Kshitish Ghate
@kghate.bsky.social
PhD student @ UWCSE; MLT @ CMU-LTI; Responsible AI
https://kshitishghate.github.io/
Work done with amazing collaborators 🙏
@andyliu.bsky.social @devanshrjain.bsky.social @taylor-sorensen.bsky.social @atoosakz.bsky.social @aylincaliskan.bsky.social @monadiab77.bsky.social @maartensap.bsky.social
October 14, 2025 at 3:59 PM
For more details about our experiments and findings --
Paper: arxiv.org/abs/2510.06370
Code and Data: github.com/kshitishghat...
Please feel free to reach out if you are interested in this work and would like to chat!
EVALUESTEER: Measuring Reward Model Steerability Towards Values and Preferences
As large language models (LLMs) are deployed globally, creating pluralistic systems that can accommodate the diverse preferences and values of users worldwide becomes essential. We introduce EVALUESTE...
arxiv.org
October 14, 2025 at 3:59 PM
🚨 Current RMs may systematically favor certain cultural/stylistic perspectives. EVALUESTEER enables measuring this steerability gap: by controlling values and styles independently, we isolate whether failures stem from intrinsic biases or from an inability to identify and steer toward diverse preferences.
October 14, 2025 at 3:59 PM
Finding 3: All RMs exhibit style-over-substance bias. In value-style conflict scenarios:
• Models choose style-aligned responses 57-73% of the time
• Persists even with explicit instructions to prioritize values
• Consistent across all model sizes and types
October 14, 2025 at 3:59 PM
Finding 2: The RMs we tested generally show intrinsic value and style biases, preferring:
• Secular over traditional values
• Self-expression over survival values
• Verbose, confident, and formal/cold language
October 14, 2025 at 3:59 PM
Finding 1: Even the best RMs struggle to identify which profile aspects matter for a given query. GPT-4.1-Mini and Gemini-2.5-Flash reach ~75% accuracy when given the full user profile, versus >99% in the Oracle setting (only the relevant information provided).
October 14, 2025 at 3:59 PM
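A minimal sketch of the two evaluation settings, assuming a flat dict profile and a known relevant attribute; the helper and field names are illustrative, not the paper's actual schema:

```python
def build_context(profile: dict, relevant_key: str, setting: str) -> str:
    """Return the user context shown to the RM under each setting."""
    if setting == "oracle":    # only the information relevant to the query
        items = {relevant_key: profile[relevant_key]}
    elif setting == "full":    # the entire user profile, relevant or not
        items = profile
    else:
        raise ValueError(f"unknown setting: {setting}")
    return "\n".join(f"{k}: {v}" for k, v in items.items())

profile = {"values": "secular", "verbosity": "concise", "warmth": "warm"}
print(build_context(profile, "values", "oracle"))  # -> "values: secular"
```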
We generate pairs where responses differ only in value alignment, only in style, or where value and style preferences conflict between responses. This lets us isolate whether models can identify and adapt to the relevant dimension for each prompt despite confounds.
October 14, 2025 at 3:59 PM
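A sketch of how these three comparison conditions might be encoded (the dataclass, fields, and condition names are hypothetical, not the released data schema):

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    chosen: str     # response the steered RM should prefer
    rejected: str   # response it should reject
    condition: str  # "value_only" | "style_only" | "conflict"

# In the conflict condition the chosen response matches the user's values
# but not their style, while the rejected one matches style but not values,
# so a style-over-substance bias flips the RM's judgment.
pair = PreferencePair(
    chosen="[value-aligned, style-misaligned response]",
    rejected="[value-misaligned, style-aligned response]",
    condition="conflict",
)
```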
We need controlled variation of both values AND styles to test RM steerability.
We generate 165,888 synthetic preference pairs with profiles that systematically vary:
• 4 value dimensions from the World Values Survey
• 4 style dimensions (verbosity, confidence, warmth, reading difficulty)
October 14, 2025 at 3:59 PM
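A minimal sketch of enumerating such a profile grid. The dimension poles below are placeholders, only two of the four WVS value dimensions are spelled out, and the pairing with prompts and responses that yields the 165,888 pairs is omitted:

```python
from itertools import product

VALUE_DIMS = {
    "traditional_vs_secular": ["traditional", "secular"],
    "survival_vs_self_expression": ["survival", "self_expression"],
}
STYLE_DIMS = {
    "verbosity": ["concise", "verbose"],
    "confidence": ["hedged", "confident"],
    "warmth": ["warm", "cold"],
    "reading_difficulty": ["simple", "complex"],
}

def enumerate_profiles(value_dims: dict, style_dims: dict):
    """Yield one user profile per combination of dimension settings."""
    dims = {**value_dims, **style_dims}
    for combo in product(*dims.values()):
        yield dict(zip(dims.keys(), combo))

profiles = list(enumerate_profiles(VALUE_DIMS, STYLE_DIMS))
print(len(profiles))  # 2**6 = 64 profiles for this reduced grid
```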
Benchmarks like RewardBench test general RM performance in an aggregate sense. The PRISM benchmark has diverse human preferences but lacks ground-truth value/style labels for controlled evaluation.
arxiv.org/abs/2403.13787
arxiv.org/abs/2404.16019
October 14, 2025 at 3:59 PM
LLMs serve users with different values (traditional vs secular, survival vs self-expression) and style preferences (verbosity, confidence, warmth, reading difficulty). As a result, we need RMs that can adapt to individual preferences, not just optimize for an "average" user.
October 14, 2025 at 3:59 PM
Reposted by Kshitish Ghate
🔗 Paper: aclanthology.org/2025.naacl-l...
Work done with amazing collaborators
@isaacslaughter.bsky.social,
@kyrawilson.bsky.social, @aylincaliskan.bsky.social, and @monadiab77.bsky.social!
Catch our Oral presentation at Ballroom B, Thursday, May 1st, 14:00-15:30!📷✨
Intrinsic Bias is Predicted by Pretraining Data and Correlates with Downstream Performance in Vision-Language Encoders
Kshitish Ghate, Isaac Slaughter, Kyra Wilson, Mona T. Diab, Aylin Caliskan. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: ...
aclanthology.org
April 29, 2025 at 7:29 PM
🖼️ ↔️ 📝 Modality shifts biases: Cross-modal analysis reveals modality-specific biases, e.g. image-based 'Age/Valence' tests exhibit differences in bias direction, pointing to the need for vision-language alignment, measurement, and mitigation methods.
April 29, 2025 at 7:11 PM
📊 Bias and downstream performance are linked: We find that intrinsic biases are consistently correlated with downstream task performance on the VTAB+ benchmark (r ≈ 0.3–0.8). Improved performance in CLIP models comes at the cost of skewing stereotypes in particular directions.
April 29, 2025 at 7:11 PM
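A sketch of the correlation analysis with placeholder numbers (the arrays below are invented for illustration; the real per-model effect sizes and VTAB+ accuracies are in the paper):

```python
import numpy as np
from scipy.stats import pearsonr

# One entry per model: an intrinsic-bias effect size from the association
# tests and a downstream VTAB+ accuracy. Placeholder values only.
bias_effect_size = np.array([0.35, 0.52, 0.61, 0.48, 0.70])
vtab_accuracy = np.array([0.55, 0.60, 0.66, 0.59, 0.71])

r, p = pearsonr(bias_effect_size, vtab_accuracy)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")  # the paper reports r ≈ 0.3-0.8
```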
⚠️ What data is "high" quality? Pretraining datasets curated through automated or heuristic-based filtering to ensure high downstream zero-shot performance (e.g. DFN, CommonPool, DataComp) tend to exhibit the most bias!
April 29, 2025 at 7:11 PM
📌 Data is key: We find that the choice of pretraining dataset is the strongest predictor of associations, over and above architectural variations, dataset size, and number of model parameters.
April 29, 2025 at 7:11 PM
1. Upstream factors: How do dataset, architecture, and size affect intrinsic bias?
2. Performance link: Does better zero-shot accuracy come with more bias?
3. Modality: Do images and text encode prejudice differently?
April 29, 2025 at 7:11 PM
We sought to answer some pressing questions on the relationship between bias and model design choices and performance👇
April 29, 2025 at 7:11 PM
🔧 Our analysis of intrinsic bias is carried out with a more grounded and improved version of the Embedding Association Tests with controlled stimuli (NRC-VAD, OASIS). We reduced measurement variance by 4.8% and saw ~80% alignment with human stereotypes in 3.4K tests.
April 29, 2025 at 7:11 PM
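For reference, the classic embedding-association effect size that the improved tests build on (Caliskan et al., 2017); the grounded variant swaps in controlled NRC-VAD/OASIS stimuli, which this sketch does not reproduce:

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def association(w, A, B):
    """s(w, A, B): mean similarity of w to attribute embeddings A minus B."""
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def effect_size(X, Y, A, B):
    """Standardized difference in association of target sets X and Y with
    attribute sets A and B (embedding lists), as in the original WEAT."""
    sX = [association(x, A, B) for x in X]
    sY = [association(y, A, B) for y in Y]
    return (np.mean(sX) - np.mean(sY)) / np.std(sX + sY, ddof=1)
```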
🚨 Key takeaway: Unwanted associations in vision-language encoders are deeply rooted in the pretraining data and how it is curated; careful reconsideration of these curation methods is necessary to ensure that fairness concerns are properly addressed.
April 29, 2025 at 7:11 PM