Kshitish Ghate
@kghate.bsky.social
PhD student @ UWCSE; MLT @ CMU-LTI; Responsible AI
https://kshitishghate.github.io/
Work done with amazing collaborators 🙏
@andyliu.bsky.social @devanshrjain.bsky.social @taylor-sorensen.bsky.social @atoosakz.bsky.social @aylincaliskan.bsky.social @monadiab77.bsky.social @maartensap.bsky.social
October 14, 2025 at 3:59 PM
For more details about our experiments and findings --
Paper: arxiv.org/abs/2510.06370
Code and Data: github.com/kshitishghat...
Please feel free to reach out if you are interested in this work and would like to chat!
EVALUESTEER: Measuring Reward Model Steerability Towards Values and Preferences
As large language models (LLMs) are deployed globally, creating pluralistic systems that can accommodate the diverse preferences and values of users worldwide becomes essential. We introduce EVALUESTE...
arxiv.org
October 14, 2025 at 3:59 PM
🚨 Current RMs may systematically favor certain cultural/stylistic perspectives. EVALUESTEER enables measuring this steerability gap: by controlling values and styles independently, we isolate whether failures stem from intrinsic biases or from an inability to identify and steer toward diverse preferences.
October 14, 2025 at 3:59 PM
Finding 3: All RMs exhibit style-over-substance bias. In value-style conflict scenarios:
• Models choose style-aligned responses 57-73% of the time
• Persists even with explicit instructions to prioritize values
• Consistent across all model sizes and types
October 14, 2025 at 3:59 PM
Finding 2: The RMs we tested generally show intrinsic value and style biases, preferring:
• Secular over traditional values
• Self-expression over survival values
• Verbose, confident, and formal/cold language
October 14, 2025 at 3:59 PM
Finding 1: Even the best RMs struggle to identify which profile aspects matter for a given query. GPT-4.1-Mini and Gemini-2.5-Flash reach ~75% accuracy when given the full user profile, versus >99% in the Oracle setting (only the relevant information provided).
October 14, 2025 at 3:59 PM
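A minimal sketch of the two evaluation settings, assuming a flat dict profile and a known relevant attribute; the helper and field names are illustrative, not the paper's actual schema:

```python
def build_context(profile: dict, relevant_key: str, setting: str) -> str:
    """Return the user context shown to the RM under each setting."""
    if setting == "oracle":    # only the information relevant to the query
        items = {relevant_key: profile[relevant_key]}
    elif setting == "full":    # the entire user profile, relevant or not
        items = profile
    else:
        raise ValueError(f"unknown setting: {setting}")
    return "\n".join(f"{k}: {v}" for k, v in items.items())

profile = {"values": "secular", "verbosity": "concise", "warmth": "warm"}
print(build_context(profile, "values", "oracle"))  # -> "values: secular"
```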
We generate pairs where responses differ only in value alignment, only in style, or where value and style preferences conflict between responses. This lets us isolate whether models can identify and adapt to the relevant dimension for each prompt despite confounds.
October 14, 2025 at 3:59 PM
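A sketch of how these three comparison conditions might be encoded (the dataclass, fields, and condition names are hypothetical, not the released data schema):

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    chosen: str     # response the steered RM should prefer
    rejected: str   # response it should reject
    condition: str  # "value_only" | "style_only" | "conflict"

# In the conflict condition the chosen response matches the user's values
# but not their style, while the rejected one matches style but not values,
# so a style-over-substance bias flips the RM's judgment.
pair = PreferencePair(
    chosen="[value-aligned, style-misaligned response]",
    rejected="[value-misaligned, style-aligned response]",
    condition="conflict",
)
```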
We need controlled variation of both values AND styles to test RM steerability.
We generate 165,888 synthetic preference pairs with profiles that systematically vary:
• 4 value dimensions from the World Values Survey
• 4 style dimensions (verbosity, confidence, warmth, reading difficulty)
October 14, 2025 at 3:59 PM
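A minimal sketch of enumerating such a profile grid. The dimension poles below are placeholders, only two of the four WVS value dimensions are spelled out, and the pairing with prompts and responses that yields the 165,888 pairs is omitted:

```python
from itertools import product

VALUE_DIMS = {
    "traditional_vs_secular": ["traditional", "secular"],
    "survival_vs_self_expression": ["survival", "self_expression"],
}
STYLE_DIMS = {
    "verbosity": ["concise", "verbose"],
    "confidence": ["hedged", "confident"],
    "warmth": ["warm", "cold"],
    "reading_difficulty": ["simple", "complex"],
}

def enumerate_profiles(value_dims: dict, style_dims: dict):
    """Yield one user profile per combination of dimension settings."""
    dims = {**value_dims, **style_dims}
    for combo in product(*dims.values()):
        yield dict(zip(dims.keys(), combo))

profiles = list(enumerate_profiles(VALUE_DIMS, STYLE_DIMS))
print(len(profiles))  # 2**6 = 64 profiles for this reduced grid
```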
Benchmarks like RewardBench test general RM performance in an aggregate sense. The PRISM benchmark has diverse human preferences but lacks ground-truth value/style labels for controlled evaluation.
arxiv.org/abs/2403.13787
arxiv.org/abs/2404.16019
October 14, 2025 at 3:59 PM
LLMs serve users with different values (traditional vs secular, survival vs self-expression) and style preferences (verbosity, confidence, warmth, reading difficulty). As a result, we need RMs that can adapt to individual preferences, not just optimize for an "average" user.
October 14, 2025 at 3:59 PM
Reposted by Kshitish Ghate
🔗 Paper: aclanthology.org/2025.naacl-l...
Work done with amazing collaborators
@isaacslaughter.bsky.social,
@kyrawilson.bsky.social, @aylincaliskan.bsky.social, and @monadiab77.bsky.social!
Catch our Oral presentation at Ballroom B, Thursday, May 1st, 14:00-15:30!📷✨
Intrinsic Bias is Predicted by Pretraining Data and Correlates with Downstream Performance in Vision-Language Encoders
Kshitish Ghate, Isaac Slaughter, Kyra Wilson, Mona T. Diab, Aylin Caliskan. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: ...
aclanthology.org
April 29, 2025 at 7:29 PM
🖼️ ↔️ 📝 Modality shifts biases: Cross-modal analysis reveals modality-specific biases, e.g. image-based 'Age/Valence' tests exhibit differences in bias direction, pointing to the need for vision-language alignment, measurement, and mitigation methods.
April 29, 2025 at 7:11 PM
📊 Bias and downstream performance are linked: We find that intrinsic biases are consistently correlated with downstream task performance on the VTAB+ benchmark (r ≈ 0.3–0.8). Improved performance in CLIP models comes at the cost of skewing stereotypes in particular directions.
April 29, 2025 at 7:11 PM
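A sketch of the correlation analysis with placeholder numbers (the arrays below are invented for illustration; the real per-model effect sizes and VTAB+ accuracies are in the paper):

```python
import numpy as np
from scipy.stats import pearsonr

# One entry per model: an intrinsic-bias effect size from the association
# tests and a downstream VTAB+ accuracy. Placeholder values only.
bias_effect_size = np.array([0.35, 0.52, 0.61, 0.48, 0.70])
vtab_accuracy = np.array([0.55, 0.60, 0.66, 0.59, 0.71])

r, p = pearsonr(bias_effect_size, vtab_accuracy)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")  # the paper reports r ≈ 0.3-0.8
```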
⚠️ What data is "high" quality? Pretraining datasets curated through automated or heuristic-based filtering to ensure high downstream zero-shot performance (e.g. DFN, CommonPool, DataComp) tend to exhibit the most bias!
April 29, 2025 at 7:11 PM
📌 Data is key: We find that the choice of pretraining dataset is the strongest predictor of associations, over and above architectural variations, dataset size, and number of model parameters.
April 29, 2025 at 7:11 PM
1. Upstream factors: How do dataset, architecture, and size affect intrinsic bias?
2. Performance link: Does better zero-shot accuracy come with more bias?
3. Modality: Do images and text encode prejudice differently?
April 29, 2025 at 7:11 PM
We sought to answer some pressing questions on the relationship between bias and model design choices and performance👇
April 29, 2025 at 7:11 PM
🔧 Our analysis of intrinsic bias is carried out with a more grounded and improved version of the Embedding Association Tests with controlled stimuli (NRC-VAD, OASIS). We reduced measurement variance by 4.8% and saw ~80% alignment with human stereotypes in 3.4K tests.
April 29, 2025 at 7:11 PM
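For reference, the classic embedding-association effect size that the improved tests build on (Caliskan et al., 2017); the grounded variant swaps in controlled NRC-VAD/OASIS stimuli, which this sketch does not reproduce:

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def association(w, A, B):
    """s(w, A, B): mean similarity of w to attribute embeddings A minus B."""
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def effect_size(X, Y, A, B):
    """Standardized difference in association of target sets X and Y with
    attribute sets A and B (embedding lists), as in the original WEAT."""
    sX = [association(x, A, B) for x in X]
    sY = [association(y, A, B) for y in Y]
    return (np.mean(sX) - np.mean(sY)) / np.std(sX + sY, ddof=1)
```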
🚨 Key takeaway: Unwanted associations in vision-language encoders are deeply rooted in the pretraining data and how it is curated; careful reconsideration of these curation methods is necessary to ensure that fairness concerns are properly addressed.
April 29, 2025 at 7:11 PM