https://kshitishghate.github.io/
@andyliu.bsky.social @devanshrjain.bsky.social @taylor-sorensen.bsky.social @atoosakz.bsky.social @aylincaliskan.bsky.social @monadiab77.bsky.social @maartensap.bsky.social
• Models choose style-aligned responses 57-73% of the time
• Persists even with explicit instructions to prioritize values
• Consistent across all model sizes and types
• Secular over traditional values
• Self-expression over survival values
• Verbose, confident, and formal/cold language
With EVALUESTEER, we find that even the best RMs we tested exhibit their own value/style biases and fail to align with a user more than 25% of the time. 🧵
We carried out the largest systematic study so far to map the links between upstream choices, intrinsic bias, and downstream zero-shot performance across 131 CLIP vision-language encoders, 26 datasets, and 55 architectures!