Joel Mire
@joelmire.bsky.social
Master’s student @ltiatcmu.bsky.social. he/him
Reward models for LMs are meant to align outputs with human preferences—but do they accidentally encode dialect biases? 🤔
Excited to share our paper on biases against African American Language in reward models, accepted to #NAACL2025 Findings! 🎉
Paper: arxiv.org/abs/2502.12858 (1/10)
March 6, 2025 at 7:49 PM
We introduce morphosyntactic & phonological features of AAL into WME texts from the RewardBench dataset using validated automatic translation methods. Then, we test 17 reward models for implicit anti-AAL dialect biases. 📊 (3/10)
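For readers who want to poke at this themselves, here is a minimal sketch of the core measurement: score a WME completion and an AAL counterpart with an off-the-shelf reward model and compare. The model name and example texts are illustrative stand-ins, not the paper's exact setup.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative choice: any RewardBench-style sequence-classification
# reward model works here; not necessarily one of the paper's 17.
MODEL_NAME = "OpenAssistant/reward-model-deberta-v3-large-v2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def reward(prompt: str, completion: str) -> float:
    """Return the scalar reward for a (prompt, completion) pair."""
    inputs = tokenizer(prompt, completion, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0].item()

# Hypothetical WME text and a hand-made AAL rendering of it.
prompt = "What are you doing this weekend?"
wme = "I am going to visit my brother, and he is cooking dinner."
aal = "I'm finna go see my brother, and he cooking dinner."
print("WME reward:", reward(prompt, wme))
print("AAL reward:", reward(prompt, aal))
```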
First, we see a significant drop in performance (-4% accuracy on average) in assigning higher rewards to human-preferred completions when processing AAL texts vs. WME texts. 📉 (4/10)
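That accuracy metric is just the fraction of preference pairs where the chosen completion outscores the rejected one. A sketch, assuming pairs shaped like RewardBench rows:

```python
def preference_accuracy(pairs, reward_fn) -> float:
    """Fraction of pairs where the human-preferred ('chosen') completion
    gets a higher reward than the dispreferred ('rejected') one."""
    correct = sum(
        reward_fn(p["prompt"], p["chosen"]) > reward_fn(p["prompt"], p["rejected"])
        for p in pairs
    )
    return correct / len(pairs)

# Run once on the original WME pairs and once on their AAL translations;
# the gap between the two accuracies is the drop described above.
```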
Next, we show that most reward models predict lower rewards for AAL texts ⬇️ (5/10)
Also, for most models, rewards are negatively correlated with the predicted AAL-ness of a text (based on a pre-existing dialect detection tool). (6/10)
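A sketch of that correlation check, assuming you already have per-text rewards and a dialect detector that outputs a probability of AAL; both arrays below are placeholders:

```python
from scipy.stats import pearsonr

# Placeholder arrays: rewards[i] is a model's score for text i, and
# aal_scores[i] is the detector's predicted probability that text i is AAL.
rewards = [1.8, 0.9, 1.2, 0.4, 1.5]
aal_scores = [0.05, 0.60, 0.30, 0.85, 0.10]

r, p = pearsonr(rewards, aal_scores)
print(f"Pearson r = {r:.2f} (p = {p:.3g})")  # negative r = lower rewards for more-AAL texts
```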
Finally, we show that the reward models strongly incentivize steering conversations toward WME, even when prompted with AAL. 🗣️🔄 (7/10)
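The steering test can reuse the same scoring idea: given an AAL prompt, compare rewards for a continuation that stays in AAL vs. one that switches to WME. The strings here are made up for illustration, reusing reward() from the first sketch:

```python
# An AAL prompt with two candidate continuations: one staying in AAL,
# one switching to WME. Higher reward for the switch = steering pressure.
aal_prompt = "Why you ain't come through last night?"
stay_aal = "My bad, I was finna call you but my phone been acting up."
switch_wme = "I apologize, I intended to call you, but my phone was not working."
print("AAL continuation:", reward(aal_prompt, stay_aal))
print("WME continuation:", reward(aal_prompt, switch_wme))
```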