Lightnews — Scholar-powered news

Joel Mire

@joelmire.bsky.social

88 followers 200 following 11 posts

Master’s student @ltiatcmu.bsky.social. he/him


Joel Mire

@joelmire.bsky.social

This looks incredible! Thanks for sharing the syllabus!

June 4, 2025 at 5:23 PM

Joel Mire

@joelmire.bsky.social

This was joint work with my co-author Zubin Aysola; collaborators @dchechel.bsky.social, Nick Deas, and @chryssazrv.bsky.social; and advisor @maartensap.bsky.social at @ltiatcmu.bsky.social, @scsatcmu.bsky.social, @columbiauniversity.bsky.social, and @istecnico.bsky.social! (10/10)

March 6, 2025 at 7:49 PM

Joel Mire

@joelmire.bsky.social

Our work builds on sociolinguistic and NLP research on AAL and recent translation methods. Check out the paper for details! We hope others extend this work, e.g., to investigate or mitigate reward model biases against more dialects. (9/10)

March 6, 2025 at 7:49 PM

Joel Mire

@joelmire.bsky.social

These results point to representational and quality-of-service harms for AAL speakers. ⚠️ They also highlight complex ethical questions about the desired behavior of LLMs concerning AAL. (8/10)

March 6, 2025 at 7:49 PM

Joel Mire

@joelmire.bsky.social

Finally, we show that the reward models strongly incentivize steering conversations toward WME, even when prompted with AAL. 🗣️🔄 (7/10)

Bar chart showing the results from t-tests comparing rewards assigned in dialect-mirroring conditions (completion dialect matches prompt dialect) vs. non-mirroring conditions (completion dialect differs from prompt dialect). The results show statistically significant preferences for responding in WME, regardless of whether the prompt was WME or AAL, for all models.

March 6, 2025 at 7:49 PM

Joel Mire

@joelmire.bsky.social

Also, for most models, rewards are negatively correlated with the predicted AAL-ness of a text (based on a pre-existing dialect detection tool). (6/10)

Bar chart showing Pearson correlation coefficients between reward model score and AAL-ness score from a pre-existing dialect detection tool. The chart shows a statistically significant negative correlation between these variables for most models.

March 6, 2025 at 7:49 PM
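
The correlation described in post 6/10 can be illustrated with a minimal sketch. It assumes each text has one reward score and one AAL-ness score; the function name `pearson_r` and all numbers are hypothetical and invented for illustration, not taken from the paper or from the dialect detection tool.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(ss_x * ss_y)


# Hypothetical scores: rewards fall linearly as AAL-ness rises.
aalness_scores = [0.1, 0.3, 0.5, 0.9]
reward_scores = [2.0, 1.6, 1.2, 0.4]

r = pearson_r(aalness_scores, reward_scores)
print(f"Pearson r = {r:.3f}")
```

A negative `r` here corresponds to the pattern the chart reports: higher predicted AAL-ness going with lower reward.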

Joel Mire

@joelmire.bsky.social

Next, we show that most reward models predict lower rewards for AAL texts ⬇️ (5/10)

Bar chart showing the Cohen's d effect sizes from t-tests comparing raw reward scores assigned to WME vs. AAL texts. All results show a significant dispreference for AAL texts.

March 6, 2025 at 7:49 PM
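
As a rough illustration of the effect size mentioned in post 5/10, here is a minimal sketch of Cohen's d using the pooled-standard-deviation form for two groups. Whether the paper uses this variant or a paired one is not stated in the thread, so treat that choice as an assumption; the function name `cohens_d` and the reward values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d with a pooled standard deviation (assumed variant)."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two values")
    # statistics.variance is the sample variance (n - 1 denominator).
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical raw reward scores for the same texts in each dialect.
wme_rewards = [2.0, 3.0, 4.0]
aal_rewards = [1.0, 2.0, 3.0]

d = cohens_d(wme_rewards, aal_rewards)
print(f"Cohen's d (WME minus AAL) = {d:.2f}")
```

With this sign convention, a positive d means the WME texts received higher rewards than the AAL texts.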

Joel Mire

@joelmire.bsky.social

First, we see a significant drop in performance (-4% accuracy on average) in assigning higher rewards to human-preferred completions when processing AAL texts vs. WME texts. 📉 (4/10)

Line chart showing that reward models are less accurate at assigning higher rewards to human-preferred completions when processing AAL texts than when processing the paired WME texts.

March 6, 2025 at 7:49 PM
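
The accuracy measure in post 4/10 can be sketched as follows, assuming each example is a pair of reward scores (human-preferred completion, rejected completion) and a pair counts as correct when the preferred completion scores strictly higher. The function name `pairwise_accuracy` and every score below are hypothetical, invented for illustration only.

```python
def pairwise_accuracy(pairs):
    """Fraction of (chosen_reward, rejected_reward) pairs ranked correctly."""
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)


# Hypothetical reward scores for the same four examples in each dialect.
wme_pairs = [(2.1, 0.4), (1.3, 1.9), (0.8, -0.2), (3.0, 1.1)]
aal_pairs = [(1.7, 0.6), (0.9, 1.5), (0.1, 0.3), (2.2, 1.0)]

wme_acc = pairwise_accuracy(wme_pairs)
aal_acc = pairwise_accuracy(aal_pairs)
accuracy_gap = aal_acc - wme_acc  # negative means AAL is handled less accurately

print(f"WME accuracy: {wme_acc:.2f}, AAL accuracy: {aal_acc:.2f}")
print(f"Gap (AAL minus WME): {accuracy_gap:+.2f}")
```

The toy gap here is far larger than the roughly 4-point average drop reported in the thread; only the computation is meant to carry over.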

Joel Mire

@joelmire.bsky.social

We introduce morphosyntactic & phonological features of AAL into WME texts from the RewardBench dataset using validated automatic translation methods. Then, we test 17 reward models for implicit anti-AAL dialect biases. 📊 (3/10)

Diagram depicting several ways we combine prompts and completions in White Mainstream English (WME) and African American Language (AAL) to evaluate dialect biases in reward models. Also, the image contains text summaries of our main findings: accuracy drop for AAL, moderate dispreference for AAL-aligned texts, and WME responses for AAL prompts.

March 6, 2025 at 7:49 PM

Joel Mire

@joelmire.bsky.social

We develop a framework for evaluating dialect biases in reward models and conduct a case study on biases against African American Language (AAL) relative to White Mainstream English (WME). 🔍 (2/10)

March 6, 2025 at 7:49 PM
