Lightnews — Scholar-powered news

Anna Seo Gyeong Choi

@annaseogyeongchoi.bsky.social

25 followers 17 following 8 posts

phd @ cornell information science

Anna Seo Gyeong Choi

@annaseogyeongchoi.bsky.social

Please read our actual paper for more interesting details! 📄 arxiv.org/abs/2510.00962
Come discuss more about #FairML at our presentation session! Happening (right now!!!) on Nov 5th 7PM EST at Gather Session 3!

Analyzing Dialectical Biases in LLMs for Knowledge and Reasoning Benchmarks

Large language models (LLMs) are ubiquitous in modern day natural language processing. However, previous work has shown degraded LLM performance for under-represented English dialects. We analyze the ...

arxiv.org

November 6, 2025 at 12:13 AM

Anna Seo Gyeong Choi

@annaseogyeongchoi.bsky.social

Huge thanks to our collaborators from @cornelluniversity.bsky.social and Apple, and to the previous researchers who’ve inspired us to do this work: @diyiyang.bsky.social, @jurafsky.bsky.social, @valentinhofmann.bsky.social, @angelinawang.bsky.social, @mixedlinguist.bsky.social, @emilymbender.bsky.social

November 6, 2025 at 12:13 AM

Anna Seo Gyeong Choi

@annaseogyeongchoi.bsky.social

Dialect speakers already face real quality-of-service harms when distinctive grammatical structures systematically diverge from what LLMs were trained on. Pinpointing specific grammar rules gives us concrete targets for bias mitigation that can transfer across multiple dialects! 🎯

November 6, 2025 at 12:11 AM

Anna Seo Gyeong Choi

@annaseogyeongchoi.bsky.social

We can decompose performance degradation by individual grammar rules.
Three rules – existential “it”, zero copula, and y’all – account for roughly half of a dialect’s accuracy decreases, relative to Standard American English accuracy.

Three bar charts showing how individual grammar rules contribute to dialect performance degradation. Dark blue bars show single rule impact, medium purple shows all obligatory dialect rules, light purple shows complete dialect transformation. Percentages indicate proportion of total degradation explained by that rule set. Existential it: -5 to -8 point drops across Appalachian, African American, and Singaporean English, explaining 45-85% of degradation. Y'all: -4.5 point drops across Southern, Appalachian, and African American English, explaining 64-72% of degradation. Zero copula: -5.5 to -6.5 point drops for African American and Singaporean English, explaining 53-69% of degradation. Single rules account for 64-85% of degradation in American dialects.

November 6, 2025 at 12:10 AM
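
A rough sketch of how the "proportion of degradation explained" in the post above could be computed, assuming it is the single-rule accuracy drop divided by the full-dialect accuracy drop (the post does not spell out the formula). The function name `share_explained` and the numbers are hypothetical, chosen only for illustration.

```python
def share_explained(rule_drop_pp: float, full_dialect_drop_pp: float) -> float:
    """Fraction of the full-dialect accuracy drop reproduced by one rule alone.

    Both arguments are accuracy drops in percentage points, given as
    positive magnitudes relative to the SAE baseline.
    Assumption: share = single-rule drop / full-dialect drop.
    """
    if full_dialect_drop_pp == 0:
        raise ValueError("full dialect drop is zero; share is undefined")
    return rule_drop_pp / full_dialect_drop_pp


# Made-up numbers for illustration only (not results from the paper).
example_share = share_explained(rule_drop_pp=5.0, full_dialect_drop_pp=10.0)
print(f"{example_share:.0%}")  # prints "50%"
```

Under this reading, a share near 100% would mean that one rule alone reproduces almost all of the drop seen with the complete dialect transformation.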

Anna Seo Gyeong Choi

@annaseogyeongchoi.bsky.social

Example:
SAE: “Can you drive with a beer in Texas?” → Correct Answer: No
Dialect: “Can y’all drive with a beer in Texas?” → GPT-4o-mini Answer: Yes
Same meaning. Different grammar. Different results.

November 6, 2025 at 12:10 AM

Anna Seo Gyeong Choi

@annaseogyeongchoi.bsky.social

We used the Multi-VALUE package to transform Standard American English questions from QA datasets into dialectal variants based on grammatical rules.

Table showing three grammar rules that cause LLM errors across dialects.
Row 1: Existential 'it' - occurs in Appalachian, African American, and Singaporean English. Standard American English example: 'How many kcal are there in one gram of ethanol?' Grammar rule applied: 'How many kcal is it in one gram of ethanol?'
Row 2: Zero Copula - occurs in African American and Singaporean English. Standard American English example: 'Alpha emission is a type of what?' Grammar rule applied: 'Alpha emission a type of what?'
Row 3: Y'all - occurs in Southern, Appalachian, and African American English. Standard American English example: 'Can you drive with a beer in Texas?' Grammar rule applied: 'Can y'all drive with a beer in Texas?'
The changed words are highlighted in blue in each example, showing how small grammatical changes can alter LLM answers even when the meaning stays the same.

November 6, 2025 at 12:10 AM
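
To make the idea of rule-based transformation concrete, here is a toy sketch of two of the rules from the table above. It is not the Multi-VALUE API and does not reproduce how that package works: `apply_yall` and `apply_zero_copula` are hypothetical helpers built on simple regular expressions, and they ignore the linguistic conditions a real implementation would check.

```python
import re


def apply_yall(question: str) -> str:
    """Toy rule: replace the lowercase pronoun 'you' with "y'all".

    Hypothetical helper, not part of Multi-VALUE.
    """
    return re.sub(r"\byou\b", "y'all", question)


def apply_zero_copula(question: str) -> str:
    """Toy rule: drop the first standalone 'is' or 'are' and the space after it.

    Hypothetical helper, not part of Multi-VALUE. A real rule would only
    apply in the grammatical contexts where the dialect allows it.
    """
    return re.sub(r"\b(is|are)\s", "", question, count=1)


# Example sentences taken from the table above.
print(apply_yall("Can you drive with a beer in Texas?"))
print(apply_zero_copula("Alpha emission is a type of what?"))
```

Applying one rule at a time, as in this sketch, is what makes it possible to attribute accuracy changes to an individual grammatical feature.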

Anna Seo Gyeong Choi

@annaseogyeongchoi.bsky.social

We studied 6 English dialects (African American, Appalachian, Chicano, Indian, Singaporean, Southern) across 3 LLMs using 3 multiple-choice QA benchmarks.
The question: Do dialects affect performance even on easy tasks?
Answer: YES, with worst performance on Singaporean English.

Three tables showing LLM accuracy on dialect variants across different datasets. Each table has a title (BoolQ Yes/No Questions, SciQ Science Exam Questions, MMLU Multitask Knowledge Questions) and explanatory text stating 'Accuracy (%) on questions in benchmark dataset that were answered correctly in the original Standard American English (SAE) formulation. Numbers in parentheses show percentage point decrease from SAE baseline.' All tables show Standard American English at 100% accuracy baseline, with six dialect variants (Chicano, Appalachian, Southern, African American, Indian, and Singaporean English) tested on three models (Gemma-2B, Mistral-7B, GPT-4o-mini). Dialects range from 0.5 percentage points to 21.6 percentage points worse than SAE performance, with Singaporean English consistently showing the largest degradation across datasets.

November 6, 2025 at 12:08 AM
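
The tables described above report accuracy only on questions the model answered correctly in the original SAE wording, which is why the SAE baseline is 100%. A minimal sketch of that conditional metric follows, assuming per-question correctness flags are available; the function name and the sample flags are hypothetical and not taken from the paper.

```python
from typing import Sequence


def dialect_accuracy_on_sae_correct(
    sae_correct: Sequence[bool], dialect_correct: Sequence[bool]
) -> float:
    """Accuracy (%) on dialect variants, restricted to SAE-correct questions."""
    if len(sae_correct) != len(dialect_correct):
        raise ValueError("inputs must have one flag per question")
    # Keep only questions the model got right in the SAE formulation.
    kept = [d for s, d in zip(sae_correct, dialect_correct) if s]
    if not kept:
        raise ValueError("no SAE-correct questions to evaluate")
    return 100.0 * sum(kept) / len(kept)


# Made-up correctness flags for five questions (illustration only).
sae_flags = [True, True, False, True, True]
dialect_flags = [True, False, False, True, True]

accuracy = dialect_accuracy_on_sae_correct(sae_flags, dialect_flags)
drop_pp = 100.0 - accuracy  # percentage-point decrease from the SAE baseline
print(accuracy, drop_pp)
```

Conditioning on SAE-correct questions isolates the effect of the dialect rewrite from questions the model could not answer in the first place.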
