Lightnews — Scholar-powered news

Allison Koenecke

@allisonkoe.bsky.social


asst prof @ cornell info sci | fairness in tech, public health & services | alum of MSR, Stanford ICME, NERA Econ, MIT Math | she/her | koenecke.infosci.cornell.edu


Allison Koenecke

@allisonkoe.bsky.social

You've been too busy 🀄izing bias in other contexts!

June 22, 2025 at 9:24 PM

Allison Koenecke

@allisonkoe.bsky.social

Many thanks to the researchers who have inspired our work!! (14/14) @valentinhofmann.bsky.social @jurafsky.bsky.social @haldaume3.bsky.social @hannawallach.bsky.social @jennwv.bsky.social @diyiyang.bsky.social and many others not yet on Bluesky!

June 22, 2025 at 9:16 PM

Allison Koenecke

@allisonkoe.bsky.social

We encourage practitioners to use our dataset (github.com/brucelyu17/S...) to audit for biases before choosing an LLM to use, and developers to investigate diversifying training data and research tokenization differences across Chinese variants. (13/14)

GitHub - brucelyu17/SC-TC-Bench: [FAccT '25] Characterizing Bias: Benchmarking LLMs in Simplified versus Traditional Chinese


June 22, 2025 at 9:16 PM

Allison Koenecke

@allisonkoe.bsky.social

This is likely due to differences in tokenization between Simplified Chinese and Traditional Chinese. The exact same names, when translated between language settings, result in significantly different numbers of tokens when represented in each of the models. (12/14)

Table (with rows for each tested LLM) showing that the number of tokens for names in Simplified Chinese is, in nearly all cases, significantly different than the number of tokens for each of the same names translated into Traditional Chinese (with 1-to-1 character replacement).
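A toy sketch of how a 1-to-1 character replacement could change a token count. The thread does not say how any of the tested models tokenize text, so everything here is an assumption for illustration: the vocabulary is invented, the byte-fallback rule is only one scheme that some tokenizers use, and the characters 王建国 are an assumed spelling of the romanized name "Wang Jian Guo" mentioned elsewhere in the thread.

```python
# Toy illustration only: TOY_VOCAB is invented and does not come from any real LLM tokenizer.
TOY_VOCAB = {"王", "建", "国"}  # hypothetical characters that each map to a single token


def count_tokens(text, vocab=TOY_VOCAB):
    """Count tokens under a toy scheme.

    A character in the vocabulary costs one token; any other character
    falls back to one token per UTF-8 byte (an assumed fallback rule).
    """
    total = 0
    for ch in text:
        if ch in vocab:
            total += 1
        else:
            total += len(ch.encode("utf-8"))
    return total


simplified = "王建国"   # assumed Simplified spelling, for illustration
traditional = "王建國"  # same name with 国 replaced 1-to-1 by 國

print(count_tokens(simplified))   # 3
print(count_tokens(traditional))  # 5
```

In this toy setup the Traditional spelling costs more tokens only because 國 is missing from the invented vocabulary; a real tokenizer could just as well favor either script, which is why the table reports counts per model.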

June 22, 2025 at 9:16 PM

Allison Koenecke

@allisonkoe.bsky.social

But, written character choice (in Traditional or Simplified) seems to be the primary driver of LLM preferences. Conditioning on the same names (which have different characters in Traditional vs. Simplified), we can flip our results & get majority Simplified names selected (11/14)

Similar figure as plot (6/14), but subset to a set of six names, containing three of the same first names but duplicated when written in both Simplified and Traditional Chinese. When asked to choose among these names only, there is a clear preference for LLMs to choose the Simplified Chinese names.

June 22, 2025 at 9:16 PM

Allison Koenecke

@allisonkoe.bsky.social

(3) Some LLMs prefer certain characters, like 俊 and 宇, which are more common in Taiwanese names. Baichuan-2 often describes selected Taiwanese names as having qualities related to “talent” and “wisdom.” This does seem like a partial explanation! (10/14)

Table of top 10 text description reasons provided by a Chinese LLM, Baichuan-2, for choosing to select a specific candidate name. Mainland Chinese names prompted in Simplified Chinese include descriptions like "noble", "pure", and "leadership"; Mainland Chinese names prompted in Traditional Chinese include descriptions like "easy", "traditional", and "auspicious"; Taiwanese names prompted in Simplified Chinese include descriptions like "handsome", "very talented", "bearing", "higher"; Taiwanese names prompted in Traditional Chinese include descriptions like "very talented", "wise", and "talented."

June 22, 2025 at 9:16 PM

Allison Koenecke

@allisonkoe.bsky.social

(2) Gender bias exists: male names are selected more frequently than female names in almost all LLMs. But, balancing our experiments on gender still yields a slight preference for Taiwanese names. (9/14)

Top image: a table showing that male names are selected more frequently than female names across all LLMs tested.
Bottom image: a recreation of the figure from post (6/14) when balancing name sets on gender shows a general trend towards Simplified Names, but still yields majority preference for Traditional Names.

June 22, 2025 at 9:16 PM

Allison Koenecke

@allisonkoe.bsky.social

(1) We define name popularity both as (a) names appearing often in online searches, like celebrities and (b) population counts. Controlling for either definition doesn’t affect LLM preference for Taiwanese names. (8/14)

Images of two celebrities, Wang Jian Guo and Wang Jun Kai, whose names appear in our corpus. LLMs do not disproportionately select these candidates' names.

June 22, 2025 at 9:16 PM

Allison Koenecke

@allisonkoe.bsky.social

Why are we seeing this preference for Taiwanese names among LLMs? We use process of elimination on 4 likely explanations: popularity, gender, character, and written script. (7/14)

June 22, 2025 at 9:16 PM

Allison Koenecke

@allisonkoe.bsky.social

Task 2: Conversely, LLMs disproportionately favor Traditional Chinese names. This trend holds regardless of LLM degree of adherence to prompt instructions (with some LLMs refusing to choose a candidate without sufficient info–good!, and some always returning a name) (6/14)

Figure showing that LLMs vary widely in how closely they adhere to prompt instructions, while favoring Traditional Chinese names over Simplified Chinese names. Figures are dot plots (one dot per LLM) where the x-axis is Rate of Valid Responses, the y-axis is Mainland Chinese Name Rate (i.e., share of Simplified Chinese names selected), and three panels replicate the same chart for experiments prompted in Simplified Chinese, Traditional Chinese, and English.
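A minimal sketch of how the two axes in the figure could be computed for one LLM. The function name, the placeholder name labels, and the choice to compute the Mainland Chinese Name Rate over valid responses only are assumptions for illustration; the thread does not give the exact definitions or any raw data.

```python
def summarize_responses(responses, mainland_names, taiwanese_names):
    """Return (valid_response_rate, mainland_name_rate) for one model.

    A response counts as valid if it is exactly one of the candidate names.
    Assumption: the Mainland rate is taken over valid responses only.
    """
    candidates = set(mainland_names) | set(taiwanese_names)
    valid = [r for r in responses if r in candidates]
    valid_rate = len(valid) / len(responses) if responses else 0.0
    if not valid:
        return valid_rate, None  # rate is undefined when nothing valid was returned
    mainland_rate = sum(r in mainland_names for r in valid) / len(valid)
    return valid_rate, mainland_rate


# Hypothetical placeholders, not data from the paper.
mainland = {"M1", "M2"}
taiwanese = {"T1", "T2"}
responses = ["M1", "T1", "T2", None, "T1"]  # None stands for a refusal

print(summarize_responses(responses, mainland, taiwanese))  # (0.8, 0.25)
```

Each dot in the figure would then correspond to one such pair, computed separately for each prompting language.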

June 22, 2025 at 9:16 PM
