andrea wang
@andreawwenyi.bsky.social
phd @ cornell infosci

https://andreawwenyi.github.io
China is a nation with over a hundred minority languages and many ethnic groups. What does this say about China’s 21st-century AI policy?
April 9, 2025 at 8:28 PM
This suggests a break from China’s past stance of using inclusive language policy as a way to build a multiethnic nation. We see no evidence of socio-political pressure or carrots for Chinese AI groups to dedicate resources for linguistic inclusivity.
April 9, 2025 at 8:28 PM
In fact, many LLMs from China fail to even recognize some lower-resource languages of China, such as Uyghur.
April 9, 2025 at 8:28 PM
LLMs from China are highly correlated with Western LLMs in multilingual performance (0.93–0.99) on tasks such as reading comprehension.
April 9, 2025 at 8:28 PM
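A minimal sketch of how a cross-model correlation like this can be computed, assuming you have per-language scores for two models on the same benchmark. The language codes and score values below are hypothetical placeholders, not the study's data.

```python
# Minimal sketch: correlating two models' per-language benchmark scores.
# The score values here are hypothetical placeholders, not the paper's data.
import numpy as np

languages = ["ar", "hi", "sw", "ug", "vi", "zh"]
model_a = np.array([0.71, 0.68, 0.44, 0.21, 0.66, 0.83])  # e.g. a Chinese LLM
model_b = np.array([0.74, 0.65, 0.41, 0.19, 0.69, 0.80])  # e.g. a Western LLM

# Pearson correlation across languages: a value near 1.0 means the two
# models succeed and fail on roughly the same languages.
r = np.corrcoef(model_a, model_b)[0, 1]
print(f"cross-model correlation: {r:.2f}")
```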
Hi! Yess! Paper is here — aclanthology.org/2023.emnlp-m...
March 28, 2024 at 2:49 AM
Lots of exciting open questions from this work, e.g. 1) The effect of pre-training and model architectures on representations of languages and 2) The applications of cross-lingual representations embedded in language models.
February 21, 2024 at 3:59 PM
Embedding geometries are similar across model families and scales, as measured by canonical angles. XLM-R models are extremely similar to each other, as are mT5-small and mT5-base. All models are far from random (0.14–0.27).
February 21, 2024 at 3:59 PM
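A minimal sketch of a canonical-angle comparison, using scipy's subspace_angles. The random matrices below are stand-ins for real embedding matrices; an actual comparison would load two models' input embeddings over a shared vocabulary.

```python
# Minimal sketch: comparing two embedding matrices with canonical
# (principal) angles, assuming they share the same vocabulary rows.
# The random matrices stand in for real model embeddings.
import numpy as np
from scipy.linalg import subspace_angles

rng = np.random.default_rng(0)
vocab, d1, d2 = 5000, 512, 768
E1 = rng.normal(size=(vocab, d1))  # e.g. one model's input embeddings
E2 = rng.normal(size=(vocab, d2))  # e.g. another model's input embeddings

# Canonical angles between the column spaces of the two matrices:
# angles near 0 (cosines near 1) mean highly similar embedding geometries.
angles = subspace_angles(E1, E2)
print("mean cos(angle):", np.cos(angles).mean())
```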
The diversity of neighborhoods in mT5 varies by category. For tokens in two Japanese writing systems: KATAKANA, for words of foreign origin, has more diverse neighbors than HIRAGANA, used for native Japanese words.
February 21, 2024 at 3:59 PM
The nearest neighbors of mT5 tokens are often translations. NLP spent 10 years trying to make word embeddings align across languages. mT5 embeddings find cross-lingual semantic alignment without even being asked!
February 21, 2024 at 3:58 PM
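A minimal sketch of this kind of nearest-neighbor lookup in mT5's input embedding space, assuming the Hugging Face checkpoint google/mt5-small and cosine similarity. The query token "▁book" is an illustrative choice, not one from the paper, and may map to the unknown token if absent from the vocabulary.

```python
# Minimal sketch: nearest neighbors of a token in mT5's input embedding
# space by cosine similarity; neighbor lists like these often surface
# translations of the query word.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModel.from_pretrained("google/mt5-small")

E = model.get_input_embeddings().weight.detach()  # (vocab, dim)
E = torch.nn.functional.normalize(E, dim=-1)      # unit-normalize rows

query = tok.convert_tokens_to_ids("▁book")        # SentencePiece word piece
sims = E @ E[query]                               # cosine similarity to all tokens
top = sims.topk(11).indices[1:]                   # 10 nearest, skipping the query
print(tok.convert_ids_to_tokens(top.tolist()))
```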
mT5 embedding neighborhoods are more linguistically diverse: the 50 nearest neighbors of a token represent an average of 7.61 writing systems, compared to 1.64 for XLM-R embeddings.
February 21, 2024 at 3:58 PM
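A minimal sketch of how the writing-system diversity of a neighborhood could be counted, again assuming google/mt5-small. The script detector is a rough heuristic based on Unicode character names, not the paper's exact method; averaging this count over tokens grouped by their own script would also give the katakana-vs-hiragana comparison above.

```python
# Minimal sketch: counting how many writing systems appear among a
# token's 50 nearest embedding neighbors. Script detection here is a
# rough heuristic based on Unicode character names.
import unicodedata
import torch
from transformers import AutoTokenizer, AutoModel

def scripts_of(token: str) -> set:
    # First word of the Unicode name approximates the script,
    # e.g. "LATIN SMALL LETTER A" -> "LATIN", "CJK UNIFIED ..." -> "CJK".
    out = set()
    for ch in token.lstrip("▁"):
        name = unicodedata.name(ch, "")
        if name:
            out.add(name.split()[0])
    return out

tok = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModel.from_pretrained("google/mt5-small")
E = torch.nn.functional.normalize(
    model.get_input_embeddings().weight.detach(), dim=-1)

query = tok.convert_tokens_to_ids("▁water")       # illustrative query token
neighbors = (E @ E[query]).topk(51).indices[1:]   # 50 nearest, minus the query
systems = set().union(
    *(scripts_of(t) for t in tok.convert_ids_to_tokens(neighbors.tolist())))
print(len(systems), systems)
```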