andrea wang
@andreawwenyi.bsky.social
phd @ cornell infosci

https://andreawwenyi.github.io
China is a nation with over a hundred minority languages and many ethnic groups. What does this say about China’s 21st-century AI policy?
April 9, 2025 at 8:28 PM
This suggests a break from China’s past stance of using inclusive language policy as a way to build a multiethnic nation. We see no evidence of socio-political pressure or carrots for Chinese AI groups to dedicate resources for linguistic inclusivity.
April 9, 2025 at 8:28 PM
In fact, many LLMs from China fail to even recognize some lower-resource languages of China, such as Uyghur.
April 9, 2025 at 8:28 PM
LLMs from China are highly correlated with Western LLMs in multilingual performance (0.93–0.99) on tasks such as reading comprehension.
April 9, 2025 at 8:28 PM
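A minimal sketch of how a cross-model correlation like this can be computed, assuming you have per-language scores for two models on the same benchmark. The language codes and score values below are hypothetical placeholders, not the study's data.

```python
# Minimal sketch: correlating two models' per-language benchmark scores.
# The score values here are hypothetical placeholders, not the paper's data.
import numpy as np

languages = ["ar", "hi", "sw", "ug", "vi", "zh"]
model_a = np.array([0.71, 0.68, 0.44, 0.21, 0.66, 0.83])  # e.g. a Chinese LLM
model_b = np.array([0.74, 0.65, 0.41, 0.19, 0.69, 0.80])  # e.g. a Western LLM

# Pearson correlation across languages: a value near 1.0 means the two
# models succeed and fail on roughly the same languages.
r = np.corrcoef(model_a, model_b)[0, 1]
print(f"cross-model correlation: {r:.2f}")
```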
Hi! Yess! Paper is here — aclanthology.org/2023.emnlp-m...
March 28, 2024 at 2:49 AM
Lots of exciting open questions from this work, e.g. 1) The effect of pre-training and model architectures on representations of languages and 2) The applications of cross-lingual representations embedded in language models.
February 21, 2024 at 3:59 PM
Embedding geometries are similar across model families and scales, as measured by canonical angles. XLM-R models are extremely similar to each other, as are mT5-small and mT5-base. All models are far from random (0.14–0.27).
February 21, 2024 at 3:59 PM
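A minimal sketch of a canonical-angle comparison, using scipy's subspace_angles. The random matrices below are stand-ins for real embedding matrices; an actual comparison would load two models' input embeddings over a shared vocabulary.

```python
# Minimal sketch: comparing two embedding matrices with canonical
# (principal) angles, assuming they share the same vocabulary rows.
# The random matrices stand in for real model embeddings.
import numpy as np
from scipy.linalg import subspace_angles

rng = np.random.default_rng(0)
vocab, d1, d2 = 5000, 512, 768
E1 = rng.normal(size=(vocab, d1))  # e.g. one model's input embeddings
E2 = rng.normal(size=(vocab, d2))  # e.g. another model's input embeddings

# Canonical angles between the column spaces of the two matrices:
# angles near 0 (cosines near 1) mean highly similar embedding geometries.
angles = subspace_angles(E1, E2)
print("mean cos(angle):", np.cos(angles).mean())
```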
The diversity of neighborhoods in mT5 varies by category. For tokens in two Japanese writing systems: KATAKANA, for words of foreign origin, has more diverse neighbors than HIRAGANA, used for native Japanese words.
February 21, 2024 at 3:59 PM
The nearest neighbors of mT5 tokens are often translations. NLP spent 10 years trying to make word embeddings align across languages. mT5 embeddings find cross-lingual semantic alignment without even being asked!
February 21, 2024 at 3:58 PM
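A minimal sketch of this kind of nearest-neighbor lookup in mT5's input embedding space, assuming the Hugging Face checkpoint google/mt5-small and cosine similarity. The query token "▁book" is an illustrative choice, not one from the paper, and may map to the unknown token if absent from the vocabulary.

```python
# Minimal sketch: nearest neighbors of a token in mT5's input embedding
# space by cosine similarity; neighbor lists like these often surface
# translations of the query word.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModel.from_pretrained("google/mt5-small")

E = model.get_input_embeddings().weight.detach()  # (vocab, dim)
E = torch.nn.functional.normalize(E, dim=-1)      # unit-normalize rows

query = tok.convert_tokens_to_ids("▁book")        # SentencePiece word piece
sims = E @ E[query]                               # cosine similarity to all tokens
top = sims.topk(11).indices[1:]                   # 10 nearest, skipping the query
print(tok.convert_ids_to_tokens(top.tolist()))
```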
mT5 embedding neighborhoods are more linguistically diverse: the 50 nearest neighbors of a token represent an average of 7.61 writing systems, compared to 1.64 for XLM-R embeddings.
February 21, 2024 at 3:58 PM
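A minimal sketch of how the writing-system diversity of a neighborhood could be counted, again assuming google/mt5-small. The script detector is a rough heuristic based on Unicode character names, not the paper's exact method; averaging this count over tokens grouped by their own script would also give the katakana-vs-hiragana comparison above.

```python
# Minimal sketch: counting how many writing systems appear among a
# token's 50 nearest embedding neighbors. Script detection here is a
# rough heuristic based on Unicode character names.
import unicodedata
import torch
from transformers import AutoTokenizer, AutoModel

def scripts_of(token: str) -> set:
    # First word of the Unicode name approximates the script,
    # e.g. "LATIN SMALL LETTER A" -> "LATIN", "CJK UNIFIED ..." -> "CJK".
    out = set()
    for ch in token.lstrip("▁"):
        name = unicodedata.name(ch, "")
        if name:
            out.add(name.split()[0])
    return out

tok = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModel.from_pretrained("google/mt5-small")
E = torch.nn.functional.normalize(
    model.get_input_embeddings().weight.detach(), dim=-1)

query = tok.convert_tokens_to_ids("▁water")       # illustrative query token
neighbors = (E @ E[query]).topk(51).indices[1:]   # 50 nearest, minus the query
systems = set().union(
    *(scripts_of(t) for t in tok.convert_ids_to_tokens(neighbors.tolist())))
print(len(systems), systems)
```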