andrea wang
@andreawwenyi.bsky.social
phd @ cornell infosci
https://andreawwenyi.github.io
In fact, many LLMs from China fail to even recognize some lower-resource Chinese languages such as Uyghur.
April 9, 2025 at 8:28 PM
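A hypothetical probe in the spirit of this finding, not the paper's evaluation: ask a model to identify Uyghur text. This assumes an OpenAI-compatible endpoint (DeepSeek provides one); the API key, model name, and sample text are placeholders.

```python
# Hypothetical language-identification probe; not the paper's evaluation.
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")

uyghur_sample = "..."  # placeholder: a Uyghur sentence in Arabic script

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user",
               "content": f"Identify the language of this text:\n{uyghur_sample}"}],
)
print(response.choices[0].message.content)
```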
LLMs from China are highly correlated with Western LLMs in multilingual performance (0.93–0.99) on tasks such as reading comprehension.
April 9, 2025 at 8:28 PM
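A minimal sketch of the correlation computation: Pearson r between two models' per-language benchmark scores. All numbers below are illustrative placeholders, not results from the paper.

```python
# Pearson correlation of per-language scores between two models.
import numpy as np

# hypothetical reading-comprehension scores over the same set of languages
model_cn = np.array([0.91, 0.62, 0.41, 0.35, 0.88])  # a Chinese LLM
model_us = np.array([0.89, 0.60, 0.44, 0.33, 0.90])  # a Western LLM

r = np.corrcoef(model_cn, model_us)[0, 1]
print(f"Pearson r = {r:.2f}")
```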
[New preprint!] Do Chinese AI Models Speak Chinese Languages? Not really. Chinese LLMs like DeepSeek are better at French than Cantonese. Joint work with Unso Jo and @dmimno.bsky.social. Link to paper: arxiv.org/pdf/2504.00289
🧵
April 9, 2025 at 8:28 PM
Embedding geometries are similar across model families and scales, as measured by canonical angles. XLM-R models are extremely similar to each other, as are mT5-small and mT5-base. All models are far from random (0.14–0.27).
February 21, 2024 at 3:59 PM
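A minimal sketch of the canonical-angle comparison, assuming each model's embedding matrix is restricted to rows for a shared token set. The paper's exact summary statistic may differ; here the cosines of the principal angles between the subspaces spanned by the embedding dimensions are averaged.

```python
# Compare two embedding geometries via canonical (principal) angles.
import numpy as np
from scipy.linalg import subspace_angles

def geometry_similarity(E1, E2):
    """E1, E2: (n_shared_tokens, dim) matrices over the same token rows."""
    angles = subspace_angles(E1, E2)  # principal angles between column spaces
    return float(np.mean(np.cos(angles)))

# random baseline: two unrelated Gaussian matrices of the same shapes
rng = np.random.default_rng(0)
print(geometry_similarity(rng.normal(size=(5000, 768)),
                          rng.normal(size=(5000, 512))))
```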
The diversity of neighborhoods in mT5 varies by category. Among tokens in the two Japanese writing systems, KATAKANA (used for words of foreign origin) has more diverse neighbors than HIRAGANA (used for native Japanese words).
February 21, 2024 at 3:59 PM
The nearest neighbors of mT5 tokens are often translations. NLP spent 10 years trying to make word embeddings align across languages. mT5 embeddings find cross-lingual semantic alignment without even being asked!
February 21, 2024 at 3:58 PM
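A minimal sketch of the neighbor lookup: cosine similarity over an embedding matrix E (tokens x dims) with its token list vocab, e.g. as extracted in the sketch at the end of this thread.

```python
# Nearest neighbors of a token by cosine similarity in embedding space.
import numpy as np

def nearest_neighbors(E, vocab, token_id, k=10):
    """k tokens closest to `token_id` by cosine similarity (query excluded)."""
    E_norm = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = E_norm @ E_norm[token_id]
    top = np.argsort(-sims)[1:k + 1]  # skip the query token itself
    return [(vocab[i], float(sims[i])) for i in top]
```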
mT5 embedding neighborhoods are more linguistically diverse: the 50 nearest neighbors for any token represent an average of 7.61 writing systems, compared to 1.64 with XLM-R embeddings.
February 21, 2024 at 3:58 PM
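A sketch of the diversity measure: the number of distinct writing systems among a token's 50 nearest neighbors. The first word of a character's Unicode name serves as a cheap script proxy (e.g. LATIN, HIRAGANA, KATAKANA, CJK); the paper's script tagging may differ.

```python
# Count distinct writing systems among a token's nearest neighbors.
import unicodedata
import numpy as np

def writing_system(token):
    """Approximate a token's script from its first letter-like character."""
    for ch in token:
        if ch.isalpha():
            return unicodedata.name(ch, "UNKNOWN").split()[0]
    return "OTHER"

def neighborhood_diversity(E, vocab, token_id, k=50):
    E_norm = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = E_norm @ E_norm[token_id]
    top = np.argsort(-sims)[1:k + 1]
    return len({writing_system(vocab[i]) for i in top})
```

Averaging this over the shared vocabulary gives per-model numbers like 7.61 vs 1.64; averaging separately over HIRAGANA and KATAKANA tokens gives the Japanese writing-system comparison two posts up.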
Tokens in different writing systems can be linearly separated with an average of 99.2% accuracy for XLM-R. Even in high-dimensional space, mT5 embeddings are less separable.
February 21, 2024 at 3:57 PM
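A sketch of the separability test: a linear classifier predicting each token's writing system from its embedding, using the writing_system helper above. The paper's exact classifier and evaluation protocol are not spelled out here; this uses logistic regression with a held-out split.

```python
# Linear separability of writing systems in embedding space.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def script_separability(E, vocab):
    labels = np.array([writing_system(tok) for tok in vocab])  # helper above
    X_tr, X_te, y_tr, y_te = train_test_split(E, labels, test_size=0.2,
                                              random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)  # held-out accuracy
```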
XLM-R embeddings partition by language, while mT5 has more overlap. Since it's hard to identify the language of BPE tokens, we use Unicode writing systems instead. Both plots contain points for the intersection of the two models' vocabularies.
February 21, 2024 at 3:57 PM
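A sketch of the kind of plot described: a 2-D projection of the shared vocabulary's embeddings, colored by writing system. The paper's figures may use a different projection method.

```python
# Scatter the embedding space in 2-D, colored by Unicode writing system.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

def plot_by_script(E, vocab, title):
    labels = np.array([writing_system(tok) for tok in vocab])  # helper above
    coords = PCA(n_components=2).fit_transform(E)
    for script in sorted(set(labels)):
        mask = labels == script
        plt.scatter(coords[mask, 0], coords[mask, 1], s=2, label=script)
    plt.legend(markerscale=4, fontsize=6)
    plt.title(title)
    plt.show()
```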
How do LLMs represent relationships between languages? By studying the embedding layers of XLM-R and mT5, we find they are highly interpretable. LLMs can find semantic alignment as an emergent property! Joint work with @dmimno.bsky.social. 🧵
February 21, 2024 at 3:57 PM
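A minimal sketch of pulling the input-embedding matrices the thread studies from off-the-shelf Hugging Face checkpoints; the sketches above assume E and vocab come from here.

```python
# Extract input-embedding matrices and token lists for XLM-R and mT5.
from transformers import AutoModel, AutoTokenizer

for name in ["xlm-roberta-base", "google/mt5-base"]:
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)
    E = model.get_input_embeddings().weight.detach().numpy()  # (vocab, dim)
    vocab = tok.convert_ids_to_tokens(list(range(len(tok))))
    print(name, E.shape)
```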