Catherine Arnett
@catherinearnett.bsky.social
NLP Researcher at EleutherAI, PhD UC San Diego Linguistics.
Previously PleIAs, Edinburgh University.
Interested in multilingual NLP, tokenizers, open science.
📍Boston. She/her.
https://catherinearnett.github.io/
We’re still looking to expand the language coverage of Global PIQA, so sign up if you don’t see your language represented yet! bsky.app/profile/mrl-...
It’s not too late to get involved! Until early 2026, we will be accepting submissions for languages not already represented in Global PIQA. If you’re interested, please fill out this form and we will contact you with details!
docs.google.com/forms/d/e/1F...
Global PIQA Contributor Interest Form
October 29, 2025 at 3:53 PM
We found that one of the biggest predictors of token premium effects was whitespace usage. So we also trained SuperBPE tokenizers, which do not use whitespace pretokenizers. SuperBPE tokenizers demonstrate better compression and less extreme token premium effects.
October 28, 2025 at 3:11 PM
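A minimal sketch of the whitespace point, not the paper's actual setup and not the full SuperBPE training procedure: the same Hugging Face `tokenizers` BPE trainer with and without a whitespace pre-tokenizer, on a toy corpus. Without pre-tokenization, merges can cross word boundaries.

```python
# Minimal sketch, not the paper's setup: a BPE tokenizer with a whitespace
# pre-tokenizer vs. one trained on raw text, where merges can cross word
# boundaries (the property SuperBPE exploits). Corpus and vocab size are toys.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the lazy dog sleeps while the quick fox runs",
] * 50

def train_bpe(whitespace_pretok: bool, vocab_size: int = 200) -> Tokenizer:
    tok = Tokenizer(models.BPE(unk_token="[UNK]"))
    if whitespace_pretok:
        # Standard setup: merges never cross a whitespace boundary.
        tok.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size, special_tokens=["[UNK]"], show_progress=False
    )
    tok.train_from_iterator(corpus, trainer=trainer)
    return tok

text = "the lazy fox jumps over the quick dog"
for label, tok in [("whitespace BPE", train_bpe(True)),
                   ("no-pretok BPE ", train_bpe(False))]:
    enc = tok.encode(text)
    print(f"{label}: {len(enc.ids):2d} tokens, "
          f"{len(text.encode('utf-8')) / len(enc.ids):.2f} bytes/token")
```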
While it’s possible to achieve the same compression for some sets of languages by manipulating vocabulary size, there are some languages for which changing the vocab size does not lead to the same compression.
October 28, 2025 at 3:11 PM
We show that some languages need more vocabulary items to reach the same compression. This suggests that multilingual tokenizers should allocate different amounts of vocabulary to different languages, which can help us design more equitable multilingual tokenizers.
October 28, 2025 at 3:11 PM
We used the compression rates we got from our monolingual tokenizers to estimate the vocabulary size at which a tokenizer would reach a target compression rate. We used this to determine the “optimal” vocab size for each language. This significantly reduces token premium effects.
October 28, 2025 at 3:11 PM
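A rough sketch of that estimation step, with made-up numbers and an assumed log-linear relationship between compression and vocab size (the paper's fitting procedure may differ):

```python
# Rough sketch: fit compression (bytes per token) as a function of vocab size
# for one language, then invert the fit to find the vocab size expected to hit
# a target compression. The numbers and the log-linear form are illustrative.
import numpy as np

vocab_sizes = np.array([8_000, 16_000, 32_000, 64_000, 128_000])
bytes_per_token = np.array([2.9, 3.3, 3.7, 4.1, 4.5])  # hypothetical measurements

# Fit bytes_per_token ≈ a * log(vocab_size) + b.
a, b = np.polyfit(np.log(vocab_sizes), bytes_per_token, deg=1)

def vocab_for_target(target_bpt: float) -> int:
    """Vocab size at which the fitted curve reaches the target compression."""
    return int(round(np.exp((target_bpt - b) / a)))

target = 4.0  # desired bytes per token, shared across languages
print(f"estimated vocab size for {target} bytes/token: {vocab_for_target(target):,}")
```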
We trained 7,000 monolingual tokenizers across 97 languages and a range of vocabulary sizes. There was no vocabulary size at which token premiums went away, though larger vocabularies unsurprisingly led to better compression and slightly smaller token premiums.
October 28, 2025 at 3:11 PM
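A toy version of that sweep using the Hugging Face `tokenizers` library; the corpora, held-out text, and vocab sizes below are small stand-ins, not the study's data or scale:

```python
# Toy sketch of the sweep, not the actual pipeline: train one BPE tokenizer per
# (language, vocab size) pair and measure compression on held-out text.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

corpora = {  # stand-ins for per-language training data
    "eng": ["the cat sat on the mat", "a dog ran in the park"] * 100,
    "spa": ["el gato se sentó en la alfombra", "un perro corrió en el parque"] * 100,
}
heldout = {  # stand-ins for per-language evaluation text
    "eng": "the dog sat in the park",
    "spa": "el perro se sentó en el parque",
}
vocab_sizes = [100, 200, 400]  # the study swept far larger vocabularies

results = {}
for lang, sentences in corpora.items():
    for vs in vocab_sizes:
        tok = Tokenizer(models.BPE(unk_token="[UNK]"))
        tok.pre_tokenizer = pre_tokenizers.Whitespace()
        trainer = trainers.BpeTrainer(
            vocab_size=vs, special_tokens=["[UNK]"], show_progress=False
        )
        tok.train_from_iterator(sentences, trainer=trainer)
        n_tokens = len(tok.encode(heldout[lang]).ids)
        # Compression as UTF-8 bytes per token: higher means better compression.
        results[(lang, vs)] = len(heldout[lang].encode("utf-8")) / n_tokens

for (lang, vs), bpt in sorted(results.items()):
    print(f"{lang}  vocab={vs:>4}  bytes/token={bpt:.2f}")
```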
Compression isn’t the only tokenizer metric that matters, but it directly determines how much computation a model needs to process a given text. That affects both training and inference cost. Ideally, we want compression rates to be as similar as possible across languages.
October 28, 2025 at 3:11 PM
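For concreteness, here is one illustrative way to measure this with an off-the-shelf multilingual tokenizer on parallel sentences, reporting each language's token count relative to English as a "token premium". The model, sentences, and choice of English as the reference are arbitrary and not the paper's evaluation setup:

```python
# Illustrative sketch (not the paper's evaluation): measure compression of one
# multilingual tokenizer on parallel sentences and express each language's cost
# relative to English ("token premium"). Model choice and sentences are arbitrary.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")

parallel = {  # tiny stand-in for a parallel corpus
    "eng": "The weather is nice today and we are going to the market.",
    "spa": "El clima está agradable hoy y vamos al mercado.",
    "deu": "Das Wetter ist heute schön und wir gehen auf den Markt.",
}

counts = {
    lang: len(tok.encode(text, add_special_tokens=False))
    for lang, text in parallel.items()
}

for lang, n in counts.items():
    premium = n / counts["eng"]  # >1 means the language pays more tokens than English
    bytes_per_token = len(parallel[lang].encode("utf-8")) / n
    print(f"{lang}: {n:2d} tokens, {bytes_per_token:.2f} bytes/token, premium ×{premium:.2f}")
```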
Yeah, I have many thoughts about that post. I do have a follow-up post brewing 👀 it will probably be some months before I finish it though!
October 23, 2025 at 1:00 AM
Yeah, I think the models do generally capture this well and with a lot of flexibility. I think when people have done morphological tokenization, it tends to be really rigid and fragile to anything OOD.
September 26, 2025 at 10:19 PM
I guess the idea is basically to map strings of text to some kind of abstract representation of meaning and grammar? Maybe the closest thing is morphological tokenization. But to do this fully you would kind of need to solve Language first.
September 26, 2025 at 9:56 PM
Thanks!
September 26, 2025 at 5:58 PM