Catherine Arnett
@catherinearnett.bsky.social
NLP Researcher at EleutherAI, PhD UC San Diego Linguistics.
Previously PleIAs, Edinburgh University.
Interested in multilingual NLP, tokenizers, open science.
📍Boston. She/her.
https://catherinearnett.github.io/
We’re still looking to expand the language coverage of Global PIQA, so sign up if you don’t see your language represented yet! bsky.app/profile/mrl-...
It’s not too late to get involved! Until early 2026, we will be accepting submissions for languages not already represented in Global PIQA. If you’re interested, please fill out this form and we will contact you with details!
docs.google.com/forms/d/e/1F...
Global PIQA Contributor Interest Form
October 29, 2025 at 3:53 PM
We found that one of the biggest predictors of token premium effects was whitespace usage. So we also trained SuperBPE tokenizers, which do not use whitespace pretokenizers. SuperBPE tokenizers demonstrate better compression and less extreme token premium effects.
October 28, 2025 at 3:11 PM
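A minimal sketch of the whitespace point, not the paper's actual setup and not the full SuperBPE training procedure: the same Hugging Face `tokenizers` BPE trainer with and without a whitespace pre-tokenizer, on a toy corpus. Without pre-tokenization, merges can cross word boundaries.

```python
# Minimal sketch, not the paper's setup: a BPE tokenizer with a whitespace
# pre-tokenizer vs. one trained on raw text, where merges can cross word
# boundaries (the property SuperBPE exploits). Corpus and vocab size are toys.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the lazy dog sleeps while the quick fox runs",
] * 50

def train_bpe(whitespace_pretok: bool, vocab_size: int = 200) -> Tokenizer:
    tok = Tokenizer(models.BPE(unk_token="[UNK]"))
    if whitespace_pretok:
        # Standard setup: merges never cross a whitespace boundary.
        tok.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size, special_tokens=["[UNK]"], show_progress=False
    )
    tok.train_from_iterator(corpus, trainer=trainer)
    return tok

text = "the lazy fox jumps over the quick dog"
for label, tok in [("whitespace BPE", train_bpe(True)),
                   ("no-pretok BPE ", train_bpe(False))]:
    enc = tok.encode(text)
    print(f"{label}: {len(enc.ids):2d} tokens, "
          f"{len(text.encode('utf-8')) / len(enc.ids):.2f} bytes/token")
```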
While it’s possible to achieve the same compression for some sets of languages by manipulating vocabulary size, there are some languages for which changing the vocab size does not lead to the same compression.
October 28, 2025 at 3:11 PM
We show that some languages need more vocabulary items to reach the same compression. This suggests that multilingual tokenizers should allocate different amounts of vocabulary to different languages, which can help us design more equitable multilingual tokenizers.
October 28, 2025 at 3:11 PM
We used the compression rates we got from our monolingual tokenizers to estimate the vocabulary size at which a tokenizer would reach a target compression rate. We used this to determine the “optimal” vocab size for each language. This significantly reduces token premium effects.
October 28, 2025 at 3:11 PM
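A rough sketch of that estimation step, with made-up numbers and an assumed log-linear relationship between compression and vocab size (the paper's fitting procedure may differ):

```python
# Rough sketch: fit compression (bytes per token) as a function of vocab size
# for one language, then invert the fit to find the vocab size expected to hit
# a target compression. The numbers and the log-linear form are illustrative.
import numpy as np

vocab_sizes = np.array([8_000, 16_000, 32_000, 64_000, 128_000])
bytes_per_token = np.array([2.9, 3.3, 3.7, 4.1, 4.5])  # hypothetical measurements

# Fit bytes_per_token ≈ a * log(vocab_size) + b.
a, b = np.polyfit(np.log(vocab_sizes), bytes_per_token, deg=1)

def vocab_for_target(target_bpt: float) -> int:
    """Vocab size at which the fitted curve reaches the target compression."""
    return int(round(np.exp((target_bpt - b) / a)))

target = 4.0  # desired bytes per token, shared across languages
print(f"estimated vocab size for {target} bytes/token: {vocab_for_target(target):,}")
```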
We trained 7,000 monolingual tokenizers across 97 languages and a range of vocabulary sizes. There was no vocabulary size at which token premiums went away, though larger vocabularies unsurprisingly led to better compression and slightly smaller token premiums.
October 28, 2025 at 3:11 PM
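A toy version of that sweep using the Hugging Face `tokenizers` library; the corpora, held-out text, and vocab sizes below are small stand-ins, not the study's data or scale:

```python
# Toy sketch of the sweep, not the actual pipeline: train one BPE tokenizer per
# (language, vocab size) pair and measure compression on held-out text.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

corpora = {  # stand-ins for per-language training data
    "eng": ["the cat sat on the mat", "a dog ran in the park"] * 100,
    "spa": ["el gato se sentó en la alfombra", "un perro corrió en el parque"] * 100,
}
heldout = {  # stand-ins for per-language evaluation text
    "eng": "the dog sat in the park",
    "spa": "el perro se sentó en el parque",
}
vocab_sizes = [100, 200, 400]  # the study swept far larger vocabularies

results = {}
for lang, sentences in corpora.items():
    for vs in vocab_sizes:
        tok = Tokenizer(models.BPE(unk_token="[UNK]"))
        tok.pre_tokenizer = pre_tokenizers.Whitespace()
        trainer = trainers.BpeTrainer(
            vocab_size=vs, special_tokens=["[UNK]"], show_progress=False
        )
        tok.train_from_iterator(sentences, trainer=trainer)
        n_tokens = len(tok.encode(heldout[lang]).ids)
        # Compression as UTF-8 bytes per token: higher means better compression.
        results[(lang, vs)] = len(heldout[lang].encode("utf-8")) / n_tokens

for (lang, vs), bpt in sorted(results.items()):
    print(f"{lang}  vocab={vs:>4}  bytes/token={bpt:.2f}")
```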
Compression isn’t the only tokenizer metric that matters, but it directly determines how much computation a model needs to process a given text. That affects both training and inference cost. Ideally, we want compression rates to be as similar as possible across languages.
October 28, 2025 at 3:11 PM
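For concreteness, here is one illustrative way to measure this with an off-the-shelf multilingual tokenizer on parallel sentences, reporting each language's token count relative to English as a "token premium". The model, sentences, and choice of English as the reference are arbitrary and not the paper's evaluation setup:

```python
# Illustrative sketch (not the paper's evaluation): measure compression of one
# multilingual tokenizer on parallel sentences and express each language's cost
# relative to English ("token premium"). Model choice and sentences are arbitrary.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")

parallel = {  # tiny stand-in for a parallel corpus
    "eng": "The weather is nice today and we are going to the market.",
    "spa": "El clima está agradable hoy y vamos al mercado.",
    "deu": "Das Wetter ist heute schön und wir gehen auf den Markt.",
}

counts = {
    lang: len(tok.encode(text, add_special_tokens=False))
    for lang, text in parallel.items()
}

for lang, n in counts.items():
    premium = n / counts["eng"]  # >1 means the language pays more tokens than English
    bytes_per_token = len(parallel[lang].encode("utf-8")) / n
    print(f"{lang}: {n:2d} tokens, {bytes_per_token:.2f} bytes/token, premium ×{premium:.2f}")
```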
Yeah, I have many thoughts about that post. I do have a follow-up post brewing 👀 it will probably be some months before I finish it though!
October 23, 2025 at 1:00 AM
Yeah, I think the models do generally capture this well and with a lot of flexibility. I think when people have done morphological tokenization, it tends to be really rigid and fragile to anything OOD.
September 26, 2025 at 10:19 PM
I guess the idea is basically to map strings of text to some kind of abstract representation of meaning and grammar? Maybe the closest thing is morphological tokenization. But to do this fully you would kind of need to solve Language first.
September 26, 2025 at 9:56 PM
Thanks!
September 26, 2025 at 5:58 PM