Catherine Arnett
@catherinearnett.bsky.social
NLP Researcher at EleutherAI, PhD UC San Diego Linguistics.
Previously PleIAs, Edinburgh University.
Interested in multilingual NLP, tokenizers, open science.
📍Boston. She/her.
https://catherinearnett.github.io/
Previously PleIAs, Edinburgh University.
Interested in multilingual NLP, tokenizers, open science.
📍Boston. She/her.
https://catherinearnett.github.io/
Pinned
Introducing Global PIQA, a new multilingual benchmark for 100+ languages. This benchmark is the outcome of this year’s MRL shared task, in collaboration with 300+ researchers from 65 countries. This dataset evaluates physical commonsense reasoning in culturally relevant contexts.
I’m so excited that Global PIQA is out! This has been a herculean effort by our 300+ contributors. The result is an extremely high-quality, culturally-specific benchmark for over 100 languages.
Reposted by Catherine Arnett
We have kicked off proceedings with some brief opening remarks from @catherinearnett.bsky.social
November 9, 2025 at 1:26 AM
We have kicked off proceedings with some brief opening remarks from @catherinearnett.bsky.social
Reposted by Catherine Arnett
🚨 EvalEval is back - now in San Diego!🚨
🧠 Join us for the 2025 Workshop on "Evaluating AI in Practice Bridging Statistical Rigor, Sociotechnical Insights, and Ethical Boundaries" (Co-hosted with UKAISI)
📅 Dec 8, 2025
📝 Abstract due: Nov 20, 2025
Details below! ⬇️
evalevalai.com/events/works...
🧠 Join us for the 2025 Workshop on "Evaluating AI in Practice Bridging Statistical Rigor, Sociotechnical Insights, and Ethical Boundaries" (Co-hosted with UKAISI)
📅 Dec 8, 2025
📝 Abstract due: Nov 20, 2025
Details below! ⬇️
evalevalai.com/events/works...
evalevalai.com
November 6, 2025 at 9:19 PM
🚨 EvalEval is back - now in San Diego!🚨
🧠 Join us for the 2025 Workshop on "Evaluating AI in Practice Bridging Statistical Rigor, Sociotechnical Insights, and Ethical Boundaries" (Co-hosted with UKAISI)
📅 Dec 8, 2025
📝 Abstract due: Nov 20, 2025
Details below! ⬇️
evalevalai.com/events/works...
🧠 Join us for the 2025 Workshop on "Evaluating AI in Practice Bridging Statistical Rigor, Sociotechnical Insights, and Ethical Boundaries" (Co-hosted with UKAISI)
📅 Dec 8, 2025
📝 Abstract due: Nov 20, 2025
Details below! ⬇️
evalevalai.com/events/works...
I’m so excited that Global PIQA is out! This has been a herculean effort by our 300+ contributors. The result is an extremely high-quality, culturally-specific benchmark for over 100 languages.
Introducing Global PIQA, a new multilingual benchmark for 100+ languages. This benchmark is the outcome of this year’s MRL shared task, in collaboration with 300+ researchers from 65 countries. This dataset evaluates physical commonsense reasoning in culturally relevant contexts.
October 29, 2025 at 3:53 PM
I’m so excited that Global PIQA is out! This has been a herculean effort by our 300+ contributors. The result is an extremely high-quality, culturally-specific benchmark for over 100 languages.
Our #NeurIPS2025 paper shows that even comparable monolingual tokenizers have different compression rates across languages. But by getting rid of whitespace tokenization and using a custom vocab size for each language, we can reduce token premiums. Preprint out now!
October 28, 2025 at 3:11 PM
Our #NeurIPS2025 paper shows that even comparable monolingual tokenizers have different compression rates across languages. But by getting rid of whitespace tokenization and using a custom vocab size for each language, we can reduce token premiums. Preprint out now!
Reposted by Catherine Arnett
WMDQS is underway! Come join us in Room 520A at @colmweb.org! #COLM2025
October 10, 2025 at 4:18 PM
WMDQS is underway! Come join us in Room 520A at @colmweb.org! #COLM2025
Reposted by Catherine Arnett
In collaboration with @commoncrawl.bsky.social, MLCommons, and @eleutherai.bsky.social, the first edition of WMDQS at @colmweb.org starts tomorrow in Room 520A! We have an updated schedule on our website, including a list of all accepted papers.
October 9, 2025 at 8:17 PM
In collaboration with @commoncrawl.bsky.social, MLCommons, and @eleutherai.bsky.social, the first edition of WMDQS at @colmweb.org starts tomorrow in Room 520A! We have an updated schedule on our website, including a list of all accepted papers.
I’m in Montreal this week for @colmweb.org and @wmdqs.bsky.social! Looking forward to chatting about tokenizers, multilingual data, and more! #COLM2025
October 6, 2025 at 9:30 PM
I’m in Montreal this week for @colmweb.org and @wmdqs.bsky.social! Looking forward to chatting about tokenizers, multilingual data, and more! #COLM2025
I have a new blog post about the so-called “tokenizer-free” approach to language modeling and why it’s not tokenizer-free at all. I also talk about why people hate tokenizers so much!
September 25, 2025 at 3:14 PM
I have a new blog post about the so-called “tokenizer-free” approach to language modeling and why it’s not tokenizer-free at all. I also talk about why people hate tokenizers so much!
Did you know?
❌77% of language models on @hf.co are not tagged for any language
📈For 95% of languages, most models are multilingual
🚨88% of models with tags are trained on English
In a new blog post, @tylerachang.bsky.social and I dig into these trends and why they matter! 👇
❌77% of language models on @hf.co are not tagged for any language
📈For 95% of languages, most models are multilingual
🚨88% of models with tags are trained on English
In a new blog post, @tylerachang.bsky.social and I dig into these trends and why they matter! 👇
September 19, 2025 at 2:53 PM
Did you know?
❌77% of language models on @hf.co are not tagged for any language
📈For 95% of languages, most models are multilingual
🚨88% of models with tags are trained on English
In a new blog post, @tylerachang.bsky.social and I dig into these trends and why they matter! 👇
❌77% of language models on @hf.co are not tagged for any language
📈For 95% of languages, most models are multilingual
🚨88% of models with tags are trained on English
In a new blog post, @tylerachang.bsky.social and I dig into these trends and why they matter! 👇
Reposted by Catherine Arnett
We are in need of some emergency reviewers for MRL. If you are available, please fill out this form!
If you would like to sign up to be a reviewer, please fill in this form: t.co/fbunVuVhdE
https://forms.gle/fbizvGghD33cP3HP7
t.co
September 12, 2025 at 6:31 PM
We are in need of some emergency reviewers for MRL. If you are available, please fill out this form!
Reposted by Catherine Arnett
We extended the deadline by one day, so you have until the end of today (Aug 24) AoE to submit! Good luck!
The deadline for MRL at #EMNLP2025 is next week!
⏰ Submission Deadline: August 23rd (AoE)
🔗 CfP: sigtyp.github.io/ws2025-mrl.h...
⏰ Submission Deadline: August 23rd (AoE)
🔗 CfP: sigtyp.github.io/ws2025-mrl.h...
The submission deadline for the 5th Workshop on Multilingual Representation Learning is coming up! See details below!
August 24, 2025 at 10:08 PM
We extended the deadline by one day, so you have until the end of today (Aug 24) AoE to submit! Good luck!
Reposted by Catherine Arnett
We have over 200 volunteers now for 90+ languages! We are hoping to expand the diversity of our language coverage and are still looking for participants who speak these languages. Check out how to get involved below, and please help us spread the word!
August 18, 2025 at 3:53 PM
We have over 200 volunteers now for 90+ languages! We are hoping to expand the diversity of our language coverage and are still looking for participants who speak these languages. Check out how to get involved below, and please help us spread the word!
Reposted by Catherine Arnett
With six weeks left before the deadline, we have had over 50 volunteers sign up to contribute for over 30 languages. If you don’t see your language represented on the map, this is your sign to get involved!
August 5, 2025 at 3:13 PM
With six weeks left before the deadline, we have had over 50 volunteers sign up to contribute for over 30 languages. If you don’t see your language represented on the map, this is your sign to get involved!
I’m in Vienna all week for @aclmeeting.bsky.social and I’ll be presenting this paper on Wednesday at 11am (Poster Session 4 in HALL X4 X5)! Reach out if you want to chat about multilingual NLP, tokenizers, and open models!
✨New pre-print✨ Crosslingual transfer allows models to leverage their representations for one language to improve performance on another language. We characterize the acquisition of shared representations in order to better understand how and when crosslingual transfer happens.
July 27, 2025 at 3:29 PM
I’m in Vienna all week for @aclmeeting.bsky.social and I’ll be presenting this paper on Wednesday at 11am (Poster Session 4 in HALL X4 X5)! Reach out if you want to chat about multilingual NLP, tokenizers, and open models!
Reposted by Catherine Arnett
If you want to help us improve language and cultural coverage, and build an open source LangID system, please register to our shared task on Language Identification! 💬
Registering is easy! All the details are on the shared task webpage: wmdqs.org/shared-task/
Deadline: July 23, 2025 (AoE) ⏰
Registering is easy! All the details are on the shared task webpage: wmdqs.org/shared-task/
Deadline: July 23, 2025 (AoE) ⏰
July 21, 2025 at 10:40 PM
If you want to help us improve language and cultural coverage, and build an open source LangID system, please register to our shared task on Language Identification! 💬
Registering is easy! All the details are on the shared task webpage: wmdqs.org/shared-task/
Deadline: July 23, 2025 (AoE) ⏰
Registering is easy! All the details are on the shared task webpage: wmdqs.org/shared-task/
Deadline: July 23, 2025 (AoE) ⏰
Really grateful to the organizers for the recognition of our work!
🏆 Announcing our Best Paper Awards!
🥇 Winner: "BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization" openreview.net/forum?id=AO7...
🥈 Runner-up: "One-D-Piece: Image Tokenizer Meets Quality-Controllable Compression" openreview.net/forum?id=lC4...
Congrats! 🎉
🥇 Winner: "BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization" openreview.net/forum?id=AO7...
🥈 Runner-up: "One-D-Piece: Image Tokenizer Meets Quality-Controllable Compression" openreview.net/forum?id=lC4...
Congrats! 🎉
July 19, 2025 at 1:55 PM
Really grateful to the organizers for the recognition of our work!
I'll be at ICML next week for the Tokenization Workshop @tokshop.bsky.social presenting two papers:
"Evaluating Morphological Alignment of Tokenizers in 70 Languages" and "BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization". Check out the paper threads below!
"Evaluating Morphological Alignment of Tokenizers in 70 Languages" and "BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization". Check out the paper threads below!
July 10, 2025 at 4:13 PM
I'll be at ICML next week for the Tokenization Workshop @tokshop.bsky.social presenting two papers:
"Evaluating Morphological Alignment of Tokenizers in 70 Languages" and "BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization". Check out the paper threads below!
"Evaluating Morphological Alignment of Tokenizers in 70 Languages" and "BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization". Check out the paper threads below!
MorphScore got an update! MorphScore now covers 70 languages 🌎🌍🌏 We have a new-preprint out and we will be presenting our paper at the Tokenization Workshop @tokshop.bsky.social at ICML next week! @marisahudspeth.bsky.social @brenocon.bsky.social
July 10, 2025 at 4:09 PM
MorphScore got an update! MorphScore now covers 70 languages 🌎🌍🌏 We have a new-preprint out and we will be presenting our paper at the Tokenization Workshop @tokshop.bsky.social at ICML next week! @marisahudspeth.bsky.social @brenocon.bsky.social
Just a few days left to contribute annotations before the first release of training data. We have over 17,000 document annotations so far!
July 9, 2025 at 2:21 PM
Just a few days left to contribute annotations before the first release of training data. We have over 17,000 document annotations so far!
Reposted by Catherine Arnett
Stop by our discover server tomorrow, Friday June 27th, to hear about @catherinearnett.bsky.social's work!
We are launching a new speaker series at EleutherAI, focused on promoting recent research by our team and community members.
Our first talk is by @catherinearnett.bsky.social on tokenizers, their limitations, and how to improve them.
Our first talk is by @catherinearnett.bsky.social on tokenizers, their limitations, and how to improve them.
June 26, 2025 at 6:18 PM
Stop by our discover server tomorrow, Friday June 27th, to hear about @catherinearnett.bsky.social's work!
I'm really excited about this shared task! We hope to create a massively multilingual physical reasoning dataset in collaboration with researchers around the world 🌍
As part of the workshop, we are also organizing a shared task to develop a collaborative physical commonsense reasoning evaluation dataset. See the shared task page for more information: sigtyp.github.io/st2025-mrl.h....
June 25, 2025 at 3:43 PM
I'm really excited about this shared task! We hope to create a massively multilingual physical reasoning dataset in collaboration with researchers around the world 🌍
The call for papers is out for the 5th edition of the Workshop on Multilingual Representation Learning which will take place in Suzhou, China co-located with EMNLP 2025! See details below!
June 24, 2025 at 4:33 PM
The call for papers is out for the 5th edition of the Workshop on Multilingual Representation Learning which will take place in Suzhou, China co-located with EMNLP 2025! See details below!
Reposted by Catherine Arnett
Reposted by Catherine Arnett
Call for papers!
We are organising the 1st Workshop on Multilingual Data Quality Signals with @mlcommons.org and @eleutherai.bsky.social, held in tandem with @colmweb.org. Submit your research on multilingual data quality!
Submission deadline is 23 June, more info: wmdqs.org
We are organising the 1st Workshop on Multilingual Data Quality Signals with @mlcommons.org and @eleutherai.bsky.social, held in tandem with @colmweb.org. Submit your research on multilingual data quality!
Submission deadline is 23 June, more info: wmdqs.org
1st Workshop on Multilingual Data Quality Signals
wmdqs.org
May 29, 2025 at 5:18 PM
Call for papers!
We are organising the 1st Workshop on Multilingual Data Quality Signals with @mlcommons.org and @eleutherai.bsky.social, held in tandem with @colmweb.org. Submit your research on multilingual data quality!
Submission deadline is 23 June, more info: wmdqs.org
We are organising the 1st Workshop on Multilingual Data Quality Signals with @mlcommons.org and @eleutherai.bsky.social, held in tandem with @colmweb.org. Submit your research on multilingual data quality!
Submission deadline is 23 June, more info: wmdqs.org