Negar Foroutan
@negarforoutan.bsky.social
#NLProc PhD Student at EPFL
In short, Parity-aware BPE = minimal overhead + clear fairness gains. If you care about multilingual robustness, tokenization is low-hanging fruit.
Joint work with Clara Meister, @debjit-paul.bsky.social @joelniklaus.bsky.social @sinaahmadi.bsky.social @abosselut.bsky.social @ricosennrich.bsky.social
August 11, 2025 at 12:28 PM
What’s even more exciting: low- and medium-resource languages benefit the most. We see better vocabulary utilization and compression rates for these languages, showing that our approach gives them a fairer share of the vocabulary.
August 11, 2025 at 12:28 PM
Empirical results: the Gini coefficient of tokenizer disparity (0 means a tokenizer's compression rates are equal across languages) improves by ~83%, while global compression stays nearly the same. On downstream task accuracy, improvements outnumber declines across configurations.
August 11, 2025 at 12:28 PM
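For intuition, here is a minimal sketch of the standard population Gini coefficient applied to per-language compression rates. The rates below are made-up numbers for illustration, and the paper's exact disparity metric may be defined differently.

```python
def gini(values):
    """Population Gini coefficient: 0 = all values equal, larger = more unequal.
    Computed as the mean absolute pairwise difference over twice the mean."""
    n = len(values)
    mean = sum(values) / n
    mad = sum(abs(x - y) for x in values for y in values) / (n * n)
    return mad / (2 * mean)

# Hypothetical per-language compression rates (characters per token).
rates = {"en": 4.2, "de": 3.9, "sw": 2.1, "fa": 2.4}
print(f"Gini over compression rates: {gini(list(rates.values())):.3f}")
```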
It’s a drop-in replacement for existing systems with minimal training-time overhead: if you already use a BPE tokenizer, formats and inference-time tokenization/detokenization are unchanged. You just need language-labeled multilingual corpora and a multi-parallel dev set.
August 11, 2025 at 12:28 PM
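Concretely, those two inputs might look something like this; a hypothetical sketch, with structure and contents purely illustrative.

```python
# Language-labeled training corpora: each text carries a language tag.
train_corpora = {
    "en": ["the cat sat on the mat", "tokenization matters"],
    "sw": ["paka aliketi kwenye mkeka", "tokeni ni muhimu"],
}

# Multi-parallel dev set: the same sentences in every language, so
# per-language compression rates can be compared directly.
dev_set = {
    "en": ["good morning", "thank you"],
    "sw": ["habari za asubuhi", "asante"],
}
```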
What changes from classical BPE? Only a small part of training. We compute frequency stats per language → when choosing the next merge, we pick it from the stats of the language with the worst compression rate, rather than from global stats. Everything else stays the same!
August 11, 2025 at 12:28 PM
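A minimal sketch of that training loop, assuming character-level corpora keyed by language. Function names are illustrative and real BPE training is more involved (word-frequency weighting, vocabulary bookkeeping), but the one parity-aware change is visible: the merge is chosen from the worst-compressed language's statistics and then applied to all languages.

```python
from collections import Counter

def pair_counts(corpus):
    """Count adjacent symbol pairs across a corpus of symbol sequences."""
    counts = Counter()
    for word in corpus:
        for a, b in zip(word, word[1:]):
            counts[(a, b)] += 1
    return counts

def apply_merge(corpus, pair):
    """Replace each occurrence of `pair` with the merged symbol."""
    a, b = pair
    merged = a + b
    out = []
    for word in corpus:
        new_word, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
                new_word.append(merged)
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        out.append(new_word)
    return out

def compression_rate(corpus, n_chars):
    """Characters per token; higher means better compression."""
    n_tokens = sum(len(w) for w in corpus)
    return n_chars / max(n_tokens, 1)

def parity_aware_bpe(corpora, num_merges):
    """corpora: language -> list of sequences, initially lists of characters."""
    n_chars = {lang: sum(len(w) for w in c) for lang, c in corpora.items()}
    merges = []
    for _ in range(num_merges):
        # The only change vs. classical BPE: find the language whose
        # compression rate is currently worst ...
        worst = min(corpora, key=lambda l: compression_rate(corpora[l], n_chars[l]))
        stats = pair_counts(corpora[worst])
        if not stats:
            break
        # ... and pick the most frequent pair from *its* stats, not global ones.
        pair = stats.most_common(1)[0][0]
        merges.append(pair)
        # As in classical BPE, the chosen merge is applied to every language.
        corpora = {lang: apply_merge(c, pair) for lang, c in corpora.items()}
    return merges
```

The output is an ordinary ordered merge list, which is why inference-side tokenization is unchanged: it plugs into any standard BPE tokenizer.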
What’s your take on integrating AI into education while maintaining rigor? 🤔
Check out the paper for the key findings and join the discussion on AI’s place in higher education:
https://www.pnas.org/doi/full/10.1073/pnas.2414955121
December 5, 2024 at 10:20 AM
INCLUDE evaluates how well LLMs grasp regional knowledge—local customs, culture, and info users actually need.
With ~200K questions from 52 countries, it's time to build AI that truly includes 🤗

📄Check out our paper for more details:
arxiv.org/abs/2411.19799
INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge
The performance differential of large language models (LLM) between languages hinders their effective deployment in many regions, inhibiting the potential economic and societal value of generative AI ...
December 2, 2024 at 9:25 PM
✋🏻
November 28, 2024 at 11:08 AM
✋🏻
November 28, 2024 at 11:07 AM