Negar Foroutan
@negarforoutan.bsky.social
#NLProc PhD Student at EPFL
What’s even more exciting: low- and medium-resource languages benefit the most. We see better vocabulary utilization and compression rates for these languages, highlighting the effectiveness of our approach in allocating the vocabulary more fairly across languages.
August 11, 2025 at 12:28 PM
Empirical results: the Gini coefficient of tokenizer disparity (0 indicates a tokenizer's compression rates are equal across languages) improves by ~83%, while global compression remains very similar. On downstream task accuracy, improvements outnumber declines across configurations.
August 11, 2025 at 12:28 PM
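(For readers who want to see the disparity metric concretely: below is a minimal sketch of a Gini coefficient computed over per-language compression rates. The exact definition used in the paper may differ, and the sample rates in the snippet are made up for illustration.)

```python
import numpy as np

def gini(values: np.ndarray) -> float:
    """Gini coefficient: 0 when all values are equal, approaching 1 under extreme inequality."""
    v = np.sort(np.asarray(values, dtype=float))  # sort ascending
    n = v.size
    index = np.arange(1, n + 1)
    # Standard closed-form expression over the ordered values.
    return float((2 * index - n - 1) @ v / (n * v.sum()))

# Hypothetical per-language compression rates (e.g., characters per token);
# these are NOT the paper's numbers, just placeholders.
compression = np.array([4.1, 3.9, 2.2, 1.7, 1.5])
print(f"Gini = {gini(compression):.3f}")
```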
🚨New Preprint!

In multilingual models, the same meaning can take far more tokens in some languages, penalizing users of underrepresented languages with worse performance and higher API costs. Our Parity-aware BPE algorithm is a step toward addressing this issue: 🧵
August 11, 2025 at 12:28 PM
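(To illustrate the general idea behind a parity-aware merge selection, here is a toy sketch: instead of always merging the globally most frequent pair, each merge step favors the language whose corpus is currently worst compressed. This is only an assumption-laden illustration, not the paper's actual algorithm; the corpora and the tokens-per-character proxy below are hypothetical.)

```python
from collections import Counter

def pair_counts(words):
    """Count adjacent symbol pairs in a corpus given as a list of symbol sequences."""
    counts = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            counts[(a, b)] += 1
    return counts

def apply_merge(words, pair):
    """Replace every occurrence of `pair` with the merged symbol."""
    merged = pair[0] + pair[1]
    out = []
    for w in words:
        new_w, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                new_w.append(merged)
                i += 2
            else:
                new_w.append(w[i])
                i += 1
        out.append(new_w)
    return out

def parity_aware_bpe(corpora, num_merges):
    """Toy parity-aware BPE: each step picks the most frequent pair in the
    language with the worst compression (highest tokens per character)."""
    merges = []
    for _ in range(num_merges):
        worst_lang = max(
            corpora,
            key=lambda l: sum(len(w) for w in corpora[l])
            / sum(len("".join(w)) for w in corpora[l]),
        )
        counts = pair_counts(corpora[worst_lang])
        if not counts:
            break
        pair = counts.most_common(1)[0][0]
        merges.append(pair)
        # The chosen merge is applied to every language's corpus.
        corpora = {l: apply_merge(ws, pair) for l, ws in corpora.items()}
    return merges

# Hypothetical toy corpora: each word starts as a sequence of characters.
corpora = {
    "en": [list("lower"), list("lowest")],
    "fa": [list("salam"), list("salamat")],
}
print(parity_aware_bpe(corpora, num_merges=5))
```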