Angelika Romanou
agromanou.bsky.social
PhD candidate at EPFL doing research in #NLProc
👩🏻‍💻 https://agromanou.github.io/
🙋🏻‍♀️
December 2, 2024 at 4:29 PM
👏 As well as the fantastic multilingual research community that helped us collect and validate INCLUDE!
December 2, 2024 at 3:53 PM
🙏 We thank our amazing core team and advisors:
@negarforoutan.bsky.social, Anna Sotnikova, @eric-zemingchen.bsky.social, Sree Harsha Nelaturu, Shivalika Singh, Rishabh Maheshwary, Micol Altomare, Mohamed A Haggag, Imanol Schlag, @mziizm.bsky.social, @sarahooker.bsky.social, @abosselut.bsky.social
December 2, 2024 at 3:53 PM
For easy evaluation, we provide the following subsets:
INCLUDE-base: up to 550 samples per language, totaling ~23K questions
🤗 : huggingface.co/datasets/Coh...
INCLUDE-lite: up to 250 samples per language, totaling ~11K questions
🤗 : huggingface.co/datasets/Coh...
December 2, 2024 at 3:53 PM
🤝 Information is transferred across languages of the same script, though untrained languages might also excel due to potential data contamination.

🌎 Models can struggle with non-English instructions, entangling knowledge evaluation with other factors such as task formatting.
December 2, 2024 at 3:53 PM
Analysis shows:
📚 Models have a long way to go in capturing the regional knowledge reflected in languages.

💪 Model scale improves regional knowledge understanding, but other techniques like CoT or instruction tuning have minimal or negative impacts.
December 2, 2024 at 3:53 PM
To build INCLUDE, we collected ~200K multiple-choice questions (MCQs) spanning 44 languages and 58 knowledge domains, drawn from local sources in 52 countries and representing a rich array of cultural and regional knowledge.
December 2, 2024 at 3:53 PM
🤔 Why is regional knowledge so important?

Users expect #LLMs to know information relevant to their environments: customs, culture, etc.
To be relevant & relatable, LLMs need to know these nuances. It's not just global knowledge; it's about meeting user needs where they are.
December 2, 2024 at 3:53 PM
🌍 First, what is regional knowledge?

It's the local info, culture & practices of a regional context. US law is a great topic, but it is less relevant for multilingual LLMs serving other regions.

For INCLUDE, we collect regional knowledge rather than translating Western-centric benchmarks.
December 2, 2024 at 3:53 PM
🙋🏻‍♀️
November 26, 2024 at 4:58 PM
🙋🏻‍♀️
November 23, 2024 at 10:35 AM
🙋🏻‍♀️
November 22, 2024 at 5:49 PM