If you’re attending the conference, don’t miss the chance to explore our work and connect with our team.
🔬We brought the rigor from Machine Translation evaluation to multilingual LLM benchmarking and organized the WMT25 Multilingual Instruction Shared Task spanning 30 languages and 5 subtasks.
What if we optimized prompts instead of completions?
That’s the focus of our most recent work on prompt space optimization for multilingual synthetic data🗣️
Trained on 15T tokens in 1,000+ languages, it’s built for transparency, responsibility & the public good.
Read more: actu.epfl.ch/news/apertus...
💭This paper has had an interesting journey. Come find out and discuss it with us! @swetaagrawal.bsky.social @kocmitom.bsky.social
Side note: being a parent in research does have its perks, poster transportation solved ✅
And we’re hiring a Senior Research Scientist to co-create with us.
If you believe in research as a shared, global effort — this is your chance.
Introducing Fusion-of-N: a simple and powerful way to advance inference and distillation beyond Best-of-N.
Check out our latest work that builds on this insight. 👇
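For anyone new to the baseline: Best-of-N means sampling N candidate completions and keeping the one a scorer ranks highest. A minimal sketch under that reading, with toy stand-ins for the generator and scorer (these placeholders are not our models, and this is the baseline, not the Fusion-of-N method itself):

```python
import random
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 8) -> str:
    """Best-of-N baseline: sample n completions, return the highest-scoring one."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))

# Toy stand-ins so the sketch runs end to end; swap in a real model and a reward model or judge.
toy_generate = lambda p: p + " -> " + random.choice(["draft A", "draft B", "draft C"])
toy_score = lambda p, c: random.random()

print(best_of_n("Translate 'hello' to French.", toy_generate, toy_score, n=4))
```

Fusion-of-N goes beyond this pick-one-and-discard-the-rest recipe; the paper has the details.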
Entry points matter.
We started the Scholars Program 3 years ago to give new researchers a real shot — excited to open applications for year 4✨
This is your chance to collaborate with some of the brightest minds in AI & chart new courses in ML research. Let's change the spaces where breakthroughs happen.
Apply by Aug 29.
New post in collaboration with AI Singapore explores why Elo falls short for AI leaderboards and how we can do better.
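For context, the Elo scheme these leaderboards borrow from chess treats every pairwise vote between two models as a game and nudges both ratings after each outcome. A minimal sketch of the textbook update rule (standard Elo, not the alternative the post argues for):

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One standard Elo update. score_a is 1.0 if A wins, 0.5 for a tie, 0.0 if A loses."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# Example: two models start at 1000; model A wins one head-to-head vote.
print(elo_update(1000.0, 1000.0, score_a=1.0))  # -> (1016.0, 984.0)
```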
Financial Assistance Application -- docs.google.com/forms/d/e/1F...
Volunteer Application -- docs.google.com/forms/d/e/1F...
Childcare Financial Assistance Application -- docs.google.com/forms/d/e/1F...
All due by July 31
Our latest work introduces a new inference-time scaling recipe that is sample-efficient, multilingual, and suitable for multi-task requirements. 🍋
Turns out that standard methods miss out on gains in non-English languages. We propose more robust alternatives.
Very proud of this work that our scholar Ammar led! 🚀
Where does the field stand today, and how do we progress from here?
This work led by @yongzx.bsky.social has answers! 👇
📏 Our comprehensive survey reveals that there is still a long way to go.
Our latest paper draws on our multi-year efforts with the wider research community to explore why this matters and how we can bridge the AI language gap.
Answers 👇
New preprint!
@yongzx.bsky.social has all the details 👇
We observe that reasoning language models finetuned only on English data are capable of zero-shot cross-lingual reasoning through a "quote-and-think" pattern.
However, this does not mean they reason the same way across all languages or in new domains.
[1/N]
So how fair—and scientifically rigorous—is today’s most widely used evaluation benchmark?
We took a deep dive into Chatbot Arena to find out. 🧵
Our new paper explores how we can improve generative evaluations for mLLMs by learning from machine translation (MT) evaluation practices. 🔎
🍬Bonus question: how can we spur research on evaluation of evaluations?
In our new paper, we experimentally illustrate common evaluation issues and show how structured evaluation design, transparent reporting, and meta-evaluation can help us build stronger models.
arxiv.org/abs/2504.11829
🌍It reflects experiences from my personal research journey: coming from MT into multilingual LLM research, I missed reliable evaluations and evaluation research…
📌 Most VLM benchmarks are English-centric or rely on translations—missing linguistic & cultural nuance. Kaleidoscope expands in-language multilingual 🌎 & multimodal 👀 VLM evaluation.
Are you excited about multilingual evaluation, human judgment, or meta-eval? Come help us explore what a rigorous eval really looks like while questioning the status quo in LLM evaluation.
I’m looking for an intern (EU timezone preferred). Interested? Ping me!
💎: cohere.com/research/pap...
Abstracts due March 22 AoE (+48hr)
Full papers due March 28 AoE (+24hr)
Plz RT 🙏
🔔Also, a reminder to create your OpenReview profile if you haven't already. Non-institutional accounts require a verification process that can take time. One week till the abstract deadline!
March 12: dateful.com/eventlink/14...
March 13: dateful.com/eventlink/83...
Plz RT 🙏