Head of Cohere Labs
@Cohere_Labs @Cohere
PhD from @UvA_Amsterdam
https://marziehf.github.io/
This helped the model gradually learn structure, language, and fine-grained control.
NeoBabel generates well in every language. And it’s only 2B params—beating much larger models on benchmarks.
Captions are often noisy, shallow, or poorly formatted.
This limits who can use them and whose imagination they reflect.
We asked: can we build a small, efficient model that understands prompts in multiple languages natively?
- Prohibit post-submission score retraction
- Limit private variants per provider
- Deprecate models equitably
- Ensure fair sampling across providers
- Publicly log all model removals
While using Arena-style data in training boosts win rates by 112%, this improvement doesn't transfer to tasks like MMLU, indicating overfitting to Arena's quirks rather than general performance gains.
Google & OpenAI received ~40% of all Arena battle data. In contrast, 83 open-weight models collectively got <30%. This open and free benchmark disproportionately benefits private providers.
205 models were silently removed, many of them open. This breaks the assumptions of Arena’s Bradley-Terry scoring algorithm when prompt types change over time, making the leaderboard fragile and biased.
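For context: Bradley-Terry fits one latent strength per model from pairwise win/loss records, assuming those strengths are fixed and all battles are comparable. Here's a minimal toy fit via gradient ascent on the BT log-likelihood (my own sketch, not Arena's implementation):

```python
import numpy as np

def fit_bradley_terry(battles, n_models, lr=1.0, steps=500):
    """battles: iterable of (winner, loser) index pairs."""
    theta = np.zeros(n_models)  # log-strength per model
    for _ in range(steps):
        grad = np.zeros(n_models)
        for w, l in battles:
            # P(winner beats loser) under current ratings
            p = 1.0 / (1.0 + np.exp(theta[l] - theta[w]))
            grad[w] += 1.0 - p   # push winner up by the "surprise"
            grad[l] -= 1.0 - p   # push loser down by the same amount
        theta += lr * grad / len(battles)
        theta -= theta.mean()    # only rating differences matter
    return theta

# Toy example: model 0 beats model 1 in 7 of 10 battles
battles = [(0, 1)] * 7 + [(1, 0)] * 3
print(fit_bradley_terry(battles, n_models=2))
```

The trouble: a silently removed model's rating is frozen on yesterday's prompt mix while surviving models keep accumulating battles on today's, so the fitted strengths are no longer measured on the same distribution.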
Our simulations show that a weaker model family can outrank a stronger one by testing more variants and publishing the top performer.
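To see why best-of-N publication inflates rankings, here's a toy Monte Carlo with made-up numbers (an illustration of the selection effect, not the paper's actual simulation): a family that is truly 30 points weaker privately tests 10 variants and publishes only the best observed score.

```python
import numpy as np

rng = np.random.default_rng(0)

TRUE_STRONG = 1230.0   # hypothetical true score of the stronger family
TRUE_WEAK   = 1200.0   # hypothetical true score of the weaker family
NOISE       = 20.0     # assumed noise in an observed leaderboard score
N_VARIANTS  = 10       # private variants the weaker family tests
TRIALS      = 100_000

# The stronger family submits once; the weaker family keeps its best of N.
strong = rng.normal(TRUE_STRONG, NOISE, size=TRIALS)
weak_best = rng.normal(TRUE_WEAK, NOISE, size=(TRIALS, N_VARIANTS)).max(axis=1)

print(f"weaker family ranks higher in {(weak_best > strong).mean():.0%} of trials")
```

With these assumed numbers the weaker family lands on top roughly half the time: taking the max over noisy measurements converts variance into apparent skill.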
- Undisclosed private model testing warps results
- Silent model deprecation undermines rank stability
- Data access disparities between providers enable overfitting
So how fair—and scientifically rigorous—is today’s most widely used evaluation benchmark?
We took a deep dive into Chatbot Arena to find out. 🧵
✨INCLUDE (spotlight) — models fail to grasp regional nuances across languages
💎To Code or Not to Code (poster) — code is key for generalizing beyond coding tasks
Time to move beyond English-centric evaluation. 🔥
🌍 18 languages (high-, mid-, and low-resource)
📚 21k questions (55% require image understanding)
🧪 STEM, social science, reasoning, and practical skills
A multilingual, multimodal model designed to understand across languages and modalities (text, images, etc.) to bridge the language gap and empower global users!
On a separate note, as much as I love Amsterdam, I'm mountain-deprived and only have eyes for this glorious view this week.
and the research team at EPFL.
Check out the paper here: arxiv.org/abs/2411.19799
and the benchmarks here:
hf.co/datasets/Coh...
hf.co/datasets/Coh...