Marzieh Fadaee
@mziizm.bsky.social
seeks to understand language.

Head of Cohere Labs
@Cohere_Labs @Cohere
PhD from @UvA_Amsterdam

https://marziehf.github.io/
I'm excited to share that I'll be stepping into the role of Head of @cohereforai.bsky.social. It's an honor and a responsibility to lead such an extraordinary group of researchers pushing the boundaries of AI research.
September 5, 2025 at 5:26 PM
ACL day 2 ✨
July 29, 2025 at 6:55 AM
One of my favorite parts of NeoBabel is multilingual inpainting & extrapolation: you can mask part of an image generated in language A, prompt it in language B, and it fills in the scene naturally—no special tuning needed.
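Mechanically, inpainting composites freshly generated content into the original image under a binary mask; here is a toy pure-Python sketch of that compositing step (the `generate_region` callable is a hypothetical stand-in for the model, which in NeoBabel can be prompted in any language):

```python
def inpaint(image, mask, generate_region):
    """Replace masked pixels with newly generated content.

    image: 2D list of pixel values; mask: 2D list of 0/1 flags;
    generate_region: hypothetical stand-in for the generative model,
    called with the image after the masked region has been blanked.
    """
    # Blank out the masked region before handing the image to the model.
    masked = [[0 if m else px for px, m in zip(row, mrow)]
              for row, mrow in zip(image, mask)]
    generated = generate_region(masked)
    # Keep original pixels outside the mask, generated pixels inside it.
    return [[g if m else px for px, g, m in zip(row, grow, mrow)]
            for row, grow, mrow in zip(image, generated, mask)]

# Toy demo: a "generator" that paints everything 1; only the mask changes.
img = [[0] * 4 for _ in range(4)]
mask = [[0] * 4 for _ in range(4)]
mask[1][1] = mask[1][2] = mask[2][1] = mask[2][2] = 1
out = inpaint(img, mask, lambda m: [[1] * 4 for _ in range(4)])
```

The language-agnostic part is that `generate_region` only sees pixels, so nothing in the compositing step depends on which language the prompt was written in.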
July 9, 2025 at 1:27 PM
We used a multistage training setup: starting from class-label grounding, then scaling up to massive multilingual image-text pairs, and finally instruction tuning with high-res, diverse prompts.

This helped the model gradually learn structure, language, and fine-grained control.
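The staged curriculum can be sketched as a simple schedule that carries the checkpoint forward between stages (stage names, resolutions, and goals below are illustrative, not exact hyperparameters from the paper):

```python
# Hypothetical sketch of a NeoBabel-style progressive training curriculum.
stages = [
    {"name": "pixel_grounding", "data": "class-label image pairs",
     "resolution": 256, "goal": "learn basic visual structure"},
    {"name": "multilingual_scaling", "data": "large multilingual image-text pairs",
     "resolution": 256, "goal": "align multiple languages with images"},
    {"name": "instruction_tuning", "data": "high-res, diverse prompts",
     "resolution": 512, "goal": "fine-grained controllable generation"},
]

def run_curriculum(stages, train_stage):
    """Run each stage in order, passing the checkpoint to the next stage."""
    ckpt = None
    for stage in stages:
        ckpt = train_stage(ckpt, stage)
    return ckpt

# Toy train_stage: records the order stages were applied instead of training.
log = []
run_curriculum(stages, lambda ckpt, s: (log.append(s["name"]) or s["name"]))
```

The point of the structure is that each stage inherits the previous stage's weights, so easier objectives scaffold the harder ones.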
July 9, 2025 at 1:27 PM
We put a lot of effort into building a clean, well-aligned multilingual dataset (124M image-text pairs across 6 languages) and it paid off.

NeoBabel generates well in every language. And it’s only 2B params—beating much larger models on benchmarks.
July 9, 2025 at 1:27 PM
This was my first project in image generation, and coming from language research I was shocked at how little care goes into text quality in many vision datasets: captions are often noisy, shallow, or poorly formatted.
July 9, 2025 at 1:27 PM
🖼️ Most text-to-image models only really work in English.
This limits who can use them and whose imagination they reflect.

We asked: can we build a small, efficient model that understands prompts in multiple languages natively?
July 9, 2025 at 1:27 PM
Everyone talks about GEB (I agree, it's a gem), but Hofstadter's book on analogy is criminally underrated. If you're working on intelligence through language understanding, it's a must-read.
June 29, 2025 at 10:12 AM
London has me under its spell. every. single. visit.
June 9, 2025 at 5:47 PM
7/ 🛠️ We offer 5 concrete fixes:

- Prohibit post-submission score retraction
- Limit private variants per provider
- Deprecate models equitably
- Ensure fair sampling across providers
- Publicly log all model removals
April 30, 2025 at 12:53 PM
6/ ✨ Arena's new prompts aren't as fresh and unseen as expected.

While using Arena-style data in training boosts win rates by 112%, this improvement doesn't transfer to tasks like MMLU, indicating overfitting to Arena's quirks rather than general performance gains.
April 30, 2025 at 12:53 PM
5/ 🧮 Who gets the data?

Google & OpenAI received ~40% of all Arena battle data. In contrast, 83 open-weight models collectively got <30%. This open and free benchmark disproportionately benefits private providers.
April 30, 2025 at 12:53 PM
4/ 🧹 Model deprecation is silent and skewed.

205 models were silently removed, many of them open-weight. Deprecation under a shifting prompt distribution breaks the assumptions of Arena's Bradley-Terry scoring model, making the leaderboard fragile and biased.
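Arena-style leaderboards are typically fit with a Bradley-Terry model over pairwise battle outcomes; a minimal sketch of the standard MM (Zermelo) fitting iteration, not Arena's actual code:

```python
# Minimal Bradley-Terry fit via the classic MM (Zermelo) iteration.
# wins[i][j] = number of battles model i won against model j.
def bradley_terry(wins, iters=1000):
    n = len(wins)
    p = [1.0] * n  # strength parameters, initialized uniformly
    for _ in range(iters):
        new_p = []
        for i in range(n):
            w_i = sum(wins[i])  # total wins for model i
            # Sum over opponents: games played / combined strength.
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(w_i / denom if denom else p[i])
        s = sum(new_p)
        p = [x * n / s for x in new_p]  # rescale (only ratios are identified)
    return p

# Toy example: A beats B 8/10 and B beats C 8/10, so we expect A > B > C.
wins = [[0, 8, 0],
        [2, 0, 8],
        [0, 2, 0]]
strengths = bradley_terry(wins)
```

Note that removing a model deletes its rows and columns of `wins`, which perturbs every remaining strength estimate through the shared comparison graph; that coupling is why silent deprecation makes the rankings fragile.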
April 30, 2025 at 12:53 PM
3/ 🔒 Disproportionate private testing skews the game.

Our simulations show that a weaker model family can outrank a stronger one by testing more variants and publishing the top performer.
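The variant-shopping effect can be reproduced in a few lines; the skill values and noise level below are made up for illustration and are not the paper's simulation parameters:

```python
import random

def published_score(true_skill, k, noise=0.05, rng=random):
    """Measured score of the best of k noisy variants of one model."""
    return max(true_skill + rng.gauss(0, noise) for _ in range(k))

rng = random.Random(0)
trials = 2000
# Weaker family (true skill 0.48) privately tests 10 variants and publishes
# the best; the stronger family (0.52) submits a single model.
weak_wins = sum(
    published_score(0.48, 10, rng=rng) > published_score(0.52, 1, rng=rng)
    for _ in range(trials)
)
flip_rate = weak_wins / trials  # how often the weaker family outranks
```

With these made-up numbers the weaker family wins the majority of trials purely by selecting the luckiest variant, which is the selection bias the thread describes.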
April 30, 2025 at 12:53 PM
2/ 🧪 With theory, simulations, and real-world experiments, we stress-tested Arena’s fairness and found:

- Undisclosed private model testing warps results

- Silent model deprecation undermines rank stability

- Data access disparities between providers enable overfitting
April 30, 2025 at 12:53 PM
1/ Science is only as strong as the benchmarks it relies on.

So how fair—and scientifically rigorous—is today’s most widely used evaluation benchmark?

We took a deep dive into Chatbot Arena to find out. 🧵
April 30, 2025 at 12:53 PM
Not in Singapore for #ICLR2025 but our lab’s work is! In particular, I am very proud of these collaborations:

✨INCLUDE (spotlight) — models fail to grasp regional nuances across languages

💎To Code or Not to Code (poster) — code is key for generalizing beyond coding tasks
April 22, 2025 at 8:15 AM
This isn’t just another benchmark. Kaleidoscope exposes critical gaps in today’s VLMs—especially for low-resource languages and vision+text questions.

Time to move beyond English-centric evaluation. 🔥
April 10, 2025 at 7:52 PM
Very excited to release Kaleidoscope—a multilingual, multimodal evaluation set for VLMs, built as part of our open-science initiative!

🌍 18 languages (high-, mid-, and low-resource)
📚 21k questions (55% require image understanding)
🧪 STEM, social science, reasoning, and practical skills
April 10, 2025 at 7:52 PM
Good morning Paris
March 12, 2025 at 8:34 AM
✨👓 Aya Vision is here 👓✨

A multilingual, multimodal model designed to understand across languages and modalities (text, images, etc.) to bridge the language gap and empower global users!
March 4, 2025 at 5:11 PM
This #Neurips2024 was the perfect way to end this year. So long Vancouver!
December 16, 2024 at 9:47 PM
Day 2 #neurips2024, let the fun officially begin.

On a separate note, as much as I love Amsterdam I'm mountain-deprived and only have eyes for this glorious view this week.
December 11, 2024 at 5:30 PM
Good morning Vancouver! It's a lovely day to see old friends and make new ones.
December 10, 2024 at 3:51 PM
This was a fantastic collaboration with @agromanou.bsky.social @abosselut.bsky.social
and the research team at EPFL.

Check out the paper here: arxiv.org/abs/2411.19799
and the benchmarks here:
hf.co/datasets/Coh...
hf.co/datasets/Coh...
December 3, 2024 at 12:27 PM