Julia Kreutzer
@juliakreutzer.bsky.social
NLP & ML research @cohereforai.bsky.social 🇨🇦
Ready for our poster today at #COLM2025!
💭This paper has had an interesting journey, come find out and discuss with us! @swetaagrawal.bsky.social @kocmitom.bsky.social
Side note: being a parent in research does have its perks, poster transportation solved ✅
October 8, 2025 at 12:16 PM
🤔Yes, none of these principles are novel, nor are the techniques particularly sophisticated.
Despite their effectiveness, none of them are standard practice.
✔️We’ve compiled a checklist to help incorporate them in model evaluations.
April 17, 2025 at 10:56 AM
(5) Advancing reproducibility through transparency 🪟
Current mLLM evaluations are nearly impossible to reproduce, because evaluation configurations (including the task formulation, as in the example below) are rarely made transparent. We argue for open evaluation releases that include model outputs and their scores.
April 17, 2025 at 10:56 AM
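To make the above concrete, here is a minimal Python sketch of what an open evaluation release could look like: one JSONL record per model output, carrying the prompt, the evaluation configuration, the output, and its score, so that others can re-score or re-aggregate without re-running the model. All field names and values are invented for illustration, not the format used in the paper.

import json

# One record per (prompt, model) pair; every field that affects the result is
# stored explicitly so the evaluation can be reproduced or re-aggregated.
records = [
    {
        "prompt_id": "example-de-0001",          # hypothetical ID scheme
        "language": "de",
        "task_formulation": "open-ended",        # vs. e.g. "multiple-choice"
        "prompt": "Erkläre den Treibhauseffekt in zwei Sätzen.",
        "model": "model-A",
        "decoding": {"temperature": 0.0, "max_tokens": 256},
        "output": "Der Treibhauseffekt ist ...",
        "judge": "llm-judge-v1",                 # hypothetical judge name
        "score": 4.0,
    },
]

with open("eval_release.jsonl", "w", encoding="utf-8") as f:
    for r in records:
        f.write(json.dumps(r, ensure_ascii=False) + "\n")

# Anyone can reload the release and recompute aggregates without the model:
with open("eval_release.jsonl", encoding="utf-8") as f:
    reloaded = [json.loads(line) for line in f]
print(sum(r["score"] for r in reloaded) / len(reloaded))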
(4) Conducting richer analyses 🔬
Aggregate benchmark metrics do not provide insights into what differentiates the outputs of two models, yet this is often the first step in human evaluation. For example, we can group evaluation prompts by length or category.
April 17, 2025 at 10:56 AM
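A minimal sketch of this kind of slicing, using invented pairwise judgments (category, prompt length in words, winner) rather than any real data: the same results are grouped by category and by prompt-length bucket before computing win rates.

from collections import defaultdict

# Hypothetical head-to-head judgments between models A and B.
results = [
    {"category": "reasoning", "prompt_len": 12, "winner": "A"},
    {"category": "reasoning", "prompt_len": 85, "winner": "B"},
    {"category": "creative",  "prompt_len": 30, "winner": "A"},
    {"category": "creative",  "prompt_len": 7,  "winner": "tie"},
]

def win_rate(subset, model="A"):
    # Fraction of non-tied comparisons in this subset won by `model`.
    decided = [r for r in subset if r["winner"] != "tie"]
    return sum(r["winner"] == model for r in decided) / len(decided) if decided else float("nan")

# Slice by category ...
by_category = defaultdict(list)
for r in results:
    by_category[r["category"]].append(r)
for cat, subset in by_category.items():
    print(f"{cat:<10} win rate of A: {win_rate(subset):.2f}")

# ... and by prompt-length bucket.
short_prompts = [r for r in results if r["prompt_len"] <= 20]
long_prompts = [r for r in results if r["prompt_len"] > 20]
print(f"short prompts: {win_rate(short_prompts):.2f}, long prompts: {win_rate(long_prompts):.2f}")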
(3) Aggregating responsibly 🏗️
How we aggregate results across tasks and languages informs the interpretation of model comparisons. Uniform weighting is not necessarily fair due to differences in training distribution (e.g. language or task support).
April 17, 2025 at 10:56 AM
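A tiny illustration of how the aggregation choice changes the headline number; all accuracies and weights are made up, and the support-weighted average below is just one possible alternative, not a scheme prescribed by the paper.

# Hypothetical per-language accuracies for one model.
per_language = {"en": 0.82, "de": 0.74, "sw": 0.41, "yo": 0.35}

# Uniform macro-average: every language counts equally, regardless of whether
# the model was actually trained to support it.
uniform = sum(per_language.values()) / len(per_language)

# One possible alternative: weight each language by the model's claimed support
# (weights invented for illustration). Reporting per-language results alongside
# any aggregate is often the safest choice.
support_weight = {"en": 1.0, "de": 1.0, "sw": 0.5, "yo": 0.25}
weighted = (sum(per_language[l] * support_weight[l] for l in per_language)
            / sum(support_weight.values()))

print(f"uniform macro-average:    {uniform:.3f}")
print(f"support-weighted average: {weighted:.3f}")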
(2) Measuring significance, power and effect size 🔋
Generative evaluations for mLLMs rarely consider the significance of results, the statistical power of the test setup, or effect sizes. We illustrate how these can help report model differences more meaningfully.
April 17, 2025 at 10:56 AM
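As a sketch of the kind of analysis meant here (not the paper's exact procedure): a one-sided paired bootstrap over prompts, in the style popularised for MT by Koehn (2004), plus a paired effect size. The scores below are simulated; in practice, plug in per-prompt metric or judge scores for two models on the same prompts. A power analysis would additionally ask how many prompts are needed to reliably detect an effect of a given size.

import random

random.seed(0)

# Simulated per-prompt scores for two models on the same 500 prompts.
scores_a = [random.gauss(0.62, 0.20) for _ in range(500)]
scores_b = [random.gauss(0.60, 0.20) for _ in range(500)]

def paired_bootstrap_pvalue(a, b, n_resamples=2000):
    # Resample prompts with replacement and count how often the mean score
    # difference comes out with the opposite sign to the observed one.
    n = len(a)
    observed = sum(x - y for x, y in zip(a, b)) / n
    sign = 1 if observed >= 0 else -1
    opposite = 0
    for _ in range(n_resamples):
        idx = [random.randrange(n) for _ in range(n)]
        diff = sum(a[i] - b[i] for i in idx) / n
        if sign * diff <= 0:
            opposite += 1
    return opposite / n_resamples

def paired_effect_size(a, b):
    # Cohen's d for paired data: mean per-prompt difference divided by the
    # standard deviation of those differences.
    diffs = [x - y for x, y in zip(a, b)]
    mean = sum(diffs) / len(diffs)
    var = sum((d - mean) ** 2 for d in diffs) / (len(diffs) - 1)
    return mean / var ** 0.5

print("p (one-sided paired bootstrap):", paired_bootstrap_pvalue(scores_a, scores_b))
print("effect size (paired Cohen's d):", round(paired_effect_size(scores_a, scores_b), 3))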
(1) Treating synthetic data with care 💅
Translations are a common way to expand evaluation sets to new languages. We demonstrate that prompt translation can shift win rates, with the magnitude depending on translation quality and on the generative models being compared.
April 17, 2025 at 10:56 AM
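A toy illustration of the quantity being measured here, with all judgments invented: the win rate of model A over model B computed once with the original prompts and once with machine-translated versions of the same prompts, plus the shift between the two.

def win_rate(winners, model="A"):
    # Fraction of non-tied head-to-head judgments won by `model`.
    decided = [w for w in winners if w != "tie"]
    return sum(w == model for w in decided) / len(decided)

# Hypothetical judgments over the same prompts, once in the original language
# and once after machine-translating the prompts.
original_winners   = ["A", "A", "B", "A", "tie", "A", "B", "A"]
translated_winners = ["A", "B", "B", "A", "B",   "A", "B", "tie"]

wr_orig = win_rate(original_winners)
wr_trans = win_rate(translated_winners)
print(f"win rate of A (original prompts):   {wr_orig:.2f}")
print(f"win rate of A (translated prompts): {wr_trans:.2f}")
print(f"shift introduced by translation:    {wr_trans - wr_orig:+.2f}")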
📖New preprint with Eleftheria Briakou @swetaagrawal.bsky.social @mziizm.bsky.social @kocmitom.bsky.social!
arxiv.org/abs/2504.11829
🌍It reflects experiences from my personal research journey: coming from MT into multilingual LLM research, I missed reliable evaluations and evaluation research…
April 17, 2025 at 10:56 AM