We introduce ✨local metric accuracy✨ to show how metric reliability can vary across settings.
(🧵2/9)
Metric accuracy here means how often a metric scores an output y above a degraded version y′.
🌍 Global accuracy averages this over all outputs.
🔎 Local accuracy zooms in on a specific context (like a model, domain, or quality level).
Contexts are just meaningful slices of your data.
(🧵3/9)
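To make the global/local split concrete, here's a minimal Python sketch. It is not the paper's implementation: the (score for y, score for y′, context) triple format, the context labels, and the numbers are all assumptions for illustration.

```python
from collections import defaultdict

def accuracy(pairs):
    """Fraction of pairs where the metric scores y above the degraded y'."""
    return sum(s_y > s_yp for s_y, s_yp, _ in pairs) / len(pairs)

def local_accuracies(pairs):
    """The same accuracy, computed separately within each context slice."""
    slices = defaultdict(list)
    for pair in pairs:
        slices[pair[2]].append(pair)
    return {ctx: accuracy(members) for ctx, members in slices.items()}

# Hypothetical (metric score for y, metric score for y', context) triples.
pairs = [
    (0.9, 0.7, "system_A"),  # metric correctly prefers y
    (0.6, 0.8, "system_A"),  # metric wrongly prefers the degraded y'
    (0.8, 0.5, "system_B"),
    (0.7, 0.4, "system_B"),
]
print(accuracy(pairs))          # global: 0.75
print(local_accuracies(pairs))  # local: {'system_A': 0.5, 'system_B': 1.0}
```

The same metric can look fine on average while being unreliable on one slice, which is exactly the gap local accuracy is meant to expose.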
We create y′ with automatic perturbations that simulate realistic degradations.
(🧵4/9)
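For illustration only, a toy perturbation along these lines. The paper's perturbations are task-specific and richer; `drop_words` is a made-up stand-in.

```python
import random

def drop_words(text: str, p: float = 0.2, seed: int = 1) -> str:
    """Return a degraded y' by dropping each word with probability p."""
    rng = random.Random(seed)
    kept = [w for w in text.split() if rng.random() >= p]
    return " ".join(kept) if kept else text  # avoid returning an empty string

y = "the quick brown fox jumps over the lazy dog"
y_prime = drop_words(y)
print(y_prime)  # "quick brown fox jumps over the lazy"
```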
🧪 H1: The absolute local accuracy of a metric changes as the context changes.
🧪 H2: The relative local accuracy (how metrics rank against each other) also changes across contexts.
(🧵5/9)
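H2 in sketch form: compare how metrics rank within each context. The metric names, context labels, and accuracies below are invented purely to show the shape of the comparison.

```python
# Hypothetical per-context local accuracies for three metrics.
local_acc = {
    "context_A": {"metricX": 0.91, "metricY": 0.84, "metricZ": 0.80},
    "context_B": {"metricX": 0.72, "metricY": 0.88, "metricZ": 0.70},
}

for ctx, scores in local_acc.items():
    ranking = sorted(scores, key=scores.get, reverse=True)
    print(ctx, ranking)
# context_A ['metricX', 'metricY', 'metricZ']
# context_B ['metricY', 'metricX', 'metricZ']  <- the ranking shifts (H2)
```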
We test this on three tasks:
📝 Machine Translation (MT)
🎙 Automatic Speech Recognition (ASR)
📈 Ranking
We cover popular metrics like BLEU, COMET, BERTScore, WER, METEOR, nDCG, and more!
(🧵6/9)
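If you want to compute a couple of these yourself, BLEU and WER are available in common libraries. This sketch assumes `pip install sacrebleu jiwer`; it is not the paper's tooling.

```python
import sacrebleu  # corpus-level BLEU for MT
import jiwer      # word error rate for ASR

hyp = "the cat sat on the mat"
ref = "the cat is on the mat"

bleu = sacrebleu.corpus_bleu([hyp], [[ref]])  # one hypothesis, one reference stream
print(f"BLEU: {bleu.score:.1f}")

print(f"WER: {jiwer.wer(ref, hyp):.2f}")  # reference first, then hypothesis
```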
✅ H1 supported: Local accuracy varies a lot across systems and algorithms.
✅ H2 supported: Metric rankings shift between contexts.
🚨 Picking a metric based purely on global performance is risky!
Choose wisely. 🧙🏻‍♂️
(🧵7/9)
For ASR, the picture is different:
✅ H1 supported: Local accuracy still changes.
❌ H2 not supported: Metric rankings stay pretty stable.
This is probably because ASR outputs are less ambiguous, and metrics focus on similar properties, such as phonetic or lexical accuracy.
(🧵8/9)
When picking metrics, don’t rely on global scores alone.
🎯 Identify the evaluation context
🔍 Measure local accuracy
✅ Choose metrics that are stable and/or perform well in your context
♻️ Reevaluate as models and tasks evolve
📄 aclanthology.org/2025.finding...
#NAACL2025
(🧵9/9)