Athiya Deviyani
@athiya.bsky.social
LTI PhD at CMU on evaluation and trustworthy ML/NLP, prev AI&CS Edinburgh University, Google, YouTube, Apple, Netflix. Views are personal 👩🏻‍💻🇮🇩

athiyadeviyani.github.io
Thank you for the repost 🤗
April 29, 2025 at 6:11 PM
🔑 So what now?
When picking metrics, don’t rely on global scores alone.
🎯 Identify the evaluation context
🔍 Measure local accuracy
✅ Choose metrics that are stable and/or perform well in your context (see the sketch after this checklist)
♻️ Reevaluate as models and tasks evolve
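A minimal sketch of what step 3 could look like in Python, assuming you have already measured local accuracy per metric and per context; the dict structure and toy numbers below are ours, not the paper's:

```python
def pick_metric(local_acc):
    """local_acc: {metric_name: {context: local_accuracy}}
    (hypothetical structure). Prefer the metric whose worst-case
    local accuracy across your contexts is highest, i.e. a stable
    choice rather than the best global average."""
    return max(local_acc, key=lambda m: min(local_acc[m].values()))

# Toy numbers, not results from the paper:
print(pick_metric({"COMET": {"news": 0.91, "chat": 0.74},
                   "BERTScore": {"news": 0.88, "chat": 0.86}}))
# -> BERTScore: weaker best case, but more stable across contexts
```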

📄 aclanthology.org/2025.finding...
#NAACL2025

(🧵9/9)
Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy
Athiya Deviyani, Fernando Diaz. Findings of the Association for Computational Linguistics: NAACL 2025.
aclanthology.org
April 29, 2025 at 5:10 PM
For ASR:
✅ H1 supported: Local accuracy still changes.
❌ H2 not supported: Metric rankings stay pretty stable.
This is likely because ASR outputs are less ambiguous and ASR metrics capture similar properties, such as phonetic or lexical accuracy.

(🧵8/9)
April 29, 2025 at 5:10 PM
Here’s what we found for MT and Ranking:
✅ H1 supported: Local accuracy varies a lot across systems and algorithms.
✅ H2 supported: Metric rankings shift between contexts.

🚨 Picking a metric based purely on global performance is risky!

Choose wisely. 🧙🏻‍♂️

(🧵7/9)
April 29, 2025 at 5:10 PM
We evaluate this framework across three tasks:
📝 Machine Translation (MT)
🎙 Automatic Speech Recognition (ASR)
📈 Ranking

We cover popular metrics like BLEU, COMET, BERTScore, WER, METEOR, nDCG, and more!
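To make the sketches below concrete, here is how two of these metrics could be adapted to a single higher-is-better metric(reference, hypothesis) interface; the interface is our assumption (with the reference playing the role of x), and sacrebleu/jiwer are stand-ins for whatever implementations you use:

```python
import sacrebleu  # pip install sacrebleu
import jiwer      # pip install jiwer

def bleu_metric(reference, hypothesis):
    # Sentence-level BLEU; higher = better.
    return sacrebleu.sentence_bleu(hypothesis, [reference]).score

def wer_metric(reference, hypothesis):
    # WER is an error rate (lower = better), so negate it to keep
    # "higher score = better output" consistent across metrics.
    return -jiwer.wer(reference, hypothesis)
```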

(🧵6/9)
April 29, 2025 at 5:10 PM
We test two hypotheses:
🧪 H1: The absolute local accuracy of a metric changes as the context changes
🧪 H2: The relative local accuracy (how metrics rank against each other) also changes across contexts (sketch below)
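One way H2 could be checked, sketched in Python: rank metrics by local accuracy in each context and compare the rankings with Kendall's τ. The accuracy values below are toy numbers, not results from the paper:

```python
from scipy.stats import kendalltau

def ranking(acc):
    """Order metric names by local accuracy, best first."""
    return sorted(acc, key=acc.get, reverse=True)

# Toy local accuracies in two contexts c1 and c2:
acc_c1 = {"BLEU": 0.82, "COMET": 0.91, "BERTScore": 0.88}
acc_c2 = {"BLEU": 0.79, "COMET": 0.74, "BERTScore": 0.86}
r1, r2 = ranking(acc_c1), ranking(acc_c2)
tau, _ = kendalltau([r1.index(m) for m in acc_c1],
                    [r2.index(m) for m in acc_c1])
print(r1, r2, tau)  # a low tau means the metric ranking shifted (H2)
```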

(🧵5/9)
April 29, 2025 at 5:10 PM
More formally: given an input x, an output y from a context c, and a degraded version y′, we ask: how often does the metric score y higher than y′ across all inputs in the context c?

We create y′ using perturbations that simulate realistic degradations automatically.
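A minimal sketch of this computation, assuming a higher-is-better metric(x, y) interface and using one toy perturbation in place of the paper's degradations:

```python
import random

def drop_random_word(y):
    """Toy stand-in for the paper's automatic degradations:
    delete one randomly chosen word from the output."""
    words = y.split()
    if len(words) > 1:
        words.pop(random.randrange(len(words)))
    return " ".join(words)

def local_accuracy(metric, context_examples):
    """context_examples: (x, y) pairs drawn from one context c.
    Returns how often the metric scores y above its degraded y'."""
    wins = 0
    for x, y in context_examples:
        y_prime = drop_random_word(y)
        if metric(x, y) > metric(x, y_prime):
            wins += 1
    return wins / len(context_examples)
```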

(🧵4/9)
April 29, 2025 at 5:10 PM
🎯 Metric accuracy measures how often a metric picks the better system output.
🌍 Global accuracy averages this over all outputs.
🔎 Local accuracy zooms in on a specific context (like a model, domain, or quality level).

Contexts are just meaningful slices of your data.
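In code, the only difference between the two is the slice you average over; a sketch, with the (x, y, y', context) tuple layout as our assumption:

```python
from collections import defaultdict

def global_and_local_accuracy(metric, examples):
    """examples: (x, y, y_prime, context) tuples, where context is
    any label you slice on, e.g. a model, domain, or quality bin.
    Returns (global_accuracy, {context: local_accuracy})."""
    wins = defaultdict(list)
    for x, y, y_prime, c in examples:
        wins[c].append(metric(x, y) > metric(x, y_prime))
    local = {c: sum(w) / len(w) for c, w in wins.items()}
    pooled = [v for w in wins.values() for v in w]
    return sum(pooled) / len(pooled), local
```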

(🧵3/9)
April 29, 2025 at 5:10 PM
Most meta-evaluations look at global performance over arbitrary outputs. However, real-world use cases are highly contextual, tied to specific models or output qualities.

We introduce ✨local metric accuracy✨ to show how metric reliability can vary across settings.

(🧵2/9)
April 29, 2025 at 5:10 PM
🙋‍♀️
November 18, 2024 at 4:05 AM