Athiya Deviyani
@athiya.bsky.social
LTI PhD at CMU on evaluation and trustworthy ML/NLP, prev AI&CS Edinburgh University, Google, YouTube, Apple, Netflix. Views are personal 👩🏻‍💻🇮🇩

athiyadeviyani.github.io
Thank you for the repost 🤗
April 29, 2025 at 6:11 PM
🔑 So what now?
When picking metrics, don’t rely on global scores alone.
🎯 Identify the evaluation context
🔍 Measure local accuracy
✅ Choose metrics that are stable and/or perform well in your context (see the sketch after this checklist)
♻️ Reevaluate as models and tasks evolve
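A minimal sketch of what step 3 could look like in Python, assuming you have already measured local accuracy per metric and per context; the dict structure and toy numbers below are ours, not the paper's:

```python
def pick_metric(local_acc):
    """local_acc: {metric_name: {context: local_accuracy}}
    (hypothetical structure). Prefer the metric whose worst-case
    local accuracy across your contexts is highest, i.e. a stable
    choice rather than the best global average."""
    return max(local_acc, key=lambda m: min(local_acc[m].values()))

# Toy numbers, not results from the paper:
print(pick_metric({"COMET": {"news": 0.91, "chat": 0.74},
                   "BERTScore": {"news": 0.88, "chat": 0.86}}))
# -> BERTScore: weaker best case, but more stable across contexts
```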

📄 aclanthology.org/2025.finding...
#NAACL2025

(🧵9/9)
Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy
Athiya Deviyani, Fernando Diaz. Findings of the Association for Computational Linguistics: NAACL 2025.
aclanthology.org
April 29, 2025 at 5:10 PM
For ASR:
✅ H1 supported: Local accuracy still changes.
❌ H2 not supported: Metric rankings stay pretty stable.
This is likely because ASR outputs are less ambiguous and ASR metrics capture similar properties, such as phonetic or lexical accuracy.

(🧵8/9)
April 29, 2025 at 5:10 PM
Here’s what we found for MT and Ranking:
✅ H1 supported: Local accuracy varies a lot across systems and algorithms.
✅ H2 supported: Metric rankings shift between contexts.

🚨 Picking a metric based purely on global performance is risky!

Choose wisely. 🧙🏻‍♂️

(🧵7/9)
April 29, 2025 at 5:10 PM
We evaluate this framework across three tasks:
📝 Machine Translation (MT)
🎙 Automatic Speech Recognition (ASR)
📈 Ranking

We cover popular metrics like BLEU, COMET, BERTScore, WER, METEOR, nDCG, and more!
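To make the sketches below concrete, here is how two of these metrics could be adapted to a single higher-is-better metric(reference, hypothesis) interface; the interface is our assumption (with the reference playing the role of x), and sacrebleu/jiwer are stand-ins for whatever implementations you use:

```python
import sacrebleu  # pip install sacrebleu
import jiwer      # pip install jiwer

def bleu_metric(reference, hypothesis):
    # Sentence-level BLEU; higher = better.
    return sacrebleu.sentence_bleu(hypothesis, [reference]).score

def wer_metric(reference, hypothesis):
    # WER is an error rate (lower = better), so negate it to keep
    # "higher score = better output" consistent across metrics.
    return -jiwer.wer(reference, hypothesis)
```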

(🧵6/9)
April 29, 2025 at 5:10 PM
We test two hypotheses:
🧪 H1: The absolute local accuracy of a metric changes as the context changes
🧪 H2: The relative local accuracy (how metrics rank against each other) also changes across contexts (sketch below)
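One way H2 could be checked, sketched in Python: rank metrics by local accuracy in each context and compare the rankings with Kendall's τ. The accuracy values below are toy numbers, not results from the paper:

```python
from scipy.stats import kendalltau

def ranking(acc):
    """Order metric names by local accuracy, best first."""
    return sorted(acc, key=acc.get, reverse=True)

# Toy local accuracies in two contexts c1 and c2:
acc_c1 = {"BLEU": 0.82, "COMET": 0.91, "BERTScore": 0.88}
acc_c2 = {"BLEU": 0.79, "COMET": 0.74, "BERTScore": 0.86}
r1, r2 = ranking(acc_c1), ranking(acc_c2)
tau, _ = kendalltau([r1.index(m) for m in acc_c1],
                    [r2.index(m) for m in acc_c1])
print(r1, r2, tau)  # a low tau means the metric ranking shifted (H2)
```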

(🧵5/9)
April 29, 2025 at 5:10 PM
More formally: given an input x, an output y from a context c, and a degraded version y′, we ask: how often does the metric score y higher than y′ across all inputs in the context c?

We create y′ using perturbations that simulate realistic degradations automatically.
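A minimal sketch of this computation, assuming a higher-is-better metric(x, y) interface and using one toy perturbation in place of the paper's degradations:

```python
import random

def drop_random_word(y):
    """Toy stand-in for the paper's automatic degradations:
    delete one randomly chosen word from the output."""
    words = y.split()
    if len(words) > 1:
        words.pop(random.randrange(len(words)))
    return " ".join(words)

def local_accuracy(metric, context_examples):
    """context_examples: (x, y) pairs drawn from one context c.
    Returns how often the metric scores y above its degraded y'."""
    wins = 0
    for x, y in context_examples:
        y_prime = drop_random_word(y)
        if metric(x, y) > metric(x, y_prime):
            wins += 1
    return wins / len(context_examples)
```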

(🧵4/9)
April 29, 2025 at 5:10 PM
🎯 Metric accuracy measures how often a metric picks the better system output.
🌍 Global accuracy averages this over all outputs.
🔎 Local accuracy zooms in on a specific context (like a model, domain, or quality level).

Contexts are just meaningful slices of your data.
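In code, the only difference between the two is the slice you average over; a sketch, with the (x, y, y', context) tuple layout as our assumption:

```python
from collections import defaultdict

def global_and_local_accuracy(metric, examples):
    """examples: (x, y, y_prime, context) tuples, where context is
    any label you slice on, e.g. a model, domain, or quality bin.
    Returns (global_accuracy, {context: local_accuracy})."""
    wins = defaultdict(list)
    for x, y, y_prime, c in examples:
        wins[c].append(metric(x, y) > metric(x, y_prime))
    local = {c: sum(w) / len(w) for c, w in wins.items()}
    pooled = [v for w in wins.values() for v in w]
    return sum(pooled) / len(pooled), local
```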

(🧵3/9)
April 29, 2025 at 5:10 PM
Most meta-evaluations look at global performance over arbitrary outputs. However, real-world use cases are highly contextual, tied to specific models or output qualities.

We introduce ✨local metric accuracy✨ to show how metric reliability can vary across settings.

(🧵2/9)
April 29, 2025 at 5:10 PM
🙋‍♀️
November 18, 2024 at 4:05 AM