Vilém Zouhar #EMNLP
@zouharvi.bsky.social
PhD student @ ETH Zürich | all aspects of NLP but mostly evaluation and MT | go vegan | https://vilda.net
NLP evaluation is often detached from practical applications. Today I extrinsically evaluated one WMT25 translation system on the task of getting hair done without knowing Chinese.
Yes you got 67 BLEU points but is the resulting hair slaying? 💇
See the result on one datapoint (my head) at EMNLP.
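(For anyone curious what the intrinsic side of that comparison looks like, here is a minimal corpus BLEU sketch with sacrebleu; the sentences are made-up placeholders, not actual WMT25 data.)

```python
# Minimal corpus BLEU sketch with sacrebleu; placeholder sentences only.
import sacrebleu

hypotheses = ["Please trim the sides and keep the top long."]
references = [["Please cut the sides short and keep the top long."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"{bleu.score:.1f} BLEU")  # one number, silent about the haircut itself
```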
November 3, 2025 at 5:49 AM
The inspiration for the subset2evaluate poster comes from Henri Matisse's The Horse, the Rider and the Clown. 🐎🚴♀️🤡
October 28, 2025 at 5:13 PM
- How to Select Datapoints for Efficient Human Evaluation of NLG Models? arxiv.org/abs/2501.18251
- Estimating Machine Translation Difficulty arxiv.org/abs/2508.10175
- COMET-poly: Machine Translation Metric Grounded in Other Candidates arxiv.org/abs/2508.18549
October 28, 2025 at 9:45 AM
...really interesting research problems I was passionate about, and planning my research future.
You should apply to these fellowships, even if it's for the exercise of periodically refining your research statement.
October 24, 2025 at 12:32 PM
Congratulations, doctor! 🤓
October 22, 2025 at 4:14 PM
Organizers:
@pinzhen.bsky.social, @hanxuhu.bsky.social, @simi97k.bsky.social, Wenhao Zhu, @bazril.bsky.social, Alexandra Birch, @afaji.bsky.social, @ricosennrich.bsky.social, @sarahooker.bsky.social.
October 20, 2025 at 10:37 AM
... Further areas:
- Metrics, LLM judges & reward models 🧮
- Standardised multilingual reporting 📊
- AI-assisted evaluation (data, methods, metrics, standards) 🤖
- Position, application- or theory-focused contributions 💬
October 20, 2025 at 10:37 AM
... Complex & nuanced evaluation topics:
Multimodality 🎥
Fairness ⚖️
Long I/O 🧠
Tool use 🧰
Code-switching 🌍
Literary & creative tasks ✍️
Also:
- Sociocultural & cognitive variation
- Scalable evaluation of cultural & factual knowledge
October 20, 2025 at 10:37 AM
We welcome short & long, archival & non-archival submissions!
Topics include (but are not limited to):
- Evaluation resources beyond English or Western-centric views 🌐
- Annotation methodology & procedures ✏️
- Evaluation protocols: ranking vs. direct, rubric/reference-based, prompt variation, etc. ⚖️
October 20, 2025 at 10:37 AM
Reposted by Vilém Zouhar #EMNLP
Participation grew again this year, with 36 unique teams competing to improve MT performance. Furthermore, we added the collected outputs of 24 popular LLMs and online systems, reaching 50 evaluated systems in our annual benchmark.
August 23, 2025 at 9:28 AM
It gets worse the more you look at it. Why is the height of 69.1 the same as the height of 30.8? Why are the labels rotated if there's enough space?
August 7, 2025 at 7:32 PM
Organizers are happy to help with any questions. 🙂
Website with all details and contacts: www2.statmt.org/wmt25/mteval...
July 25, 2025 at 4:59 PM
📐Task 3: Quality-informed segment-level error correction
Automatically post-edit machine-translated text using quality annotations to generate minimal and accurate corrections.
Description: www2.statmt.org/wmt25/mteval...
Submission platform: www.codabench.org/competitions...
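(A hypothetical sketch of the task's shape: apply minimal edits to the MT output, touching only the annotated error spans. The span format here is an illustrative assumption; see the task description for the actual format.)

```python
# Hypothetical sketch of QE-informed error correction (not the official
# baseline). Spans are assumed to be (char start, char end, replacement).
def correct_segment(mt: str, error_spans: list[tuple[int, int, str]]) -> str:
    corrected = mt
    # Edit right-to-left so earlier indices stay valid after each replacement.
    for start, end, replacement in sorted(error_spans, reverse=True):
        corrected = corrected[:start] + replacement + corrected[end:]
    return corrected

mt = "The cat sat on the math."
print(correct_segment(mt, [(19, 23, "mat")]))  # -> "The cat sat on the mat."
```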
July 25, 2025 at 4:59 PM
📐Task 2: Span-level error detection
Identify and locate translation errors within each segment (start/end indices) and classify their severity.
Description: www2.statmt.org/wmt25/mteval...
Submission platform: www.codabench.org/competitions...
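(As a rough illustration of the prediction shape; the exact submission format is on the task page, this is just an assumed sketch.)

```python
# Assumed sketch of a span-level prediction: character start/end indices into
# the MT segment plus a severity label. Not the official submission format.
from dataclasses import dataclass

@dataclass
class ErrorSpan:
    start: int     # index of the first character of the error
    end: int       # index one past the last character
    severity: str  # e.g. "minor" or "major"

segment = "The cat sat on the math."
spans = [ErrorSpan(start=19, end=23, severity="major")]
for s in spans:
    print(repr(segment[s.start:s.end]), s.severity)  # 'math' major
```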
July 25, 2025 at 4:59 PM
📐Task 1: Segment-level quality score prediction
Predict a quality score for each source–target segment pair, using document-level context and either ESA or MQM annotations.
Description: www2.statmt.org/wmt25/mteval...
Submission platform: www.codabench.org/competitions...
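(For a sense of a simple starting point: a reference-free COMET model already produces segment-level scores. A minimal sketch with the unbabel-comet package; the model choice and fields are just one plausible setup, and it ignores the document-level context the task allows.)

```python
# Minimal baseline sketch (not the official baseline) with unbabel-comet.
# CometKiwi scores (src, mt) pairs without a reference; model name is an
# assumption and the checkpoint may require Hugging Face access.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-cometkiwi-da")
model = load_from_checkpoint(model_path)

data = [
    {"src": "Die Katze saß auf der Matte.", "mt": "The cat sat on the math."},
]
output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)  # one quality score per segment
```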
July 25, 2025 at 4:59 PM
Faster but an extra dependency. 🤷
July 18, 2025 at 2:59 PM
Not possible post-hoc but possible for the other direction! Thanks for your paper. 🙂
July 15, 2025 at 11:24 PM
Thank you everyone who helped. 😊
Special thanks to @mrinmaya.bsky.social and Peng Cui from @csateth.bsky.social and all my friends I bugged with proofreading. 😁
July 15, 2025 at 1:03 PM
"How to Select Datapoints for Efficient Human Evaluation of NLG Models?" has now been accepted to TACL (a)! 🌿
📃 Paper (with nuances and caveats): arxiv.org/abs/2501.182...
📦 Package: github.com/zouharvi/sub...
Feedback welcome!
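(If you just want the core idea before reading: one family of methods picks the test items where automatic metrics disagree most across systems, since those are the most informative to annotate. A toy sketch, illustrative only and not the subset2evaluate API.)

```python
# Toy sketch of variance-based subset selection (not the subset2evaluate API):
# send human annotators the items where metric scores disagree most.
import numpy as np

rng = np.random.default_rng(0)
metric_scores = rng.random((500, 10))  # hypothetical: 500 items x 10 systems

budget = 100  # items we can afford to annotate
informativeness = metric_scores.var(axis=1)  # disagreement across systems
selected = np.argsort(-informativeness)[:budget]
print(selected[:10])  # indices of the most informative items
```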
July 15, 2025 at 1:03 PM
"How to Select Datapoints for Efficient Human Evaluation of NLG Models?" has now been accepted to TACL (a)! 🌿
📃 Paper (with nuances and caveats ): arxiv.org/abs/2501.182...
📦 Package: github.com/zouharvi/sub...
Feedback welcome!
📃 Paper (with nuances and caveats ): arxiv.org/abs/2501.182...
📦 Package: github.com/zouharvi/sub...
Feedback welcome!