Vilém Zouhar #EMNLP
zouharvi.bsky.social
PhD student @ ETH Zürich | all aspects of NLP but mostly evaluation and MT | go vegan | https://vilda.net
Let's talk about eval (automatic or human) and multilinguality at #EMNLP in Suzhou! 🇨🇳

- Efficient evaluation (Nov 5, 16:30, poster session 3)
- MT difficulty (Nov 7, 12:30, findings 3)
- COMET-poly (Nov 8, 11:00, WMT)

(DM to meet 🌿 )
October 28, 2025 at 9:45 AM
Grateful to receive the Google PhD Fellowship in NLP! 🙂

I am not secretive about having applied to 4 similar fellowships during my PhD before and not succeeding. Still, refining my research statement (part of the application) helped me tremendously in finding out the...

inf.ethz.ch/news-and-eve...
Google PhD Fellowships 2025
Yutong Chen, Benedict Schlüter and Vilém Zouhar, all three of them doctoral students at the Department of Computer Science, have been awarded the Google PhD Fellowship. The programme was created to re...
inf.ethz.ch
October 24, 2025 at 12:32 PM
📢 Announcing the First Workshop on Multilingual and Multicultural Evaluation (MME) at #EACL2026 🇲🇦

MME focuses on resources, metrics & methodologies for evaluating multilingual systems! multilingual-multicultural-evaluation.github.io

📅 Workshop Mar 24–29, 2026
🗓️ Submit by Dec 19, 2025
October 20, 2025 at 10:37 AM
My two biggest takeaways are:
- Standard testsets are too easy (Figure 1).
- We can make testsets that are not easy (Figure 2). 😎
September 16, 2025 at 8:49 AM
Reposted by Vilém Zouhar #EMNLP
Participation kept growing this year: 36 unique teams competed to improve the performance of MT. Furthermore, we added the collected outputs of 24 popular LLMs and online systems, reaching 50 evaluated systems in our annual benchmark.
August 23, 2025 at 9:28 AM
The 2025 MT Evaluation shared task brings together the strengths of the previous Metrics and Quality Estimation tasks under a single, unified evaluation framework.

The following tasks are now open for participants (deadline July 31st but participation has never been easier 🙂 ):
July 25, 2025 at 4:59 PM
You have a budget to human-evaluate 100 inputs to your models, but your dataset is 10,000 inputs. Do not just pick 100 randomly!🙅

We can do better. "How to Select Datapoints for Efficient Human Evaluation of NLG Models?" shows how.🕵️
(random is still a devilishly good baseline)
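One simple alternative to uniform random sampling (a hypothetical illustration, not necessarily the paper's method): spend the human budget on the inputs where automatic metric scores disagree most across systems, since those are the most informative to annotate. The function name and the variance heuristic below are my own assumptions.

```python
import statistics

def select_for_human_eval(metric_scores, budget=100):
    """Pick the inputs whose automatic scores vary most across systems.

    metric_scores: dict mapping input_id -> list of per-system metric scores.
    (Illustrative sketch; the selection criterion is an assumption.)
    """
    ranked = sorted(
        metric_scores,
        key=lambda item: statistics.pvariance(metric_scores[item]),
        reverse=True,
    )
    return ranked[:budget]

# Toy usage: input "a" shows the most cross-system disagreement.
scores = {"a": [0.1, 0.9], "b": [0.5, 0.5], "c": [0.4, 0.6]}
print(select_for_human_eval(scores, budget=2))  # → ['a', 'c']
```

As the post notes, random selection remains a strong baseline; a heuristic like this is only worthwhile if it measurably beats it.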
July 15, 2025 at 1:03 PM
TIL that since Python 3.4 there's a `statistics` module in the standard library with mean, mode, quantiles, variance, covariance, correlation, z-scores, and more (some helpers arrived later, e.g. covariance and correlation in 3.10). No more needless numpy imports!
July 9, 2025 at 12:49 AM
Past iterations of the Terminology Shared Task don't come anywhere near the data quality and evaluation scrutiny of this one. In the era of LLM-as-MTs, participation has never been easier!
📣Take part in 3rd Terminology shared task @WMT!📣
This year:
👉5 language pairs: EN->{ES, RU, DE, ZH},
👉2 tracks - sentence-level and doc-level translation,
👉authentic data from 2 domains: finance and IT!

www2.statmt.org/wmt25/termin...

Don't miss the opportunity - we only run it once every two years😏
Terminology Translation Task
www2.statmt.org
July 7, 2025 at 2:34 PM
Thank you for your response. I will keep my score.
July 3, 2025 at 6:50 PM
For the longest time I've been using Google Translate as a gateway to explain machine translation concepts to people as it's a tool that everyone knows. Now I get to contribute over the summer. 🌞

If you're near Mountain View, let's talk evaluation. 📏
July 3, 2025 at 4:15 AM
The arXiv submission process got an update!
(still requires a manual bbl)
May 31, 2025 at 9:56 AM
Reposted by Vilém Zouhar #EMNLP
XCOMETs underperform because they do not match translators' subjective error-annotation propensity. Using the granular p(error) value from XCOMET significantly boosts their performance when calibration is possible → desirable for a fair evaluation 6/
May 30, 2025 at 2:28 PM
Reposted by Vilém Zouhar #EMNLP
Key takeaways for WQE evals:
1️⃣ Unsupervised WQE shows promise (esp. uncertainty-based metrics); interpretability-based approaches remain under-explored for MT
2️⃣ Calibration sets can help to ensure fair evaluations.
3️⃣ Use multiple annotators for robust rankings.

More info ➡️ arxiv.org/abs/2505.23183 8/8
Unsupervised Word-level Quality Estimation for Machine Translation Through the Lens of Annotators (Dis)agreement
Word-level quality estimation (WQE) aims to automatically identify fine-grained error spans in machine-translated outputs and has found many uses, including assisting translators during post-editing. ...
arxiv.org
May 30, 2025 at 2:28 PM
Reposted by Vilém Zouhar #EMNLP
📢 New paper: Can unsupervised metrics extracted from MT models detect their translation errors reliably? Do annotators even *agree* on what constitutes an error? 🧐

We compare uncertainty- and interp-based WQE metrics across 12 directions, with some surprising findings!

🧵 1/
May 30, 2025 at 2:28 PM
incredible monetization opportunity

(this is a joke)
May 14, 2025 at 8:52 AM
Ever looked down from a hot air balloon and despaired at how expensive it is to run thorough human evaluation of machine translation?
Fret no more and come tomorrow at 11:00 to Hall 3 #NAACL2025.
May 2, 2025 at 12:30 AM
Being in a hot air balloon in Albuquerque really makes one ponder *how to efficiently pick the best translation candidate without running expensive evaluation metrics on all of them.*

See you tomorrow at 9:00 in Hall 3 #NAACL2025.
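The generic idea behind efficient candidate selection can be sketched in two stages (a hypothetical illustration under my own assumptions, not the paper's actual algorithm): prune the pool with a cheap proxy score, then spend the expensive metric only on the shortlist.

```python
# Hypothetical two-stage selection sketch (illustrative, not the paper's
# method): rank all candidates with a cheap proxy, then run the expensive
# metric only on the top-k survivors.
def pick_best(candidates, cheap_score, expensive_score, k=5):
    shortlist = sorted(candidates, key=cheap_score, reverse=True)[:k]
    return max(shortlist, key=expensive_score)

# Toy usage with numeric "candidates" and identity scorers.
best = pick_best([0.2, 0.9, 0.5, 0.7], lambda c: c, lambda c: c, k=2)
print(best)  # 0.9
```

With n candidates and shortlist size k, the expensive metric runs k times instead of n, which is where the savings come from.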
May 2, 2025 at 12:26 AM
Let's chat at #NAACL2025 about evaluation et al! ⚗️
April 22, 2025 at 7:35 AM
(Automatic) span annotations are the future of evaluation and diagnosis in NLP! 🖊️
April 15, 2025 at 11:30 AM
Reposted by Vilém Zouhar #EMNLP
How do LLMs compare to human crowdworkers in annotating text spans? 🧑🤖

And how can span annotation help us with evaluating texts?

Find out in our new paper: llm-span-annotators.github.io

Arxiv: arxiv.org/abs/2504.08697
Large Language Models as Span Annotators
Website for the paper Large Language Models as Span Annotators
llm-span-annotators.github.io
April 15, 2025 at 11:10 AM
In the true sense of the word I am humbled to have been rejected by a few fellowships this year.
April 11, 2025 at 7:21 AM
Overall 3.0 (borderline reject) but with 43% tariff adjustments it's 4.5 (borderline award).
April 3, 2025 at 7:56 AM
Panopticon, but instead of prison cells, it's a stack of Overleaf tabs. You can’t watch them all at once, but at any moment, you *could* be watching any of them. And they know it.
March 25, 2025 at 9:48 AM