Vilém Zouhar #EMNLP
@zouharvi.bsky.social
PhD student @ ETH Zürich | all aspects of NLP but mostly evaluation and MT | go vegan | https://vilda.net
NLP evaluation is often detached from practical applications. Today I extrinsically evaluated one WMT25 translation system on the task of getting my hair done without knowing Chinese.

Yes, you got 67 BLEU points, but is the resulting hair slaying? 💇

See the result on one datapoint (my head) at EMNLP.
November 3, 2025 at 5:49 AM
The inspiration for the subset2evaluate poster comes from Henri Matisse's The Horse, the Rider and the Clown. 🐎🚴‍♀️🤡
October 28, 2025 at 5:13 PM
- How to Select Datapoints for Efficient Human Evaluation of NLG Models? arxiv.org/abs/2501.18251

- Estimating Machine Translation Difficulty arxiv.org/abs/2508.10175

- COMET-poly: Machine Translation Metric Grounded in Other Candidates arxiv.org/abs/2508.18549
October 28, 2025 at 9:45 AM
...really interesting research problems I was passionate about, and planning my research future.

You should apply to these fellowships, if only for the exercise of periodically refining your research statement.
October 24, 2025 at 12:32 PM
Congratulations, doctor! 🤓
October 22, 2025 at 4:14 PM
... Further areas:
- Metrics, LLM judges & reward models 🧮
- Standardised multilingual reporting 📊
- AI-assisted evaluation (data, methods, metrics, standards) 🤖
- Position, application- or theory-focused contributions 💬
October 20, 2025 at 10:37 AM
... Complex & nuanced evaluation topics:
- Multimodality 🎥
- Fairness ⚖️
- Long I/O 🧠
- Tool use 🧰
- Code-switching 🌍
- Literary & creative tasks ✍️

Also:
- Sociocultural & cognitive variation
- Scalable evaluation of cultural & factual knowledge
October 20, 2025 at 10:37 AM
We welcome short & long, archival & non-archival submissions!

Topics include (but are not limited to):
- Evaluation resources beyond English or Western-centric views 🌐
- Annotation methodology & procedures ✏️
- Evaluation protocols: ranking vs. direct, rubric/reference-based, prompt variation, etc. ⚖️
October 20, 2025 at 10:37 AM
Reposted by Vilém Zouhar #EMNLP
Participation grew this year, with 36 unique teams competing to improve MT performance. We also collected the outputs of 24 popular LLMs and online systems, reaching 50 evaluated systems in our annual benchmark.
August 23, 2025 at 9:28 AM
It gets worse the more you look at it. Why is the bar for 69.1 the same height as the bar for 30.8? Why are the labels rotated when there's enough space?
August 7, 2025 at 7:32 PM
Organizers are happy to help with any questions. 🙂
Website with all details and contacts: www2.statmt.org/wmt25/mteval...
July 25, 2025 at 4:59 PM
📐Task 3: Quality-informed segment-level error correction

Automatically post-edit machine-translated text using quality annotations to generate minimal and accurate corrections.

Description: www2.statmt.org/wmt25/mteval...

Submission platform: www.codabench.org/competitions...
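To make the I/O concrete, here is a minimal sketch of one correction record. The field names are my own illustrative assumption, not the official submission schema; see the task description for the real format.

```python
# Hypothetical record for quality-informed error correction (illustrative
# field names, not the official WMT25 schema): given an MT segment and its
# quality annotations, output a minimally edited correction.
record = {
    "target": "He goed to the store yesterday.",              # raw MT output
    "errors": [{"start": 3, "end": 7, "severity": "minor"}],  # span of "goed"
    "correction": "He went to the store yesterday.",          # minimal edit
}
print(record["correction"])
```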
July 25, 2025 at 4:59 PM
📐Task 2: Span-level error detection

Identify and locate translation errors within each segment (start/end indices) and classify their severity.

Description: www2.statmt.org/wmt25/mteval...

Submission platform: www.codabench.org/competitions...
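As a concrete sketch, a single span-level prediction might be represented like this (field names are my illustrative assumption, not the official submission format):

```python
# Hypothetical span-level error prediction for one segment (illustrative
# field names, not the official WMT25 submission schema).
prediction = {
    "segment_id": 17,
    "errors": [
        {
            "start": 12,          # character index where the error span starts
            "end": 25,            # character index just past the span's end
            "severity": "major",  # severity class, e.g. "minor" or "major"
        },
    ],
}
```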
July 25, 2025 at 4:59 PM
📐Task 1: Segment-level quality score prediction

Predict a quality score for each source–target segment pair, using document-level context and either ESA or MQM annotations.

Description: www2.statmt.org/wmt25/mteval...

Submission platform: www.codabench.org/competitions...
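A minimal sketch of the expected input/output shape, assuming a generic scoring function (the function and arguments are illustrative, not the official format):

```python
# Hypothetical shape of segment-level quality score prediction (illustrative,
# not the official WMT25 format). A real system would plug in a learned
# metric, e.g. a COMET-style model, in place of the placeholder below.
def predict_quality(source: str, target: str, context: list[str]) -> float:
    """Score one source-target pair, optionally using document context."""
    return 0.0  # placeholder: replace with a real quality estimation model

score = predict_quality(
    source="Guten Morgen.",  # source segment
    target="Good morning.",  # MT output to be scored
    context=[],              # preceding segments in the document
)
```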
July 25, 2025 at 4:59 PM
Faster but an extra dependency. 🤷
July 18, 2025 at 2:59 PM
Not possible post-hoc but possible for the other direction! Thanks for your paper. 🙂
July 15, 2025 at 11:24 PM
Thank you everyone who helped. 😊

Special thanks to @mrinmaya.bsky.social and Peng Cui from @csateth.bsky.social and all my friends I bugged with proofreading. 😁
July 15, 2025 at 1:03 PM
"How to Select Datapoints for Efficient Human Evaluation of NLG Models?" has now been accepted to TACL (a)! 🌿

📃 Paper (with nuances and caveats): arxiv.org/abs/2501.18251
📦 Package: github.com/zouharvi/sub...
Feedback welcome!
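For intuition, here is a toy sketch of one idea in this spirit: prefer test items where the systems' automatic metric scores disagree most, since those discriminate between systems better than a random subset. This is not the subset2evaluate API, just an illustration of the principle.

```python
# Toy illustration (not the subset2evaluate API): pick the test items whose
# automatic metric scores vary most across systems, as a proxy for how
# informative human judgments on each item would be.
import statistics

def select_subset(item_scores: dict[str, list[float]], budget: int) -> list[str]:
    """item_scores: item id -> one metric score per evaluated system.
    Returns the `budget` items with the highest cross-system variance."""
    ranked = sorted(
        item_scores,
        key=lambda item: statistics.variance(item_scores[item]),
        reverse=True,
    )
    return ranked[:budget]

scores = {
    "doc1": [0.71, 0.70, 0.69],  # systems agree -> less informative
    "doc2": [0.90, 0.55, 0.30],  # systems disagree -> more informative
}
print(select_subset(scores, budget=1))  # -> ['doc2']
```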
July 15, 2025 at 1:03 PM