Vilém Zouhar #EMNLP
@zouharvi.bsky.social
PhD student @ ETH Zürich | all aspects of NLP but mostly evaluation and MT | go vegan | https://vilda.net
NLP evaluation is often detached from practical applications. Today I extrinsically evaluated one WMT25 translation system on the task of getting hair done without knowing Chinese.
Yes you got 67 BLEU points but is the resulting hair slaying? 💇
See the result on one datapoint (my head) at EMNLP.
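(For anyone curious what the intrinsic side of that comparison looks like, here is a minimal corpus BLEU sketch with sacrebleu; the sentences are made-up placeholders, not actual WMT25 data.)

```python
# Minimal corpus BLEU sketch with sacrebleu; placeholder sentences only.
import sacrebleu

hypotheses = ["Please trim the sides and keep the top long."]
references = [["Please cut the sides short and keep the top long."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"{bleu.score:.1f} BLEU")  # one number, silent about the haircut itself
```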
November 3, 2025 at 5:49 AM
The inspiration for the subset2evaluate poster comes from Henri Matisse's The Horse, the Rider and the Clown. 🐎🚴♀️🤡
October 28, 2025 at 5:13 PM
- How to Select Datapoints for Efficient Human Evaluation of NLG Models? arxiv.org/abs/2501.18251
- Estimating Machine Translation Difficulty arxiv.org/abs/2508.10175
- COMET-poly: Machine Translation Metric Grounded in Other Candidates arxiv.org/abs/2508.18549
October 28, 2025 at 9:45 AM
...really interesting research problems I was passionate about, and planning my research future.
You should apply to these fellowships, even if it's for the exercise of periodically refining your research statement.
October 24, 2025 at 12:32 PM
Congratulations, doctor! 🤓
October 22, 2025 at 4:14 PM
Organizers:
@pinzhen.bsky.social, @hanxuhu.bsky.social, @simi97k.bsky.social, Wenhao Zhu, @bazril.bsky.social, Alexandra Birch, @afaji.bsky.social, @ricosennrich.bsky.social, @sarahooker.bsky.social.
October 20, 2025 at 10:37 AM
... Further areas:
- Metrics, LLM judges & reward models 🧮
- Standardised multilingual reporting 📊
- AI-assisted evaluation (data, methods, metrics, standards) 🤖
- Position, application- or theory-focused contributions 💬
October 20, 2025 at 10:37 AM
... Complex & nuanced evaluation topics:
Multimodality 🎥
Fairness ⚖️
Long I/O 🧠
Tool use 🧰
Code-switching 🌍
Literary & creative tasks ✍️
Also:
- Sociocultural & cognitive variation
- Scalable evaluation of cultural & factual knowledge
October 20, 2025 at 10:37 AM
We welcome short & long, archival & non-archival submissions!
Topics include (but are not limited to):
- Evaluation resources beyond English or Western-centric views 🌐
- Annotation methodology & procedures ✏️
- Evaluation protocols: ranking vs. direct, rubric/reference-based, prompt variation, etc. ⚖️
October 20, 2025 at 10:37 AM
Reposted by Vilém Zouhar #EMNLP
Participation grew again this year, with 36 unique teams competing to improve MT performance. Furthermore, we added the collected outputs of 24 popular LLMs and online systems, reaching 50 evaluated systems in our annual benchmark.
August 23, 2025 at 9:28 AM
It gets worse the more you look at it. Why is the height of 69.1 the same as the height of 30.8? Why are the labels rotated if there's enough space?
August 7, 2025 at 7:32 PM
Organizers are happy to help with any questions. 🙂
Website with all details and contacts: www2.statmt.org/wmt25/mteval...
July 25, 2025 at 4:59 PM
📐Task 3: Quality-informed segment-level error correction
Automatically post-edit machine-translated text using quality annotations to generate minimal and accurate corrections.
Description: www2.statmt.org/wmt25/mteval...
Submission platform: www.codabench.org/competitions...
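(A hypothetical sketch of the task's shape: apply minimal edits to the MT output, touching only the annotated error spans. The span format here is an illustrative assumption; see the task description for the actual format.)

```python
# Hypothetical sketch of QE-informed error correction (not the official
# baseline). Spans are assumed to be (char start, char end, replacement).
def correct_segment(mt: str, error_spans: list[tuple[int, int, str]]) -> str:
    corrected = mt
    # Edit right-to-left so earlier indices stay valid after each replacement.
    for start, end, replacement in sorted(error_spans, reverse=True):
        corrected = corrected[:start] + replacement + corrected[end:]
    return corrected

mt = "The cat sat on the math."
print(correct_segment(mt, [(19, 23, "mat")]))  # -> "The cat sat on the mat."
```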
July 25, 2025 at 4:59 PM
📐Task 2: Span-level error detection
Identify and locate translation errors within each segment (start/end indices) and classify their severity.
Description: www2.statmt.org/wmt25/mteval...
Submission platform: www.codabench.org/competitions...
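(As a rough illustration of the prediction shape; the exact submission format is on the task page, this is just an assumed sketch.)

```python
# Assumed sketch of a span-level prediction: character start/end indices into
# the MT segment plus a severity label. Not the official submission format.
from dataclasses import dataclass

@dataclass
class ErrorSpan:
    start: int     # index of the first character of the error
    end: int       # index one past the last character
    severity: str  # e.g. "minor" or "major"

segment = "The cat sat on the math."
spans = [ErrorSpan(start=19, end=23, severity="major")]
for s in spans:
    print(repr(segment[s.start:s.end]), s.severity)  # 'math' major
```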
July 25, 2025 at 4:59 PM
📐Task 1: Segment-level quality score prediction
Predict a quality score for each source–target segment pair, using document-level context and either ESA or MQM annotations.
Description: www2.statmt.org/wmt25/mteval...
Submission platform: www.codabench.org/competitions...
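(For a sense of a simple starting point: a reference-free COMET model already produces segment-level scores. A minimal sketch with the unbabel-comet package; the model choice and fields are just one plausible setup, and it ignores the document-level context the task allows.)

```python
# Minimal baseline sketch (not the official baseline) with unbabel-comet.
# CometKiwi scores (src, mt) pairs without a reference; model name is an
# assumption and the checkpoint may require Hugging Face access.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-cometkiwi-da")
model = load_from_checkpoint(model_path)

data = [
    {"src": "Die Katze saß auf der Matte.", "mt": "The cat sat on the math."},
]
output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)  # one quality score per segment
```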
July 25, 2025 at 4:59 PM
Faster but an extra dependency. 🤷
July 18, 2025 at 2:59 PM
Not possible post-hoc but possible for the other direction! Thanks for your paper. 🙂
July 15, 2025 at 11:24 PM
Thank you everyone who helped. 😊
Special thanks to @mrinmaya.bsky.social and Peng Cui from @csateth.bsky.social and all my friends I bugged with proofreading. 😁
July 15, 2025 at 1:03 PM
"How to Select Datapoints for Efficient Human Evaluation of NLG Models?" has now been accepted to TACL (a)! 🌿
📃 Paper (with nuances and caveats): arxiv.org/abs/2501.182...
📦 Package: github.com/zouharvi/sub...
Feedback welcome!
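(If you just want the core idea before reading: one family of methods picks the test items where automatic metrics disagree most across systems, since those are the most informative to annotate. A toy sketch, illustrative only and not the subset2evaluate API.)

```python
# Toy sketch of variance-based subset selection (not the subset2evaluate API):
# send human annotators the items where metric scores disagree most.
import numpy as np

rng = np.random.default_rng(0)
metric_scores = rng.random((500, 10))  # hypothetical: 500 items x 10 systems

budget = 100  # items we can afford to annotate
informativeness = metric_scores.var(axis=1)  # disagreement across systems
selected = np.argsort(-informativeness)[:budget]
print(selected[:10])  # indices of the most informative items
```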
July 15, 2025 at 1:03 PM
"How to Select Datapoints for Efficient Human Evaluation of NLG Models?" has now been accepted to TACL (a)! 🌿
📃 Paper (with nuances and caveats ): arxiv.org/abs/2501.182...
📦 Package: github.com/zouharvi/sub...
Feedback welcome!
📃 Paper (with nuances and caveats ): arxiv.org/abs/2501.182...
📦 Package: github.com/zouharvi/sub...
Feedback welcome!