Alex Gill
@agill32.bsky.social
NLP researcher at U of U
I'll be in Suzhou 🇨🇳 at #EMNLP this week presenting "What has been Lost with Synthetic Evaluation?", joint work with @anamarasovic.bsky.social & @lasha.bsky.social! 🎉

📍Findings Session 1 - Hall C
📅 Wed, November 5, 13:00 - 14:00

arxiv.org/abs/2505.22830
November 3, 2025 at 11:03 AM
But are these instances similarly difficult?

We explore the difficulty of synthetic benchmarks by comparing performance on synthetic & human-written data across a suite of models.

We find that model performance is consistently higher on the LLM-generated versions of the datasets.
June 4, 2025 at 10:24 PM
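[Editor's note: a minimal sketch of the comparison described in the post above, assuming per-instance accuracy as the performance metric. The predictor and both data splits are hypothetical placeholders, not the paper's evaluation harness.]

```python
# Illustrative sketch (not the paper's code): compare accuracy on
# human-written vs. LLM-generated instances and report the difficulty gap.

def accuracy(predict, dataset):
    """Fraction of (question, gold answer) pairs the predictor gets right."""
    correct = sum(predict(q) == a for q, a in dataset)
    return correct / len(dataset)

# Hypothetical per-split data: lists of (question, gold answer) pairs.
human_set = [("Q1", "no"), ("Q2", "yes"), ("Q3", "no")]
synthetic_set = [("Q1'", "yes"), ("Q2'", "yes"), ("Q3'", "no")]

def toy_predict(question):
    return "yes"  # stand-in for a real model call

gap = accuracy(toy_predict, synthetic_set) - accuracy(toy_predict, human_set)
# A consistently positive gap across a suite of models suggests the
# synthetic version of the benchmark is easier than the human-written one.
print(f"synthetic - human accuracy gap: {gap:+.3f}")
```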
We perform a human study and even find that LLM-generated data is preferred!

We ask NLP researchers to act as dataset creators and collect their preferences between synthetic and human-authored instances.
June 4, 2025 at 10:24 PM
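[Editor's note: a hedged sketch of how pairwise preference judgments like these could be aggregated. The `judgments` list is made up, and the binomial test is just one reasonable way to compare the preference rate against chance; it is not necessarily the paper's analysis.]

```python
# Aggregate pairwise preferences and test whether the rate at which
# annotators preferred the synthetic version differs from 50% chance.
from scipy.stats import binomtest

# Hypothetical judgments: which version each annotator preferred per pair.
judgments = ["synthetic", "human", "synthetic", "synthetic", "human",
             "synthetic", "synthetic", "human", "synthetic", "synthetic"]

n_synth = sum(j == "synthetic" for j in judgments)
result = binomtest(n_synth, n=len(judgments), p=0.5)
print(f"preferred synthetic in {n_synth}/{len(judgments)} comparisons "
      f"(p = {result.pvalue:.3f})")
```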
We examine both the 𝑣𝑎𝑙𝑖𝑑𝑖𝑡𝑦 and 𝑑𝑖𝑓𝑓𝑖𝑐𝑢𝑙𝑡𝑦 of LLM-generated versions of two high-quality reading comprehension datasets: CondaQA & DROP.

We find that validity is not an issue. We are able to get LLMs to generate instances that are highly valid according to our dataset specs.
June 4, 2025 at 10:24 PM
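[Editor's note: a rough illustration of spec-based validity checking, i.e. scoring the fraction of generated instances that satisfy a dataset's requirements. The single negation-cue rule below is a simplified stand-in for CondaQA's actual annotation guidelines; all names and data are hypothetical.]

```python
# Score the fraction of LLM-generated instances that pass a dataset-spec
# check. Here the toy spec requires each passage to contain a negation cue,
# loosely echoing CondaQA's focus on reasoning about negation.

NEGATION_CUES = {"not", "no", "never", "without"}

def is_valid(instance):
    tokens = instance["passage"].lower().split()
    return any(cue in tokens for cue in NEGATION_CUES)

# Hypothetical generated instances.
generated = [
    {"passage": "The recipe does not call for sugar."},
    {"passage": "The recipe calls for plenty of sugar."},
]

validity = sum(map(is_valid, generated)) / len(generated)
print(f"validity rate: {validity:.0%}")
```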