Alex Gill
@agill32.bsky.social
NLP researcher at U of U
More results and analysis can be found in the paper.

We welcome any discussion, thanks for reading!!
June 4, 2025 at 10:24 PM
We hope that our work will inspire future research into:

- Can further prompt review improve the difficulty of synthetic data?

- What other axes (representativeness, diversity) are affected when using LLMs to generate benchmarks?
June 4, 2025 at 10:24 PM
Key takeaways:

- While LLM-generated evals may be 𝑣𝑎𝑙𝑖𝑑, as a whole they lose crucial complexity.

- LLMs are promising where complexity is less critical, but human annotators are vital for benchmarks assessing real-world generalization & nuanced scenarios.
June 4, 2025 at 10:24 PM
But are these instances similarly difficult?

We explore the difficulty of synthetic benchmarks by comparing performance on synthetic & human-written data across a suite of models.

We find that performance is consistently higher on generated versions of the datasets.
June 4, 2025 at 10:24 PM
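(A minimal sketch of this kind of difficulty comparison, assuming you already have per-instance correctness scores for each model on both versions of a dataset; the file names and record fields here are hypothetical, not the paper's actual evaluation harness.)

```python
import json
from statistics import mean

def accuracy(path: str) -> float:
    """Mean correctness over a JSONL results file.

    Each line is assumed to look like {"id": ..., "correct": 0 or 1}.
    """
    with open(path) as f:
        return mean(json.loads(line)["correct"] for line in f)

# Hypothetical result files: one human-written / synthetic pair per model.
models = ["model_a", "model_b", "model_c"]
for model in models:
    human = accuracy(f"results/{model}_condaqa_human.jsonl")
    synthetic = accuracy(f"results/{model}_condaqa_synthetic.jsonl")
    # A positive gap means the LLM-generated version is easier for this model.
    print(f"{model}: human={human:.3f} synthetic={synthetic:.3f} "
          f"gap={synthetic - human:+.3f}")
```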
We perform a human study and even find that LLM-generated data is preferred!

We ask NLP researchers to act as dataset creators and gather preferences between synthetic and human-authored data.
June 4, 2025 at 10:24 PM
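(A toy sketch of how pairwise preferences like these could be tallied; the judgment format below is hypothetical, not the paper's annotation schema.)

```python
from collections import Counter

# Each record is one researcher's choice on one (synthetic, human) pair.
judgments = [
    {"pair_id": 1, "choice": "synthetic"},
    {"pair_id": 2, "choice": "human"},
    {"pair_id": 3, "choice": "synthetic"},
]

counts = Counter(j["choice"] for j in judgments)
total = sum(counts.values())
for option, n in counts.items():
    print(f"{option}: {n}/{total} ({n / total:.0%})")
```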
We examine both the 𝑣𝑎𝑙𝑖𝑑𝑖𝑡𝑦 and 𝑑𝑖𝑓𝑓𝑖𝑐𝑢𝑙𝑡𝑦 of LLM-generated versions of two high-quality reading comprehension datasets: CondaQA & DROP.

We find that validity is not an issue. We are able to get LLMs to generate instances that are highly valid according to our dataset specs.
June 4, 2025 at 10:24 PM
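(A minimal sketch of the generate-then-check idea, assuming an OpenAI-style chat client; the spec text and the `meets_spec` validity check are placeholders for illustration, not the paper's actual pipeline.)

```python
from openai import OpenAI  # any chat-completion LLM API would work similarly

client = OpenAI()

SPEC = (
    "Write one reading-comprehension question about the passage that "
    "requires reasoning over a negated statement, plus its answer."
)  # stand-in for a dataset's instance specification (e.g. CondaQA-style)

def generate_instance(passage: str) -> str:
    """Ask the LLM to produce a candidate benchmark instance for a passage."""
    response = client.chat.completions.create(
        model="gpt-4o",  # any capable chat model
        messages=[
            {"role": "system", "content": SPEC},
            {"role": "user", "content": passage},
        ],
    )
    return response.choices[0].message.content

def meets_spec(instance: str) -> bool:
    """Hypothetical validity check against the dataset spec.

    In practice this could be a rubric applied by human annotators or an
    LLM judge; here it only checks that a question was produced at all.
    """
    return "?" in instance

passage = "..."  # a source passage from the original dataset
candidate = generate_instance(passage)
print("valid" if meets_spec(candidate) else "invalid", candidate)
```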
We are increasingly seeing LLMs used to create challenging benchmarks, which are then used to evaluate LLMs.

Is this a valid approach to evaluation construction? Do we lose anything in the process?
June 4, 2025 at 10:24 PM