Alex Gill
@agill32.bsky.social
NLP researcher at U of U
More results and analysis can be found in the paper.

We welcome any discussion, thanks for reading!!
June 4, 2025 at 10:24 PM
We hope that our work will inspire future research into:

- Can further prompt review improve the difficulty of synthetic data?

- What other axes (representativeness, diversity) are affected when using LLMs to generate benchmarks?
June 4, 2025 at 10:24 PM
Key takeaways:

- While LLM-generated evals may be 𝑣𝑎𝑙𝑖𝑑, as a whole they lose crucial complexity.

- LLMs are promising where complexity is less critical, but human annotators are vital for benchmarks assessing real-world generalization & nuanced scenarios.
June 4, 2025 at 10:24 PM
But are these instances similarly difficult?

We explore the difficulty of synthetic benchmarks by comparing performance on synthetic & human-written data across a suite of models.

We find that performance is consistently higher on generated versions of the datasets.
June 4, 2025 at 10:24 PM
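(A minimal sketch of this kind of difficulty comparison, assuming you already have per-instance correctness scores for each model on both versions of a dataset; the file names and record fields here are hypothetical, not the paper's actual evaluation harness.)

```python
import json
from statistics import mean

def accuracy(path: str) -> float:
    """Mean correctness over a JSONL results file.

    Each line is assumed to look like {"id": ..., "correct": 0 or 1}.
    """
    with open(path) as f:
        return mean(json.loads(line)["correct"] for line in f)

# Hypothetical result files: one human-written / synthetic pair per model.
models = ["model_a", "model_b", "model_c"]
for model in models:
    human = accuracy(f"results/{model}_condaqa_human.jsonl")
    synthetic = accuracy(f"results/{model}_condaqa_synthetic.jsonl")
    # A positive gap means the LLM-generated version is easier for this model.
    print(f"{model}: human={human:.3f} synthetic={synthetic:.3f} "
          f"gap={synthetic - human:+.3f}")
```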
We perform a human study and even find that LLM-generated data is preferred!

We ask NLP researchers to act as dataset creators and gather preferences between synthetic and human-authored data.
June 4, 2025 at 10:24 PM
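(A toy sketch of how pairwise preferences like these could be tallied; the judgment format below is hypothetical, not the paper's annotation schema.)

```python
from collections import Counter

# Each record is one researcher's choice on one (synthetic, human) pair.
judgments = [
    {"pair_id": 1, "choice": "synthetic"},
    {"pair_id": 2, "choice": "human"},
    {"pair_id": 3, "choice": "synthetic"},
]

counts = Counter(j["choice"] for j in judgments)
total = sum(counts.values())
for option, n in counts.items():
    print(f"{option}: {n}/{total} ({n / total:.0%})")
```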
We examine both the 𝑣𝑎𝑙𝑖𝑑𝑖𝑡𝑦 and 𝑑𝑖𝑓𝑓𝑖𝑐𝑢𝑙𝑡𝑦 of LLM-generated versions of two high-quality reading comprehension datasets: CondaQA & DROP.

We find that validity is not an issue. We are able to get LLMs to generate instances that are highly valid according to our dataset specs.
June 4, 2025 at 10:24 PM
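(A minimal sketch of the generate-then-check idea, assuming an OpenAI-style chat client; the spec text and the `meets_spec` validity check are placeholders for illustration, not the paper's actual pipeline.)

```python
from openai import OpenAI  # any chat-completion LLM API would work similarly

client = OpenAI()

SPEC = (
    "Write one reading-comprehension question about the passage that "
    "requires reasoning over a negated statement, plus its answer."
)  # stand-in for a dataset's instance specification (e.g. CondaQA-style)

def generate_instance(passage: str) -> str:
    """Ask the LLM to produce a candidate benchmark instance for a passage."""
    response = client.chat.completions.create(
        model="gpt-4o",  # any capable chat model
        messages=[
            {"role": "system", "content": SPEC},
            {"role": "user", "content": passage},
        ],
    )
    return response.choices[0].message.content

def meets_spec(instance: str) -> bool:
    """Hypothetical validity check against the dataset spec.

    In practice this could be a rubric applied by human annotators or an
    LLM judge; here it only checks that a question was produced at all.
    """
    return "?" in instance

passage = "..."  # a source passage from the original dataset
candidate = generate_instance(passage)
print("valid" if meets_spec(candidate) else "invalid", candidate)
```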
We are increasingly seeing LLMs used to create challenging benchmarks, which are then used to evaluate LLMs.

Is this a valid approach to evaluation construction? Do we lose anything in the process?
June 4, 2025 at 10:24 PM