📍Findings Session 1 - Hall C
📅 Wed, November 5, 13:00 - 14:00
arxiv.org/abs/2505.22830
We explore the difficulty of synthetic benchmarks by comparing performance on synthetic & human-written data across a suite of models.
We find that performance is consistently higher on the generated versions of the datasets.
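A minimal sketch of this kind of comparison, assuming per-model predictions on paired human-written and generated versions of a dataset (the model name, predictions, and labels below are illustrative placeholders, not the paper's actual models or benchmarks):

```python
# Sketch: score the same model on paired human-written and LLM-generated
# versions of a benchmark, then inspect the gap. All names and data here
# are illustrative placeholders.

def accuracy(preds, gold):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

# Hypothetical (predictions, gold labels) for each dataset version.
results = {
    "model_a": {
        "human":     (["A", "B", "B"], ["A", "B", "C"]),
        "synthetic": (["A", "B", "C"], ["A", "B", "C"]),
    },
}

for model, versions in results.items():
    human_acc = accuracy(*versions["human"])
    synth_acc = accuracy(*versions["synthetic"])
    # A positive gap means the generated version is easier for this model.
    print(f"{model}: human={human_acc:.2f} synthetic={synth_acc:.2f} "
          f"gap={synth_acc - human_acc:+.2f}")
```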
We ask NLP researchers to act as dataset creators, gathering their preferences between synthetic and human-authored data.
We find that validity is not an issue: LLMs can generate instances that are highly valid according to our dataset specifications.
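One way such a spec-based validity check could be operationalized, as a sketch only; the spec fields and rules below are assumptions, not the paper's actual dataset specification:

```python
# Sketch of a spec-based validity check for generated instances.
# The spec fields and the rules are assumptions for illustration.

from dataclasses import dataclass

@dataclass
class DatasetSpec:
    label_set: tuple        # allowed gold labels
    min_premise_words: int  # minimum premise length

def is_valid(instance: dict, spec: DatasetSpec) -> bool:
    """Return True if a generated instance satisfies the spec."""
    return (
        instance["label"] in spec.label_set
        and len(instance["premise"].split()) >= spec.min_premise_words
    )

spec = DatasetSpec(
    label_set=("entailment", "neutral", "contradiction"),
    min_premise_words=5,
)
instance = {
    "premise": "A man is playing a guitar on a busy street corner.",
    "hypothesis": "Someone is making music.",
    "label": "entailment",
}
print(is_valid(instance, spec))  # True
```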