We welcome any discussion. Thanks for reading!
- Can further prompt review improve the difficulty of synthetic data?
- What other axes (representativeness, diversity) are affected when using LLMs to generate benchmarks?
- While LLM-generated evals may be 𝑣𝑎𝑙𝑖𝑑, as a whole they lose crucial aspects of complexity.
- LLMs are promising where complexity is less critical, but human annotators are vital for benchmarks assessing real-world generalization & nuanced scenarios.
We explore the difficulty of synthetic benchmarks by comparing performance on synthetic & human-written data across a suite of models.
We find that performance is consistently higher on generated versions of the datasets.
We ask NLP researchers to act as dataset creators and gather preferences between synthetic and human-authored data.
We find that validity is not an issue. We are able to get LLMs to generate instances that are highly valid according to our dataset specs.
Is this a valid approach to evaluation construction? Do we lose anything in this process?