We evaluated the Wicked variants of MMLU, MMLU-Pro, and MMLU-Redux using 0-shot CoT. The performance drop exceeds 5%, showing that reasoning helps, but Wicked remains challenging even with CoT.
Our analysis identified questions with multiple correct candidates, of which only one is the most suitable. Our method includes a model that automatically detects these questions and excludes them from the Wicked process.
Presenting Wicked: a simple automated method to make MCQA benchmarks more challenging. Wicked shook up 18 open-weight LLMs on 6 benchmarks, with up to 19.7% performance drop with direct prompting 🤯
Paper: shorturl.at/1CGq0
Code: shorturl.at/n2nCU