We evaluated the Wicked variants of MMLU, MMLU-Pro, and MMLU-Redux using zero-shot chain-of-thought (CoT) prompting. The performance drop is above 5%, which shows that reasoning helps but that the Wicked variants are challenging even with CoT.
Our analysis identified questions in which several candidate answers are correct but only one is the most suitable. Our method includes a model that automatically detects such questions and excludes them from the Wicked process.
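The exclusion step described above can be sketched as a simple partition of the question pool. This is a minimal illustration under stated assumptions, not the actual implementation: the `Question` fields, the `split_for_wicked` function, and the `keyword_detector` heuristic are all hypothetical names, and the heuristic is only a toy stand-in for the detection model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Question:
    """A multiple-choice item; the field names are illustrative assumptions."""
    stem: str
    options: Tuple[str, ...]
    answer_index: int


def split_for_wicked(
    questions: Iterable[Question],
    has_single_best_answer: Callable[[Question], bool],
) -> Tuple[List[Question], List[Question]]:
    """Partition questions into (eligible, excluded).

    `has_single_best_answer` stands in for the detection model: it should
    return True when several options are correct but one is most suitable.
    Flagged questions are excluded from the Wicked process.
    """
    eligible: List[Question] = []
    excluded: List[Question] = []
    for q in questions:
        (excluded if has_single_best_answer(q) else eligible).append(q)
    return eligible, excluded


def keyword_detector(q: Question) -> bool:
    """Toy stand-in, NOT the detection model: flags 'best'/'most' wording."""
    words = q.stem.lower().split()
    return "best" in words or "most" in words


questions = [
    Question("Which option best describes osmosis?", ("A", "B", "C", "D"), 0),
    Question("What is 2 + 2?", ("3", "4", "5", "6"), 1),
]
eligible, excluded = split_for_wicked(questions, keyword_detector)
```

Passing the detector as a callable keeps the partition logic independent of how detection is done, so a trained classifier could replace the keyword heuristic without changing `split_for_wicked`.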