Ahmed S. M. Elhady
ahmedelhady.bsky.social
PhD Student @UPV/EHU, working on multilingual and multimodal GenAI. Ex-Microsoft and Agolo.
🤔 Recognizing NOTA requires better reasoning. Can chain of thought help reduce the gap?
We evaluated the Wicked variants of MMLU, MMLU-Pro, and MMLU-Redux using 0-shot CoT. The performance drop is above 5%, showing that reasoning helps, but Wicked is challenging even with CoT.
February 26, 2025 at 11:51 AM
⚠️ Be careful not to break the coherence of the questions!
Our analysis identified questions with multiple correct candidates, of which only one is the most suitable. Our method includes a model that automatically detects these questions and excludes them from the Wicked process.
February 26, 2025 at 11:51 AM
Method: We randomly replace a choice with "None of the above". NOTA should be chosen only when it replaces the correct answer. This method is often used in educational exams to assess the understanding of the examinees, encouraging thorough consideration of all options before answering.
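A minimal sketch of the replacement step described above, not the paper's actual code. The function name `make_wicked`, its return values, and the choice to keep "None of the above" in the position of the replaced option are all assumptions made for illustration.

```python
import random

NOTA = "None of the above"

def make_wicked(choices, answer_idx, rng=None):
    """Replace one randomly chosen option with NOTA (hypothetical helper).

    Returns (new_choices, gold_idx, nota_is_correct).
    """
    rng = rng or random.Random()
    replaced = rng.randrange(len(choices))  # position to overwrite
    new_choices = list(choices)             # leave the input untouched
    new_choices[replaced] = NOTA
    # If the correct option was replaced, NOTA now sits at answer_idx and
    # becomes the gold answer; otherwise the original answer is still there.
    # Either way the gold index does not move (assumption: NOTA stays in place).
    nota_is_correct = replaced == answer_idx
    return new_choices, answer_idx, nota_is_correct

if __name__ == "__main__":
    options = ["Paris", "Rome", "Madrid", "Berlin"]
    print(make_wicked(options, answer_idx=0, rng=random.Random(0)))
```

Passing a seeded `random.Random` keeps the perturbed benchmark reproducible across runs.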
February 26, 2025 at 11:51 AM
🧙‍♂️ New paper 🧙‍♀️:
Presenting Wicked: a simple automated method to make MCQA benchmarks more challenging. Wicked shook up 18 open-weight LLMs on 6 benchmarks, with a performance drop of up to 19.7% under direct prompting 🤯
Paper: shorturl.at/1CGq0
Code: shorturl.at/n2nCU
February 26, 2025 at 11:51 AM