Performance of reasoning models drops significantly when they are evaluated on multiple-choice questions in which the correct answer has been replaced with 'None of the others'
arxiv.org/abs/2502.12896
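
To make the manipulation in the title concrete, here is a minimal sketch of how a multiple-choice item could be rewritten so that its correct option becomes 'None of the others'. The `MCQ` class, its field names, and the `mask_correct_answer` helper are hypothetical names chosen for illustration. Keeping the placeholder in the original answer slot is also an assumption; the paper's actual procedure may differ.

```python
from __future__ import annotations

from dataclasses import dataclass, replace

PLACEHOLDER = "None of the others"


@dataclass(frozen=True)
class MCQ:
    """Hypothetical container for one multiple-choice item."""
    question: str
    options: tuple[str, ...]
    answer_index: int  # index of the correct option


def mask_correct_answer(item: MCQ, placeholder: str = PLACEHOLDER) -> MCQ:
    """Return a copy of the item whose correct option text is the placeholder.

    Assumption: the placeholder stays in the original answer slot, so
    answer_index still points at the correct choice.
    """
    options = list(item.options)
    options[item.answer_index] = placeholder
    return replace(item, options=tuple(options))


item = MCQ("What is 2 + 2?", ("3", "4", "5", "6"), answer_index=1)
masked = mask_correct_answer(item)
print(masked.options)
```

A model that has only learned to recognize the familiar answer string no longer finds it among the options; it has to rule out each remaining option to arrive at the placeholder.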