They modified the questions by adding unnecessary information to distract the LLMs
This led to much lower accuracy, even for o1
“When a measure becomes a target, it ceases to be a good measure” (Goodhart's law)
I.e., does better performance on benchmark tests translate to real-world performance?