Comparing 3 inference-time scaled models against their backbones, we find distinct improvements in the Reasoning subset:
⭐️Medical specialty coverage: MedXpertQA includes questions from 20+ exams of medical licensing level or higher
⭐️Realistic context: our MM subset is the first multimodal medical benchmark to introduce rich clinical information with diverse image types
Full results evaluating 17 LLMs, LMMs, and inference-time scaled models:
📌Percentage scores on our Text subset:
o3-mini: 37.30
R1: 37.76 - frontrunner among open-source models
o1: 44.67 - still room for improvement!