📒Preprint: arxiv.org/pdf/2501.18362
🗃️Data files will be released shortly at: github.com/TsinghuaC3I/...
Comparing 3 inference-time scaled models against their backbones, we find distinct improvements in the Reasoning subset:
- Filtering for difficulty and diversity using responses from humans + 8 AI experts
- Question rewriting & option set expansion to lower data leakage risk
- Human expert proofreading & error correction
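The thread does not spell out the exact filtering rule, so here is only a minimal sketch of what a difficulty filter over model responses could look like. It assumes a simple rule: keep a question if at most `max_correct` of the AI experts answer it correctly. The function name, the threshold, and the data layout are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def filter_by_difficulty(
    questions: List[dict],
    model_answers: Dict[str, Dict[str, str]],
    max_correct: int = 4,
) -> List[dict]:
    """Keep questions that at most `max_correct` models answer correctly.

    Hypothetical layout (an assumption, not the authors' format):
      questions:     [{"id": ..., "answer": ...}, ...]
      model_answers: {model_name: {question_id: predicted_option}}
    """
    kept = []
    for q in questions:
        # Count how many models picked the gold option for this question.
        n_correct = sum(
            1
            for answers in model_answers.values()
            if answers.get(q["id"]) == q["answer"]
        )
        if n_correct <= max_correct:
            kept.append(q)
    return kept


# Toy data: three stand-in models, two questions.
questions = [
    {"id": "q1", "answer": "A"},
    {"id": "q2", "answer": "C"},
]
model_answers = {
    "model_1": {"q1": "A", "q2": "B"},
    "model_2": {"q1": "A", "q2": "C"},
    "model_3": {"q1": "A", "q2": "D"},
}

# q1 is answered correctly by all three models, q2 by only one.
hard = filter_by_difficulty(questions, model_answers, max_correct=1)
```

With this toy data only `q2` survives, since every model gets `q1` right. A real pipeline would also need the human response data and a diversity criterion, neither of which is modeled here.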
⭐️Medical specialty coverage: MedXpertQA includes questions from 20+ exams at the medical licensing level or above
⭐️Realistic context: MM is the first multimodal medical benchmark to introduce rich clinical information with diverse image types
Full results evaluating 17 LLMs, LMMs, and inference-time scaled models: