✅ MDPO matches the previous SOTA with 60× fewer gradient updates, and improves over SOTA by +9.6% on MATH-500 and +54.2% on Countdown when trained with the same number of gradient updates.
✅ MDPO + RCR consistently outperforms all baselines.
We see this as a step toward more sampling- and data-efficient DLMs.
💡 To address these, we propose: