@haoyuhe.bsky.social
PhD student @ AVG, University of Tübingen
📊 Results:
✅ MDPO matches the previous SOTA with 60× fewer gradient updates, and improves over SOTA by +9.6% on MATH-500 and +54.2% on Countdown when trained with the same number of gradient updates.
✅ MDPO + RCR consistently outperforms all baselines

We see this as a step toward more sampling- and data-efficient DLMs.
August 20, 2025 at 9:40 AM
🔹 Running Confidence Remasking (RCR) – a training-free decoding strategy that allows low-confidence tokens to be flexibly revised during generation.
August 20, 2025 at 9:40 AM
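To make the RCR idea concrete, here is a minimal, hypothetical sketch of a decoding loop in which committed tokens stay revisable. It is not the paper's exact RCR procedure; `model` (token ids in, per-position logits out), `mask_id`, the running-confidence EMA rule, and the remasking threshold are all assumptions made for illustration.

```python
import torch

@torch.no_grad()
def remasking_decode(model, ids, mask_id, steps=16, threshold=0.5, ema=0.5):
    """Confidence-guided decoding where committed tokens can be remasked later."""
    running_conf = torch.zeros(ids.shape, dtype=torch.float, device=ids.device)
    for _ in range(steps):
        probs = model(ids).softmax(dim=-1)            # (seq_len, vocab_size)
        top_conf, top_ids = probs.max(dim=-1)

        masked = ids == mask_id
        if masked.any():
            # commit the most confident masked position this step
            pos = torch.where(masked, top_conf, torch.full_like(top_conf, -1.0)).argmax()
            ids[pos] = top_ids[pos]
            running_conf[pos] = top_conf[pos]

        # keep a running confidence for committed tokens and remask the ones
        # the model no longer believes in, so early mistakes stay revisable
        committed = ids != mask_id
        cur_conf = probs.gather(-1, ids.unsqueeze(-1)).squeeze(-1)
        running_conf = torch.where(
            committed, ema * running_conf + (1 - ema) * cur_conf, running_conf
        )
        revise = committed & (running_conf < threshold)
        ids[revise] = mask_id
    return ids
```

The key contrast with a rigid schedule is the final remasking step: commitment is provisional, so a token predicted with low confidence early on can still be overwritten later.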
🔹 Masked Diffusion Policy Optimization – the first policy gradient method that optimizes MDLMs as a sequential decision-making process. MDPO exploits the property that MDLMs yield a full text completion at every inference step, and optimizes the model with intermediate-step rewards.
August 20, 2025 at 9:40 AM
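The sketch below shows the flavor of "reward every intermediate step" training: each unmasking action exposes a full completion that can be scored. This is a plain REINFORCE-style illustration, not the actual MDPO objective; `model`, `reward_fn`, the single-position action, and the centered-return baseline are assumptions for the example.

```python
import torch

def intermediate_reward_update(model, optimizer, ids, mask_id, reward_fn, steps=8):
    """One policy-gradient update using a reward at every decoding step."""
    log_probs, rewards = [], []
    for _ in range(steps):
        masked = ids == mask_id
        if not masked.any():
            break
        logits = model(ids)                            # (seq_len, vocab_size)
        dist = torch.distributions.Categorical(logits=logits)
        sampled = dist.sample()

        # action: fill in one randomly chosen masked position
        pos = masked.float().multinomial(1).squeeze()
        new_ids = ids.clone()
        new_ids[pos] = sampled[pos]

        # an MDLM exposes a full completion at every step, so each
        # intermediate sequence can be scored as its own reward signal
        completion = torch.where(new_ids == mask_id, sampled, new_ids)
        rewards.append(reward_fn(completion))
        log_probs.append(dist.log_prob(sampled)[pos])
        ids = new_ids

    if not log_probs:
        return 0.0
    returns = torch.tensor(rewards, dtype=torch.float)
    if returns.numel() > 1:
        returns = returns - returns.mean()             # crude variance-reduction baseline
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```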
👉 Rigid Remasking: The remasking schedules used at inference usually 'freeze' tokens once they are predicted and not remasked immediately, making it impossible to revise low-confidence tokens predicted in early steps.
💡 To address these, we propose:
August 20, 2025 at 9:40 AM
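For contrast with the RCR sketch above, here is a toy version of the rigid schedule being criticized: once a position is committed it is never reconsidered, so an early low-confidence guess can never be revised. The decoder below is a generic confidence-ranked unmasker, not any specific model's decoding code.

```python
import torch

@torch.no_grad()
def frozen_decode(model, ids, mask_id, steps=16):
    """Standard schedule: commit the most confident masked token and freeze it."""
    for _ in range(steps):
        masked = ids == mask_id
        if not masked.any():
            break
        probs = model(ids).softmax(dim=-1)             # (seq_len, vocab_size)
        conf, pred = probs.max(dim=-1)
        # the highest-confidence masked position is committed...
        pos = torch.where(masked, conf, torch.full_like(conf, -1.0)).argmax()
        ids[pos] = pred[pos]
        # ...and frozen: no later step ever remasks or rewrites it
    return ids
```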
👉 Training–Inference Divide: MDLMs are trained to predict all randomly masked tokens in a single pass, while at inference they follow a model-dependent, confidence-guided unmasking schedule that progressively reveals the structure of the generated sequence.
August 20, 2025 at 9:40 AM
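A hedged sketch of the training side of that divide: mask a random subset of tokens and predict them all in one forward pass, unlike the iterative, confidence-guided loops sketched above. The function name, the uniform masking rate, and the unweighted cross-entropy are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def masked_prediction_loss(model, ids, mask_id, mask_rate=None):
    """Single-pass training objective: predict every masked token at once."""
    if mask_rate is None:
        mask_rate = torch.rand(()).item()              # random masking level per example
    mask = torch.rand(ids.shape, device=ids.device) < mask_rate
    if not mask.any():
        mask[0] = True                                 # ensure at least one target
    corrupted = torch.where(mask, torch.full_like(ids, mask_id), ids)

    logits = model(corrupted)                          # one pass predicts all masked tokens
    return F.cross_entropy(logits[mask], ids[mask])
```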
Masked diffusion language models (MDLMs) are emerging as powerful alternatives to autoregressive LMs, yet they face two fundamental but overlooked problems:
August 20, 2025 at 9:40 AM