ML Research Template (https://github.com/CLAIRE-Labo/python-ml-research-template)
📰 Paper: arxiv.org/abs/2507.08068
Hidden gems and open questions in the 30+ page appendix 💎
🧑💻 Code: github.com/CLAIRE-Labo/...
🌐 Blog: claire-labo.github.io/quantile-rewar
We show the equivalence of a family of transformations, which lets us qualitatively interpret the quantile-reward-optimal policy as a Best-of-N policy 🎯
Empirically, each transformation brings different training dynamics, and comparing all of them remains an open question! 🕵️
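A rough sketch of the intuition, in my own notation (not necessarily the paper's): writing F for the reward CDF under π_ref, both the Best-of-N policy and the KL-regularized optimum under the quantile reward are π_ref reweighted by an increasing function of the reward quantile.

```latex
% Best-of-N density for a continuous reward (standard order-statistics identity):
\pi_{\mathrm{BoN}}(y \mid x) \;=\; N \, F\big(r(x,y)\big)^{N-1} \, \pi_{\mathrm{ref}}(y \mid x)
% KL-regularized optimum with the quantile reward q(x,y) = F(r(x,y)):
\pi^{*}(y \mid x) \;=\; \frac{1}{Z_\beta(x)} \, \pi_{\mathrm{ref}}(y \mid x) \, \exp\!\big(q(x,y)/\beta\big)
% Both tilt \pi_{\mathrm{ref}} by a monotone increasing function of the quantile F(r(x,y)).
```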
We derive a framework around QRPO for using transformations on top of the quantile reward.
Each transformation reshapes the reward distribution and changes the properties of the optimal policy, while keeping the partition function tractable.
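As an illustration only (my own sketch, not the paper's code): under π_ref the quantile reward of a sampled completion is Uniform(0, 1), so the partition function of any transformation applied on top of it reduces to a 1-D integral over [0, 1], which stays tractable.

```python
import numpy as np

# Illustrative sketch (mine, not the paper's code). For a transformation phi of the
# quantile reward, the partition function is a 1-D integral:
#   Z(beta) = E_{u ~ U(0,1)}[exp(phi(u) / beta)] = \int_0^1 exp(phi(u) / beta) du

def partition_function(phi, beta, num_points=100_000):
    """Midpoint-rule estimate of Z(beta) for a transformation phi of the quantile."""
    u = (np.arange(num_points) + 0.5) / num_points
    return np.mean(np.exp(phi(u) / beta))

beta = 0.1
print(partition_function(lambda u: u, beta))     # plain quantile reward, numerically
print(beta * (np.exp(1 / beta) - 1))             # same quantity in closed form
print(partition_function(lambda u: u**4, beta))  # an example reshaping transformation
```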
This is simply because the target policy is much further away from the training support 🎯
Our understanding of the KL-regularized closed-form solution gives insights into the "DPO chosen probabilities decreasing" problem! 🤔
But when compressed into preferences for DPO and SimPO, it leads to the typical length-bias trend, despite the reduction in mean length.
* QRPO does not need many reference rewards to estimate quantiles: for high-quality offline datasets, 1-3 are enough (see the sketch below)!
* And you can scale this number for off-policy data generated from the reference model! 📈
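A minimal sketch of what estimating a quantile from a few reference rewards could look like (names and numbers are mine, purely illustrative):

```python
import numpy as np

# Hypothetical illustration (not the paper's code): the quantile reward of a dataset
# completion is estimated as the empirical CDF value of its reward among the rewards
# of a few completions sampled from the reference model for the same prompt.

def estimated_quantile_reward(reward, ref_rewards):
    """Fraction of reference-model rewards that the completion's reward beats or ties."""
    return float(np.mean(np.asarray(ref_rewards, dtype=float) <= reward))

ref_rewards = [0.2, 1.1, 2.4]  # e.g. 3 reference completions scored for one prompt
print(estimated_quantile_reward(1.7, ref_rewards))  # 0.666... -> beats 2 of 3 references
```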
🚀 The result: Quantile Reward Policy Optimization!
QRPO transforms rewards into quantile rewards, for which we derive Z, and then fits the closed-form optimal RL solution with a simple regression! 📉
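A rough sketch of the kind of regression this implies (my paraphrase of the closed form; the paper's exact objective may differ): the KL-regularized optimum satisfies β·log(π*(y|x)/π_ref(y|x)) = q(x,y) − β·log Z_β, so the policy's scaled log-ratio can be regressed onto the shifted quantile reward.

```python
import torch

# Sketch under my own assumptions (may differ from the paper's exact loss).
# For the plain quantile reward, Z_beta = beta * (exp(1 / beta) - 1) is known in closed form.

def qrpo_style_loss(logp_policy, logp_ref, quantile_reward, beta, log_z):
    """Squared-error regression of the scaled log-ratio onto the centered quantile reward."""
    pred = beta * (logp_policy - logp_ref)   # beta * log(pi_theta(y|x) / pi_ref(y|x))
    target = quantile_reward - beta * log_z  # what the closed-form optimum would produce
    return torch.mean((pred - target) ** 2)
```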
2️⃣ Knowing the reward distribution => knowing the MGF => knowing Z 🔐
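In symbols (my own shorthand): the partition function of the KL-regularized optimum is an expectation over π_ref, i.e. the moment-generating function of the reward distribution at 1/β; for the quantile reward that distribution is Uniform(0, 1), so Z is known exactly.

```latex
Z_\beta(x) \;=\; \mathbb{E}_{y \sim \pi_{\mathrm{ref}}(\cdot \mid x)}\!\left[ \exp\!\big(r(x,y)/\beta\big) \right]
\;=\; M_{r(x,\cdot)}\!\left(\tfrac{1}{\beta}\right),
\qquad
q \sim \mathcal{U}(0,1) \;\Rightarrow\; Z_\beta \;=\; \beta\!\left(e^{1/\beta} - 1\right)
```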
This is the problem that limits DPO-like methods to pairwise data. We solve it thanks to 3 insights! 💡
📰 Paper: arxiv.org/abs/2405.00662
🧑💻 Code: github.com/CLAIRE-Labo/...