ML Research Template (https://github.com/CLAIRE-Labo/python-ml-research-template)
We derive a framework around QRPO for using transformations on top of the quantile reward.
Each transformation reshapes the reward distribution and affects the properties of the optimal policy, while having a tractable partition function.
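A minimal sketch of why Z stays tractable under such transformations (my own illustration; `partition_function` is a hypothetical name, not the paper's code): the quantile reward is Uniform(0,1) under the reference policy, so Z for any transformed reward f(q) reduces to a one-dimensional integral over [0,1].

```python
import math
import numpy as np

def partition_function(f, beta, n=100_001):
    """Z = E_{q ~ U(0,1)}[exp(f(q) / beta)] via trapezoidal quadrature.

    Because the quantile reward q is uniform on [0,1] under the reference
    policy, Z for any transformation f is a 1-D integral, hence tractable.
    """
    q = np.linspace(0.0, 1.0, n)
    vals = np.exp(f(q) / beta)
    dq = q[1] - q[0]
    return float(np.sum((vals[:-1] + vals[1:]) / 2.0) * dq)

beta = 0.5
# Identity transformation recovers the closed form beta * (exp(1/beta) - 1):
print(partition_function(lambda q: q, beta))      # ≈ 3.1945
# A reshaping transformation (purely illustrative):
print(partition_function(lambda q: q**2, beta))
```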
This is simply because the target policy is much further away from the training support 🎯
But when compressed to preferences for DPO and SimPO, it leads to the typical length bias trend, despite the reduction in mean length.
* QRPO does not need many reference rewards to estimate quantiles: for high-quality offline datasets, 1-3 are enough!
* And you can scale this number for off-policy data generated from the reference model! 📈
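As a rough sketch of the quantile estimate itself (`quantile_reward` is an illustrative helper, not the released code): with a handful of reference completions per prompt, the quantile reward is just the fraction of reference rewards that the sample's reward beats.

```python
import numpy as np

def quantile_reward(r, ref_rewards):
    """Empirical quantile of reward r among rewards of reference-model
    samples for the same prompt (a sketch, not the paper's exact estimator)."""
    ref_rewards = np.asarray(ref_rewards, dtype=float)
    # Fraction of reference rewards at or below r:
    return float((ref_rewards <= r).mean())

# Even 1-3 reference rewards give a coarse but usable estimate:
print(quantile_reward(2.5, [1.0, 2.0, 3.0]))  # 0.666...
```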
🚀 The result: Quantile Reward Policy Optimization!
QRPO transforms rewards to quantile rewards for which we derive Z, and can then fit the closed-form optimal RL solution with a simple regression! 📉
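A minimal sketch of the regression idea (hedged: `qrpo_style_loss` is my illustrative name, and the paper's exact objective may differ): the optimal policy satisfies r = β·log(π*/π_ref) + β·log Z, so one can regress the scaled log-ratio onto the quantile reward minus β·log Z, with Z in closed form for a Uniform(0,1) quantile reward.

```python
import math
import numpy as np

def qrpo_style_loss(logp_pi, logp_ref, q, beta):
    """Regress beta * log(pi / pi_ref) onto q - beta * log Z.

    A sketch of the idea, not the paper's exact objective. For a
    Uniform(0,1) quantile reward q, Z = E[exp(q/beta)] = beta*(exp(1/beta)-1).
    """
    log_Z = math.log(beta * (math.exp(1.0 / beta) - 1.0))
    pred = beta * (np.asarray(logp_pi) - np.asarray(logp_ref))
    target = np.asarray(q) - beta * log_Z
    return float(np.mean((pred - target) ** 2))
```

The loss hits zero exactly when the policy matches the closed-form RL solution, which is why a simple regression suffices.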
2️⃣ Knowing the reward distribution => knowing the MGF => knowing Z 🔐
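A quick numeric check of this chain (my own illustration): for the Uniform(0,1) quantile reward, Z = E[exp(q/β)] is exactly the MGF of Uniform(0,1) evaluated at t = 1/β.

```python
import math
import numpy as np

beta = 0.5
# Known distribution => known MGF => known Z:
# Z = E_{q ~ U(0,1)}[exp(q / beta)] = beta * (exp(1/beta) - 1)
closed_form = beta * (math.exp(1.0 / beta) - 1.0)
# Monte Carlo estimate of the same expectation:
rng = np.random.default_rng(0)
monte_carlo = float(np.exp(rng.uniform(size=1_000_000) / beta).mean())
print(closed_form, monte_carlo)  # the two agree closely
```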
❌ You want rewards, but GRPO only works online?
❌ You want offline, but DPO is limited to preferences?
✅ QRPO can do both!
🧵Here's how we do it:
@caglarai.bsky.social
🧑💻 github.com/CLAIRE-Labo/...
Wed 11 Dec 11 am - 2 pm PST
West Ballroom A-D #6403
@caglarai.bsky.social @andreamiele.bsky.social @razvan-pascanu.bsky.social