Skander Moalla
@skandermoalla.bsky.social
PhD @Caglar Gulcehre Lab for AI Research (CLAIRE) @EPFL. Deep Reinforcement Learning, RLHF, foundation models.
ML Research Template (https://github.com/CLAIRE-Labo/python-ml-research-template)
QRPO is a framework. You can shape the optimal policy! 🎛️
We derive a framework around QRPO for using transformations on top of the quantile reward.
Each transformation reshapes the reward distribution and affects the properties of the optimal policy, while having a tractable partition function.
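To make the shaping concrete, here is a minimal Python sketch (the function names and the example power transformation are ours, not from the paper): once rewards are quantile rewards, they are Uniform(0,1) under the reference policy, so the partition function of any transformation applied on top reduces to a cheap one-dimensional integral.

```python
import numpy as np

beta = 0.1

def phi_identity(u):
    return u  # plain quantile reward

def phi_power(u, k=3.0):
    return u ** k  # example: emphasizes the top of the reward distribution

def log_partition(phi, beta, n_grid=100_000):
    """log Z = log E_{U ~ Uniform(0,1)}[exp(phi(U) / beta)], estimated on a grid."""
    u = np.linspace(0.0, 1.0, n_grid)
    vals = phi(u) / beta
    m = vals.max()  # log-mean-exp for numerical stability
    return m + np.log(np.exp(vals - m).mean())

# The induced optimal policy is pi*(y|x) ∝ pi_ref(y|x) * exp(phi(q(x, y)) / beta):
# a different phi reshapes how strongly high-quantile completions are upweighted,
# while log Z stays a one-dimensional computation.
print(log_partition(phi_identity, beta), log_partition(phi_power, beta))
```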
July 15, 2025 at 6:45 PM
And we show that for relatively high beta, with good data, the probabilities of the training completions increase as predicted 💯
July 15, 2025 at 6:45 PM
For QRPO, this is not a mystery anymore; we know exactly where the probabilities should move, and we explain why it is normal for them to decrease when the regularization (beta) is very low.
This is simply because the target policy is much further away from the training support 🎯
July 15, 2025 at 6:45 PM
💬 The reward model we use has been trained to be robust to length bias, and we see that this is preserved in QRPO and REBEL, which use rewards.
But when the reward signal is compressed to preferences for DPO and SimPO, the typical length-bias trend reappears, despite the reduction in mean length.
July 15, 2025 at 6:45 PM
🥇 QRPO achieves top performance in chat and coding compared to DPO, REBEL, and SimPO, each capturing a different way to learn from the reward signal (preference, reward difference, length normalization).
July 15, 2025 at 6:45 PM
Obviously, nothing comes for free, but we give you a great deal! 🤝

* QRPO does not need many reference rewards to estimate quantiles: for high-quality offline datasets, 1-3 are enough (see the sketch below)!

* And you can scale this number for off-policy data generated from the reference model! 📈
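A rough sketch of what estimating a quantile reward from a handful of reference rewards could look like (one plausible empirical estimator, not the authors' exact code):

```python
import numpy as np

def quantile_reward(reward, ref_rewards):
    """Empirical quantile of `reward` among rewards of reference-policy completions."""
    ref_rewards = np.asarray(ref_rewards, dtype=float)
    return float((ref_rewards <= reward).mean())

# With a high-quality offline dataset, even 3 reference rewards per prompt give a
# coarse but usable estimate; sampling more reference completions refines it.
print(quantile_reward(reward=1.7, ref_rewards=[0.2, 1.1, 2.5]))  # -> 0.666...
```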
July 15, 2025 at 6:45 PM
3️⃣ We can transform the reward distribution to make it known. It's uniform for reward quantiles! 🔑

🚀 The result: Quantile Reward Policy Optimization!

QRPO transforms rewards to quantile rewards for which we derive Z, and can then fit the closed-form optimal RL solution with a simple regression! 📉
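A minimal sketch of what such a regression could look like, assuming a squared-error fit of the policy log-ratio to the quantile reward minus the known log-partition (our simplification, not the paper's exact objective):

```python
import math
import torch

# Assumption (ours): with quantile rewards q in [0, 1], uniform under pi_ref,
# the KL-regularized optimal policy is
#   pi*(y|x) = pi_ref(y|x) * exp(q(x, y) / beta) / Z,
# with Z = E_{U~Uniform(0,1)}[exp(U / beta)] = beta * (exp(1 / beta) - 1) in closed form,
# so matching log pi_theta to log pi* becomes a simple squared regression.

def qrpo_regression_loss(logp_theta, logp_ref, quantile_reward, beta):
    """Squared error between beta * log(pi_theta / pi_ref) and q - beta * log Z."""
    log_z = torch.log(torch.expm1(torch.tensor(1.0 / beta))) + math.log(beta)
    target = quantile_reward - beta * log_z
    pred = beta * (logp_theta - logp_ref)
    return ((pred - target) ** 2).mean()

# Example with dummy sequence log-probabilities for a small batch.
logp_theta = torch.tensor([-42.0, -37.5, -51.2], requires_grad=True)
logp_ref = torch.tensor([-41.3, -38.0, -50.0])
q = torch.tensor([0.9, 0.3, 0.6])  # estimated quantile rewards
loss = qrpo_regression_loss(logp_theta, logp_ref, q, beta=0.1)
loss.backward()
```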
July 15, 2025 at 6:45 PM
1️⃣ The “infinite sum over all possible LLM generations” argument is a myth. We rewrite the partition function Z in terms of rewards, revealing that Z is given by the moment generating function (MGF) of the reward distribution!

2️⃣ Knowing the reward distribution => knowing the MGF => knowing Z 🔐
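Spelled out (notation is ours), the rewrite behind 1️⃣ and 2️⃣ is a one-line change of variables from generations to rewards:

```latex
% Rewriting Z as an expectation over the reward distribution under pi_ref.
\begin{align*}
Z(x) &= \sum_{y} \pi_{\text{ref}}(y \mid x)\, e^{r(x,y)/\beta}
      = \mathbb{E}_{y \sim \pi_{\text{ref}}(\cdot \mid x)}\!\bigl[e^{r(x,y)/\beta}\bigr] \\
     &= \mathbb{E}_{R \sim p_{\text{ref}}(\cdot \mid x)}\!\bigl[e^{R/\beta}\bigr]
      = M_{R \mid x}\!\left(\tfrac{1}{\beta}\right),
\end{align*}
% i.e., the moment generating function of the reward distribution under
% \pi_{\text{ref}}, evaluated at 1/\beta: know the reward distribution, know Z.
```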
July 15, 2025 at 6:45 PM
🚀 Big time! We can finally do simple LLM RL fine-tuning with rewards and leverage offline/off-policy data!

❌ You want rewards, but GRPO only works online?
❌ You want offline, but DPO is limited to preferences?
✅ QRPO can do both!

🧵Here's how we do it:
July 15, 2025 at 6:45 PM
Also, check out our ML project template—it’s a game-changer!🚀🚀
@caglarai.bsky.social
🧑‍💻 github.com/CLAIRE-Labo/...
December 10, 2024 at 7:39 PM
Ever been puzzled by your PPO agent collapsing out of nowhere? 📈🤯📉 Come check out our poster tomorrow!
Wed 11 Dec 11 am - 2 pm PST
West Ballroom A-D #6403
@caglarai.bsky.social @andreamiele.bsky.social @razvan-pascanu.bsky.social
December 10, 2024 at 6:33 PM