Ahmad Beirami
@abeirami.bsky.social
stealth // Gemini RL+inference @ Google DeepMind // Conversational AI @ Meta // RL Agents @ EA // ML+Information Theory @ MIT+Harvard+Duke // Georgia Tech PhD // زن زندگی آزادی (Woman, Life, Freedom)
📍{NYC, SFO, YYZ}
🔗 https://beirami.github.io/
This is the conclusion slide of a talk I gave more than a year ago on RL/Alignment! It still holds true today.
September 10, 2025 at 1:07 PM
Enjoyed speaking with @DeltaInstitutes about going from information theory to ML, recent safety alignment/RL work, and lessons on RL for LLMs that stuck!

Check out the podcast episode here: lnkd.in/eb6dWHDv
August 30, 2025 at 11:15 AM
Congratulations to the Google team on the release of the newest Gemini Image generation model! 🍌🍌

I am super impressed with what the model did here (no other model comes even close -- including Google's previous model). This is truly bananas!
August 26, 2025 at 4:10 PM
What are the founders going to own? 🤔
August 25, 2025 at 11:01 PM
A research project needs test beds of different granularities. Iterating at large scale is expensive and hard to debug.

Validating on smaller models helps you move fast by ruling out ideas that are unlikely to work.

My alignment research is driven by thinking about a ternary language model.
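For illustration, a minimal sketch of why such a test bed is convenient (my own toy numbers, not anything from a real model): with a three-token vocabulary, alignment quantities like the expected reward and the KL to the reference policy can be computed exactly by enumeration.

import numpy as np

# Toy ternary "language model": a distribution over a 3-token vocabulary,
# small enough that alignment quantities can be computed exactly.
ref_policy = np.array([0.6, 0.3, 0.1])   # reference model
policy = np.array([0.2, 0.3, 0.5])       # candidate aligned model
reward = np.array([0.0, 1.0, 2.0])       # made-up per-token reward

expected_reward = float(np.dot(policy, reward))
kl_to_ref = float(np.sum(policy * np.log(policy / ref_policy)))
print(f"E[r] = {expected_reward:.3f}, KL(policy || ref) = {kl_to_ref:.3f}")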
August 20, 2025 at 11:46 AM
Once these two steps are done, RL can be done with any algorithm (we used PPO), leading to significant quality improvements over baselines.
August 19, 2025 at 12:17 AM
GRPO arguably does a form of reward calibration, which is key to its performance gains as well. We show that offline reward calibration is competitive with GRPO on standard sampling (pass@1).

But what about pass@k?
bsky.app/profile/abei...
August 19, 2025 at 12:17 AM
> Reward calibration

Reward calibration gives meaning to the raw reward across problems.

Calibration makes the reward of a random model output for a given prompt distributed Uniform[0, 1]. This is done offline, prior to RL, by collecting multiple rollouts of the base model per prompt and applying the empirical reward CDF (the probability integral transform).
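A minimal numerical sketch of that definition (toy numbers, not from any paper): calibrate raw rewards with the empirical CDF of the base-model rollouts and check that a random base-model output indeed gets a roughly Uniform[0, 1] calibrated reward.

import numpy as np

rng = np.random.default_rng(0)
# Raw reward-model scores of base-model rollouts for one prompt,
# collected once before RL (toy numbers).
offline_rollout_rewards = rng.normal(loc=2.0, scale=1.5, size=1000)

def calibrate(raw_reward):
    # Empirical CDF: fraction of offline rollouts scoring below this reward.
    return np.mean(offline_rollout_rewards < raw_reward)

# A random model output for this prompt gets a roughly Uniform[0, 1] reward.
fresh = rng.normal(loc=2.0, scale=1.5, size=1000)
calibrated = np.array([calibrate(r) for r in fresh])
print(calibrated.mean(), calibrated.std())   # ~0.5 and ~0.29, as for Uniform[0, 1]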
August 19, 2025 at 12:17 AM
pass@k is an effective decoding paradigm for improving models at math, and also for an attacker trying to jailbreak a model.

Given that we will be decoding from the model with pass@k, or that an adversary will be jailbreaking it with pass@k, how should we think about RL?

A short 🧵
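For reference, here is the standard unbiased pass@k estimator (Chen et al., 2021), which is the notion of pass@k I have in mind:

import numpy as np

def pass_at_k(n, c, k):
    # Probability that at least one of k samples is correct,
    # given c correct samples out of n total (unbiased estimator).
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Toy example: 100 samples per problem, 5 of them correct.
print(pass_at_k(n=100, c=5, k=1))    # 0.05
print(pass_at_k(n=100, c=5, k=16))   # ~0.59: more attempts help a lot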
August 19, 2025 at 12:17 AM
The updated paper is on openreview:
openreview.net/forum?id=hIn...

This includes a direct comparison with GRPO. The point is not to be "better" than GRPO, but rather to show that calibration is the key to unlocking the performance gains that were obtained by GRPO.
August 16, 2025 at 9:23 PM
This is how offline calibration + PPO compares against GRPO on helpfulness BT rewards.

Would be curious to see how this might help your use cases.
August 16, 2025 at 6:48 PM
Post-training research was fueled by the mathematical foundation of KL-regularized RL. That led to a lot of algorithmic research and a ton of progress over a few years. This helped us learn how to "distill" metrics back into models.

But today we are optimizing workflows/agents.
August 10, 2025 at 3:24 PM
This leads to a sizable improvement over standard RL and sets a new SOTA compared to various more sophisticated algorithms.
August 9, 2025 at 2:50 PM
The resulting algorithm is a two-line change to your favorite standard RL algorithm: a simple empirical-CDF transformation of the reward.

The rollouts are done offline once and can be reused across all epochs, hyperparameter sweeps, etc., while online we only roll out the model once per prompt.
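Roughly, the change looks like this (a sketch with placeholder numbers, not the exact code from the paper):

import numpy as np

# Recorded offline rewards for one prompt, sorted once.
sorted_offline = np.sort([1.2, 0.4, 2.7, 1.9, 0.8, 3.1])

def calibrated_reward(raw_reward):
    # Line 1: rank of the online sample among the offline rollouts.
    rank = np.searchsorted(sorted_offline, raw_reward)
    # Line 2: calibrated reward in [0, 1], used in place of the raw reward.
    return rank / len(sorted_offline)

print(calibrated_reward(2.0))   # 0.666...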
August 9, 2025 at 2:50 PM
This led to a very simple algorithm: the multiple rollouts happened offline prior to training, and the raw reward values were recorded so that the reward of any response could be calibrated via the empirical CDF at training time.
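A sketch of that offline phase, assuming hypothetical base_model.sample and reward_model.score helpers (placeholders, not a real API):

def collect_offline_rewards(prompts, base_model, reward_model, n_rollouts=64):
    # Roll out the base model per prompt before training and record raw rewards.
    cache = {}
    for prompt in prompts:
        responses = [base_model.sample(prompt) for _ in range(n_rollouts)]
        cache[prompt] = sorted(reward_model.score(prompt, y) for y in responses)
    return cache  # sorted raw rewards per prompt, reused for calibration during RL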
August 9, 2025 at 2:50 PM
In concurrent work, we had observed a similar phenomenon: instead of calibrating the reward to be N(0, 1), we calibrated it to be U[0, 1] with respect to the responses of the reference model, motivated by theoretical findings.
August 9, 2025 at 2:50 PM
In fact, the BT reward distribution can vary significantly across prompts. Here you are looking at the helpfulness scores of PaLM-2 for 10 different prompts; each curve represents one of 10 random prompts, with scores calibrated over 100 different responses per prompt.
August 9, 2025 at 2:50 PM
The main ingredient behind GRPO's performance leap is the calibration of the reward/value via multiple rollouts per prompt.

Let me elaborate on what I mean by that, and on a cheaper way of doing it offline.
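Concretely, this is the GRPO-style calibration I have in mind (a sketch of the advantage computation only, not the full algorithm):

import numpy as np

def grpo_group_advantages(group_rewards, eps=1e-6):
    # Normalize rewards of a group of online rollouts for the same prompt by
    # the group mean and std, so advantages are comparable across prompts.
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

print(grpo_group_advantages([0.0, 1.0, 1.0, 0.0]))   # one prompt, 4 online rollouts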
August 9, 2025 at 2:50 PM
🚀 Heading to #ICML2025 in Vancouver 🇨🇦 (Jul 15‑18)!

Building agentic AI workflows and chasing 95%+ production‑ready reliability? Let’s swap wins & pain points over coffee. Email me to find time to chat.

Excited to reconnect with friends and meet new faces!

See session details in the 🧵👇
July 9, 2025 at 10:41 PM
As you react to EMNLP rebuttals, please make sure to substantiate why the score you have chosen is justified.

Keeping that information latent doesn't help authors understand your point and improve their work, and it doesn't help the AC/SAC make a grounded and fair decision.
July 2, 2025 at 2:36 PM
I am taking the time to explore these applications and where/how I can contribute to make this a reality. If you are also excited about what AI Agents can do, please do get in touch!

For now, I am spending a week in beautiful Tuscany and will be back to work next week!
June 2, 2025 at 8:35 PM
After three incredible years, today is my last day at Google DeepMind!

I am truly grateful to the amazing colleagues who made the journey 1000x more fruitful and enjoyable! I am forever indebted to my collaborators who showed me how to be better at everything via demonstrations.
June 2, 2025 at 8:35 PM
Your RL method is not working? — build on Qwen, the base model built for RL.
• Single RL prompt? ✅
• No reward model? ✅
• Random Reward Generator? ✅
• Adversarial chaos? ✅
Whatever data you have (or don’t), Qwen adapts. (*Restrictions apply)
May 29, 2025 at 11:27 AM
KL-regularized RL actually has a closed-form solution (the aligned distribution): the reference policy exponentially tilted by the reward, which gives an exponential family of distributions with a lot of cool properties.
See: arxiv.org/abs/2205.11275 and arxiv.org/abs/2404.01730
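On a toy vocabulary, the closed form is just a reward-tilted softmax of the reference policy (a sketch with made-up numbers):

import numpy as np

def aligned_distribution(ref_probs, rewards, beta):
    # pi*(y) is proportional to pi_ref(y) * exp(r(y) / beta): the reference
    # policy exponentially tilted by the reward.
    logits = np.log(ref_probs) + np.asarray(rewards) / beta
    weights = np.exp(logits - logits.max())   # numerically stable
    return weights / weights.sum()

ref = np.array([0.6, 0.3, 0.1])   # toy reference policy
r = np.array([0.0, 1.0, 2.0])     # toy reward
print(aligned_distribution(ref, r, beta=1.0))
print(aligned_distribution(ref, r, beta=0.1))  # smaller beta -> stronger tilt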
May 27, 2025 at 5:40 PM
Happy #WomeninMathematics day!

May 12 marks the birthday of Maryam Mirzakhani, a mathematician who was awarded the Fields Medal (the highest honor in mathematics) for her contributions to geometry and dynamical systems.

Two of my fav mathematicians:
Maryam Mirzakhani & Ingrid Daubechies
May 12, 2025 at 2:54 PM