Ahmad Beirami
@abeirami.bsky.social
stealth // Gemini RL+inference @ Google DeepMind // Conversational AI @ Meta // RL Agents @ EA // ML+Information Theory @ MIT+Harvard+Duke // Georgia Tech PhD // زن زندگی آزادی (Woman, Life, Freedom)
📍{NYC, SFO, YYZ}
🔗 https://beirami.github.io/
This is the conclusion slide of a talk I gave more than a year ago on RL/Alignment! It still holds true today.
September 10, 2025 at 1:07 PM
Enjoyed speaking with @DeltaInstitutes about going from information theory to ML, recent safety alignment/RL work, and lessons on RL for LLMs that stuck!

Check out the podcast episode here: lnkd.in/eb6dWHDv
August 30, 2025 at 11:15 AM
Congratulations to the Google team on the release of the newest Gemini Image generation model! 🍌🍌

I am super impressed with what the model did here (no other model comes even close -- including Google's previous model). This is truly bananas!
August 26, 2025 at 4:10 PM
What are the founders going to own? 🤔
August 25, 2025 at 11:01 PM
A research project needs test beds of different granularities. Iterating at large scale is expensive and hard to debug.

Validating on smaller models helps you move fast by ruling out ideas that are unlikely to work.

My alignment research is driven by thinking about a ternary language model.
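For illustration, a minimal sketch of why such a test bed is convenient (my own toy numbers, not anything from a real model): with a three-token vocabulary, alignment quantities like the expected reward and the KL to the reference policy can be computed exactly by enumeration.

import numpy as np

# Toy ternary "language model": a distribution over a 3-token vocabulary,
# small enough that alignment quantities can be computed exactly.
ref_policy = np.array([0.6, 0.3, 0.1])   # reference model
policy = np.array([0.2, 0.3, 0.5])       # candidate aligned model
reward = np.array([0.0, 1.0, 2.0])       # made-up per-token reward

expected_reward = float(np.dot(policy, reward))
kl_to_ref = float(np.sum(policy * np.log(policy / ref_policy)))
print(f"E[r] = {expected_reward:.3f}, KL(policy || ref) = {kl_to_ref:.3f}")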
August 20, 2025 at 11:46 AM
Once these two steps are done, RL can be done with any algorithm (we used PPO), leading to significant quality improvements over baselines.
August 19, 2025 at 12:17 AM
GRPO arguably does a form of reward calibration, which is key to its performance gains as well. We show that offline reward calibration is competitive with GRPO on standard sampling (pass@1).

But what about pass@k?
bsky.app/profile/abei...
August 19, 2025 at 12:17 AM
> Reward calibration

Reward calibration gives meaning to the raw reward across problems.

Calibration makes the reward of a random model output for a given prompt distributed Uniform[0, 1]. This is done offline, prior to RL, by collecting multiple rollouts of the base model per prompt and applying the empirical reward CDF (the probability integral transform).
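A minimal numerical sketch of that definition (toy numbers, not from any paper): calibrate raw rewards with the empirical CDF of the base-model rollouts and check that a random base-model output indeed gets a roughly Uniform[0, 1] calibrated reward.

import numpy as np

rng = np.random.default_rng(0)
# Raw reward-model scores of base-model rollouts for one prompt,
# collected once before RL (toy numbers).
offline_rollout_rewards = rng.normal(loc=2.0, scale=1.5, size=1000)

def calibrate(raw_reward):
    # Empirical CDF: fraction of offline rollouts scoring below this reward.
    return np.mean(offline_rollout_rewards < raw_reward)

# A random model output for this prompt gets a roughly Uniform[0, 1] reward.
fresh = rng.normal(loc=2.0, scale=1.5, size=1000)
calibrated = np.array([calibrate(r) for r in fresh])
print(calibrated.mean(), calibrated.std())   # ~0.5 and ~0.29, as for Uniform[0, 1]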
August 19, 2025 at 12:17 AM
pass@k is an effective decoding paradigm for improving models at math, and also for an attacker trying to jailbreak a model.

Given that we will be decoding from the model with pass@k, or that an adversary will be jailbreaking it with pass@k, how should we think about RL?

A short 🧵
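For reference, here is the standard unbiased pass@k estimator (Chen et al., 2021), which is the notion of pass@k I have in mind:

import numpy as np

def pass_at_k(n, c, k):
    # Probability that at least one of k samples is correct,
    # given c correct samples out of n total (unbiased estimator).
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Toy example: 100 samples per problem, 5 of them correct.
print(pass_at_k(n=100, c=5, k=1))    # 0.05
print(pass_at_k(n=100, c=5, k=16))   # ~0.59: more attempts help a lot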
August 19, 2025 at 12:17 AM
The updated paper is on openreview:
openreview.net/forum?id=hIn...

This includes a direct comparison with GRPO. The point is not to be "better" than GRPO, but rather to show that calibration is the key to unlocking the performance gains that were obtained by GRPO.
August 16, 2025 at 9:23 PM
This is how offline calibration + PPO compares against GRPO on helpfulness BT rewards.

Would be curious to see how this might help your use cases.
August 16, 2025 at 6:48 PM
Post-training research was fueled by the mathematical foundation of KL-regularized RL. That led to a lot of algorithmic research and a ton of progress over a few years. This helped us learn how to "distill" metrics back into models.

But today we are optimizing workflows/agents.
August 10, 2025 at 3:24 PM
This leads to a sizable improvement over standard RL and sets a new SOTA compared to various more sophisticated algorithms.
August 9, 2025 at 2:50 PM
The resulting algorithm is a two-line change to your favorite standard RL algorithm: a simple empirical-CDF transformation of the reward.

The rollouts are done offline once and can be reused across all epochs, hyperparameter sweeps, etc., while online we only roll out the model once per prompt.
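Roughly, the change looks like this (a sketch with placeholder numbers, not the exact code from the paper):

import numpy as np

# Recorded offline rewards for one prompt, sorted once.
sorted_offline = np.sort([1.2, 0.4, 2.7, 1.9, 0.8, 3.1])

def calibrated_reward(raw_reward):
    # Line 1: rank of the online sample among the offline rollouts.
    rank = np.searchsorted(sorted_offline, raw_reward)
    # Line 2: calibrated reward in [0, 1], used in place of the raw reward.
    return rank / len(sorted_offline)

print(calibrated_reward(2.0))   # 0.666...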
August 9, 2025 at 2:50 PM
This led to a very simple algorithm: the multiple rollouts happened offline prior to training, and the raw reward values were recorded so that the reward of any response could be calibrated via the empirical CDF at training time.
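A sketch of that offline phase, assuming hypothetical base_model.sample and reward_model.score helpers (placeholders, not a real API):

def collect_offline_rewards(prompts, base_model, reward_model, n_rollouts=64):
    # Roll out the base model per prompt before training and record raw rewards.
    cache = {}
    for prompt in prompts:
        responses = [base_model.sample(prompt) for _ in range(n_rollouts)]
        cache[prompt] = sorted(reward_model.score(prompt, y) for y in responses)
    return cache  # sorted raw rewards per prompt, reused for calibration during RL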
August 9, 2025 at 2:50 PM
In concurrent work, we had observed a similar phenomenon: instead of calibrating the reward to be N(0, 1), we calibrated it to be U[0, 1] with respect to the responses of the reference model, motivated by theoretical findings.
August 9, 2025 at 2:50 PM
In fact, the BT reward distribution can vary significantly across prompts. Here you are looking at the helpfulness scores of PaLM-2 for 10 different prompts; each curve represents one of 10 random prompts, with scores calibrated over 100 different responses per prompt.
August 9, 2025 at 2:50 PM
The main ingredient behind GRPO's performance leap is the calibration of the reward/value via multiple rollouts per prompt.

Let me elaborate on what I mean by that, and on a cheaper way of doing it offline.
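Concretely, this is the GRPO-style calibration I have in mind (a sketch of the advantage computation only, not the full algorithm):

import numpy as np

def grpo_group_advantages(group_rewards, eps=1e-6):
    # Normalize rewards of a group of online rollouts for the same prompt by
    # the group mean and std, so advantages are comparable across prompts.
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

print(grpo_group_advantages([0.0, 1.0, 1.0, 0.0]))   # one prompt, 4 online rollouts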
August 9, 2025 at 2:50 PM
🚀 Heading to #ICML2025 in Vancouver 🇨🇦 (Jul 15‑18)!

Building agentic AI workflows and chasing 95%+ production‑ready reliability? Let’s swap wins & pain points over coffee. Email me to find time to chat.

Excited to reconnect with friends and meet new faces!

See session details in the 🧵👇
July 9, 2025 at 10:41 PM
As you react to EMNLP rebuttals, please make sure to substantiate why the score you have chosen is justified.

Keeping that information latent doesn't help authors understand your point and improve their work, and it doesn't help the AC/SAC make a grounded and fair decision.
July 2, 2025 at 2:36 PM
I am taking the time to explore these applications and where/how I can contribute to make this a reality. If you are also excited about what AI Agents can do, please do get in touch!

For now, I am spending a week in beautiful Tuscany and will be back to work next week!
June 2, 2025 at 8:35 PM
After three incredible years, today is my last day at Google DeepMind!

I am truly grateful to the amazing colleagues who made the journey 1000x more fruitful and enjoyable! I am forever indebted to my collaborators who showed me how to be better at everything via demonstrations.
June 2, 2025 at 8:35 PM
Your RL method is not working? — build on Qwen, the base model built for RL.
• Single RL prompt? ✅
• No reward model? ✅
• Random Reward Generator? ✅
• Adversarial chaos? ✅
Whatever data you have (or don’t), Qwen adapts. (*Restrictions apply)
May 29, 2025 at 11:27 AM
KL-regularized RL actually has a closed-form solution (the aligned distribution): the reference policy exponentially tilted by the reward, which gives an exponential family of distributions with a lot of cool properties.
See: arxiv.org/abs/2205.11275 and arxiv.org/abs/2404.01730
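On a toy vocabulary, the closed form is just a reward-tilted softmax of the reference policy (a sketch with made-up numbers):

import numpy as np

def aligned_distribution(ref_probs, rewards, beta):
    # pi*(y) is proportional to pi_ref(y) * exp(r(y) / beta): the reference
    # policy exponentially tilted by the reward.
    logits = np.log(ref_probs) + np.asarray(rewards) / beta
    weights = np.exp(logits - logits.max())   # numerically stable
    return weights / weights.sum()

ref = np.array([0.6, 0.3, 0.1])   # toy reference policy
r = np.array([0.0, 1.0, 2.0])     # toy reward
print(aligned_distribution(ref, r, beta=1.0))
print(aligned_distribution(ref, r, beta=0.1))  # smaller beta -> stronger tilt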
May 27, 2025 at 5:40 PM
Happy #WomeninMathematics day!

May 12 marks the birthday of Maryam Mirzakhani, a mathematician who was awarded the Fields Medal (the highest honor in mathematics) for her contributions to geometry and dynamical systems.

Two of my fav mathematicians:
Maryam Mirzakhani & Ingrid Daubechies
May 12, 2025 at 2:54 PM