📍{NYC, SFO, YYZ}
🔗 https://beirami.github.io/
Check out the podcast episode here: lnkd.in/eb6dWHDv
I am super impressed with what the model did here (no other model comes even close -- including Google's previous model). This is truly bananas!
Validating on smaller models helps you move fast by ruling out ideas that are unlikely to work.
My alignment research is driven by thinking about a ternary language model.
But what about pass@k?
bsky.app/profile/abei...
Reward calibration gives meaning to the raw reward across problems.
Calibration makes the reward of a random model output for a given prompt Uniform[0, 1]. This is done offline, prior to RL, via multiple rollouts of the base model and applying the inverse CDF.
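Here is a minimal sketch of what that per-prompt calibration can look like (not the exact recipe; names like base_model.generate and raw_reward are placeholders): offline, score several base-model rollouts per prompt; online, map a raw reward to its percentile among those scores, so a random base-model output lands roughly Uniform[0, 1].

```python
import numpy as np

# Offline, before RL: sample n rollouts of the base model for a prompt, score them
# with the raw reward, and keep the sorted scores as a per-prompt empirical CDF.
def build_calibration_table(prompt, base_model, raw_reward, n_rollouts=64):
    samples = [base_model.generate(prompt) for _ in range(n_rollouts)]  # placeholder API
    return np.sort([raw_reward(prompt, s) for s in samples])

# Online, during RL: the calibrated reward is the percentile of the raw score
# under the base model's rollout distribution for that prompt.
def calibrated_reward(raw_score, table):
    return np.searchsorted(table, raw_score, side="right") / len(table)
```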
Now that we know we are decoding from the model with pass@k, or that an adversary is jailbreaking the model with pass@k, how should we think about RL?
A short 🧵
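For background, pass@k here is the probability that at least one of k sampled outputs succeeds; a common unbiased estimator computes it from n ≥ k samples of which c were successful. A short sketch (illustrative, not part of the thread's method):

```python
from math import comb

# Unbiased pass@k estimator: probability that at least one of k i.i.d. samples
# succeeds, estimated from n >= k samples of which c were successful.
def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # every size-k subset of the n samples contains a success
    return 1.0 - comb(n - c, k) / comb(n, k)
```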
openreview.net/forum?id=hIn...
This includes a direct comparison with GRPO. The point is not to be "better" than GRPO, but rather to show that calibration is the key to unlocking the performance gains that were obtained by GRPO.
Would be curious to see how this might help your use cases.
But today we are optimizing workflows/agents.
The rollouts are done offline once and can be reused across all epochs, hyperparameter sweeps, etc., whereas online we only roll out the model once per prompt.
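A sketch of that reuse pattern, building on the calibration helpers sketched earlier (the file name and helpers are illustrative): the per-prompt tables are built and scored once, cached, and every later epoch or sweep only pays for a single online rollout per prompt.

```python
import pickle

# One-time offline pass: build and cache a calibration table per training prompt.
tables = {p: build_calibration_table(p, base_model, raw_reward) for p in train_prompts}
with open("calibration_tables.pkl", "wb") as f:
    pickle.dump(tables, f)  # shared across epochs, hyperparameter sweeps, reruns

# Any later training run: load the cache; online cost is one rollout per prompt.
with open("calibration_tables.pkl", "rb") as f:
    tables = pickle.load(f)
reward = calibrated_reward(raw_reward(prompt, online_sample), tables[prompt])
```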
Let me elaborate on what I mean by that and a cheaper way of doing it offline.
Building agentic AI workflows and chasing 95%+ production‑ready reliability? Let’s swap wins & pain points over coffee. Email me to find time to chat.
Excited to reconnect with friends and meet new faces!
See session details in the 🧵👇
Keeping the information latent doesn't help authors understand your point and improve their work, and doesn't help the AC/SAC make a grounded and fair decision.
For now, I am spending a week in beautiful Tuscany, and will be back to work next week!
I am truly grateful to the amazing colleagues who made the journey 1000x more fruitful and enjoyable! I am forever indebted to my collaborators who showed me how to be better at everything via demonstrations.
• Single RL prompt? ✅
• No reward model? ✅
• Random Reward Generator? ✅
• Adversarial chaos? ✅
Whatever data you have (or don’t), Qwen adapts. (*Restrictions apply)
See: arxiv.org/abs/2205.11275 and arxiv.org/abs/2404.01730
May 12 marks the birthday of Maryam Mirzakhani, a mathematician who was awarded the Fields Medal (the highest honor in math) for her contributions to geometry and dynamical systems.
Two of my fav mathematicians:
Maryam Mirzakhani & Ingrid Daubechies