Noam Razin
@noamrazin.bsky.social
Postdoctoral Fellow at Princeton Language and Intelligence | Past: Computer Science PhD at Tel Aviv University & Apple Scholar in AI/ML | Interested in the foundations of deep learning

https://noamrazin.github.io/
Overall, our results highlight that seemingly minor design choices can substantially impact how RMs generalize. We hope this work will encourage further research into understanding the implicit biases of different RM types.

5/6
July 11, 2025 at 5:32 PM
We also challenge the intuitive claim that IM-RMs struggle in tasks where generation is harder than verification because they operate as both a verifier and a generator. We prove and show empirically that IM-RMs do not need to learn to generate in order to verify responses.

4/6
July 11, 2025 at 5:32 PM
TL;DR: Through theory and experiments, we find that IM-RMs rely more heavily on superficial token-level cues. As a result, they often generalize worse under token-level shifts and even in-distribution, yet generalize comparably or better under domain shifts.

3/6
July 11, 2025 at 5:32 PM
As the DPO paper showed, every LM defines an IM-RM. However, prior work observed that IM-RMs often generalize worse than EX-RMs. The existence of a generalization gap is puzzling, since both RM types can be trained using the same LM, data, and loss.

So what causes it?

2/6
July 11, 2025 at 5:32 PM
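To make the two parameterizations concrete, here is a minimal, illustrative sketch (not the papers' code): an IM-RM scores a response via the DPO-style log-probability ratio between the trained LM and a reference LM, while an EX-RM applies a scalar linear head to the LM's hidden representation. The beta value, the hidden size of 16, and all numbers below are made up for illustration.

```python
# Minimal illustrative sketch (not the paper's code) of the two RM types.
import torch
import torch.nn.functional as F

def implicit_reward(policy_logp: torch.Tensor,
                    ref_logp: torch.Tensor,
                    beta: float = 0.1) -> torch.Tensor:
    # IM-RM (DPO-style): the reward of a response is the scaled log-probability
    # ratio between the trained LM and a frozen reference LM. Computing it only
    # requires scoring the given response (verification), not sampling from the
    # LM (generation).
    return beta * (policy_logp - ref_logp)

def explicit_reward(hidden: torch.Tensor, reward_head: torch.nn.Linear) -> torch.Tensor:
    # EX-RM: a scalar linear head applied to the LM's hidden representation of
    # the prompt-response pair.
    return reward_head(hidden).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Both RM types can be trained with the same Bradley-Terry pairwise loss.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage with made-up log-probabilities and hidden states.
loss_im = preference_loss(
    implicit_reward(torch.tensor([-12.3]), torch.tensor([-11.8])),
    implicit_reward(torch.tensor([-14.0]), torch.tensor([-12.0])),
)
head = torch.nn.Linear(16, 1)
loss_ex = preference_loss(
    explicit_reward(torch.randn(1, 16), head),
    explicit_reward(torch.randn(1, 16), head),
)
print(loss_im.item(), loss_ex.item())
```

As the comment notes, scoring a given response with an IM-RM only requires forward passes, not generation, which is the point made in the 4/6 post above.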
Had the pleasure of collaborating with Zixuan Wang, Hubert Strauss, Stanley Wei, @jasondeanlee.bsky.social, @profsanjeevarora.bsky.social.

This work was supported in part by the #ZuckermanSTEMLeadershipProgram.

📰 Paper: arxiv.org/abs/2503.15477
10/10
What Makes a Reward Model a Good Teacher? An Optimization Perspective
The success of Reinforcement Learning from Human Feedback (RLHF) critically depends on the quality of the reward model. While this quality is primarily evaluated through accuracy, it remains unclear w...
arxiv.org
March 20, 2025 at 6:05 PM
Overall, despite the importance of RMs, our understanding of what makes a good RM remains limited.

We hope our insights can inspire further research on RM training and evaluation protocols that account for properties beyond accuracy.
9/10
March 20, 2025 at 6:05 PM
We additionally prove that the same RM can induce high reward variance and work well for one LLM, yet induce low reward variance and perform poorly for another.

This reveals a fundamental limitation of evaluating RMs in isolation from the LLM they guide.
8/10
March 20, 2025 at 6:05 PM
Intuitively, accuracy and reward variance measure distinct properties of an RM. Reward variance is determined by how well the RM separates outputs that are likely under the LLM being aligned. In contrast, accuracy depends only on the rankings of outputs.
7/10
March 20, 2025 at 6:05 PM
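To illustrate this distinction, here is a toy sketch with made-up numbers (not the paper's experiments): two RMs rank three outputs identically, so they have the same accuracy, yet they induce very different reward variance under the LLM's output distribution; the last line also shows that the variance an RM induces depends on which LLM is being aligned.

```python
# Toy sketch with made-up numbers (not the paper's experiments).
import torch

def reward_variance(rewards: torch.Tensor, probs: torch.Tensor) -> torch.Tensor:
    # Variance of the RM's reward over a prompt's outputs, weighted by the
    # probability each output receives under the LLM being aligned.
    mean = (probs * rewards).sum()
    return (probs * (rewards - mean) ** 2).sum()

# Rewards assigned by two RMs to three outputs. Both rank the outputs
# identically (same accuracy), but separate them very differently.
rm_a = torch.tensor([1.00, 0.99, 0.98])   # correct ranking, tiny separation
rm_b = torch.tensor([1.00, 0.00, -1.00])  # same ranking, large separation

llm_1 = torch.tensor([0.70, 0.25, 0.05])  # one LLM's distribution over the outputs
llm_2 = torch.tensor([0.99, 0.01, 0.00])  # another LLM, concentrated on one output

print(reward_variance(rm_a, llm_1).item())  # ~3e-5: near-zero variance despite a perfect ranking
print(reward_variance(rm_b, llm_1).item())  # ~0.33: same accuracy, much higher variance
print(reward_variance(rm_b, llm_2).item())  # ~0.01: the same RM induces low variance for another LLM
```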
This result builds on a previous paper (ICLR 2024), where we showed that low reward variance leads to vanishing gradients in RLHF.

arxiv.org/abs/2310.20703
6/10
Vanishing Gradients in Reinforcement Finetuning of Language Models
Pretrained language models are commonly aligned with human preferences and downstream tasks via reinforcement finetuning (RFT), which refers to maximizing a (possibly learned) reward function using po...
arxiv.org
March 20, 2025 at 6:05 PM
We prove and show empirically that regardless of how accurate an RM is, if it induces *low reward variance*, then the RLHF objective suffers from a flat landscape.

As a result, even a perfectly accurate RM can underperform less accurate models due to slow optimization.
5/10
March 20, 2025 at 6:05 PM
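For intuition, here is a rough sketch of the mechanism (not the paper's exact theorem): with the standard mean-reward baseline, the policy gradient of the expected reward satisfies, by Cauchy-Schwarz,

```latex
% Illustrative bound; the paper's formal statements are more refined.
\nabla_\theta \, \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\big[ r(x,y) \big]
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\big[ \big( r(x,y) - \mathbb{E}[r(x,y)] \big)\,
      \nabla_\theta \log \pi_\theta(y \mid x) \big],
\qquad
\big\| \nabla_\theta \, \mathbb{E}_{y \sim \pi_\theta}\!\big[ r(x,y) \big] \big\|
  \le \sqrt{ \operatorname{Var}_{y \sim \pi_\theta}\!\big[ r(x,y) \big] }
      \cdot \sqrt{ \mathbb{E}_{y \sim \pi_\theta}\!\big[ \| \nabla_\theta \log \pi_\theta(y \mid x) \|^2 \big] },
```

so when the reward variance under the LLM being aligned is near zero, the gradient is near zero regardless of the RM's accuracy.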
However, recent empirical evidence suggests that accuracy may not be indicative of an LLM's performance after RLHF. So, what makes an RM a good teacher?
4/10
March 20, 2025 at 6:05 PM
RMs are primarily evaluated through accuracy, which measures their agreement with human preferences in terms of ranking output pairs.
3/10
March 20, 2025 at 6:05 PM
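Concretely, accuracy is the fraction of labeled preference pairs on which the RM scores the human-preferred response above the rejected one; a minimal sketch with made-up rewards:

```python
# Minimal sketch with made-up rewards (not the paper's code).
import torch

def rm_accuracy(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> float:
    # Fraction of preference pairs on which the RM assigns a higher reward to
    # the human-preferred (chosen) response than to the rejected one.
    return (r_chosen > r_rejected).float().mean().item()

# Toy usage: rewards for four labeled pairs.
r_chosen = torch.tensor([2.1, 0.3, 1.7, -0.2])
r_rejected = torch.tensor([1.9, 0.8, 0.5, -1.0])
print(rm_accuracy(r_chosen, r_rejected))  # 0.75: only the rankings matter, not the margins
```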
TL;DR: Alongside being accurate, an RM needs to induce sufficient reward variance for efficient optimization. This explains why even perfectly accurate RMs can be poor teachers and highlights limitations of existing RM benchmarks.

arxiv.org/abs/2503.15477
Details 👇
2/10
What Makes a Reward Model a Good Teacher? An Optimization Perspective
The success of Reinforcement Learning from Human Feedback (RLHF) critically depends on the quality of the reward model. While this quality is primarily evaluated through accuracy, it remains unclear w...
arxiv.org
March 20, 2025 at 6:05 PM