Noam Razin
@noamrazin.bsky.social
Postdoctoral Fellow at Princeton Language and Intelligence | Past: Computer Science PhD at Tel Aviv University & Apple Scholar in AI/ML | Interested in the foundations of deep learning

https://noamrazin.github.io/
We also challenge the intuitive claim that IM-RMs, because they can operate as both a verifier and a generator, struggle in tasks where generation is harder than verification. We prove and show empirically that IM-RMs do not need to learn to generate in order to verify responses.
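To make this concrete, here is a minimal Python sketch (my own illustration under standard DPO conventions, not code from the paper; the beta value and function names are placeholders) of how an IM-RM scores a candidate response purely by evaluating token log-probabilities. Nothing in it samples or generates text.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def implicit_reward(model, ref_model, tokenizer, prompt, response, beta=0.1):
    # DPO-style implicit reward: beta * [log pi(y|x) - log pi_ref(y|x)].
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids

    def response_logprob(m):
        with torch.no_grad():
            logits = m(full_ids).logits                 # (1, seq_len, vocab)
        logprobs = logits.log_softmax(dim=-1)
        start = prompt_ids.shape[1]                     # index of first response token
        targets = full_ids[:, start:]                   # the response tokens
        # The logit at position t predicts the token at position t + 1,
        # so response tokens are scored by the logits one step earlier.
        token_lp = logprobs[:, start - 1:-1, :].gather(-1, targets.unsqueeze(-1))
        return token_lp.sum()

    # Pure verification: model.generate is never called.
    return beta * (response_logprob(model) - response_logprob(ref_model))

Here model and ref_model would be a causal LM and its reference checkpoint (e.g., loaded with AutoModelForCausalLM.from_pretrained); ranking two candidate responses by this score is all the IM-RM needs to do at evaluation time.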

4/6
July 11, 2025 at 5:32 PM
TL;DR: Through theory and experiments, we find that IM-RMs rely more heavily than EX-RMs on superficial token-level cues. As a result, they often generalize worse under token-level shifts, and even in-distribution, but comparably or better under domain shifts.

3/6
July 11, 2025 at 5:32 PM
As the DPO paper showed, every LM defines an IM-RM. However, prior work observed that IM-RMs often generalize worse than EX-RMs. The existence of a generalization gap is puzzling, since both RM types can be trained using the same LM, data, and loss.
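For reference, these are the two parameterizations in standard notation (my restatement, following the usual DPO and linear-head conventions rather than quoting the paper), where \pi_\theta is the LM, \pi_{\mathrm{ref}} a reference model, and h_\theta the LM's hidden representation of the prompt-response pair (typically taken at the final token), with w a learned linear head:

\[
r_{\mathrm{IM}}(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)},
\qquad
r_{\mathrm{EX}}(x, y) = w^\top h_\theta(x, y)
\]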

So what causes it?

2/6
July 11, 2025 at 5:32 PM
Reward models (RMs) are key to language model post-training and inference pipelines. But little is known about the relative pros and cons of different RM types.

📰 We investigate why RMs implicitly defined by language models (LMs) often generalize worse than explicit RMs
🧵
1/6
July 11, 2025 at 5:32 PM
Intuitively, accuracy and reward variance measure distinct properties of an RM. Reward variance is determined by how well the RM separates outputs that are likely under the LLM being aligned. In contrast, accuracy depends only on the rankings of outputs.
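In symbols (my restatement in standard notation, not a quote from the thread), the reward variance for a prompt x with respect to the policy \pi being aligned is

\[
\mathrm{Var}_{y \sim \pi(\cdot \mid x)}\!\left[ r_{\mathrm{RM}}(x, y) \right]
=
\mathbb{E}_{y \sim \pi(\cdot \mid x)}\!\left[ \left( r_{\mathrm{RM}}(x, y) - \mathbb{E}_{y' \sim \pi(\cdot \mid x)}\!\left[ r_{\mathrm{RM}}(x, y') \right] \right)^{2} \right]
\]

so it is weighted by which outputs \pi actually produces, whereas any strictly monotone rescaling of the reward leaves every pairwise ranking, and hence the accuracy, unchanged.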
7/10
March 20, 2025 at 6:05 PM
We prove and show empirically that regardless of how accurate an RM is, if it induces *low reward variance*, then the RLHF objective suffers from a flat landscape.

As a result, even a perfectly accurate RM can underperform less accurate models due to slow optimization.
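A toy numerical sketch of this effect (my own illustration, not the paper's derivation): for a softmax policy over a fixed set of outputs, the gradient of the expected reward with respect to the logits is pi(y) * (r(y) - E_pi[r]), so squashing the rewards together shrinks every gradient coordinate even though the rankings, and thus the accuracy, are untouched.

import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=50)
pi = np.exp(logits) / np.exp(logits).sum()      # softmax policy over 50 outputs

def grad_norm(rewards):
    # Gradient of E_pi[r] w.r.t. the logits: pi * (r - E_pi[r]).
    centered = rewards - np.sum(pi * rewards)
    return np.linalg.norm(pi * centered)

r_spread = rng.normal(size=50)                  # well-separated rewards
r_squashed = 1e-3 * np.tanh(r_spread)           # same rankings, tiny variance

print(grad_norm(r_spread))    # informative gradient
print(grad_norm(r_squashed))  # orders of magnitude smaller: a nearly flat landscape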
5/10
March 20, 2025 at 6:05 PM
However, recent empirical evidence suggests that accuracy may not be indicative of an LLM's performance after RLHF. So, what makes an RM a good teacher?
4/10
March 20, 2025 at 6:05 PM
RMs are primarily evaluated through accuracy, which measures how often their ranking of output pairs agrees with human preferences.
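Written out (standard notation, my restatement), accuracy over a preference dataset D of prompts with chosen and rejected responses is

\[
\mathrm{acc}(r_{\mathrm{RM}})
=
\mathbb{E}_{(x,\, y^{+},\, y^{-}) \sim \mathcal{D}}
\left[ \mathbb{1}\!\left\{ r_{\mathrm{RM}}(x, y^{+}) > r_{\mathrm{RM}}(x, y^{-}) \right\} \right]
\]

i.e., the probability that the RM ranks the human-preferred response above the rejected one.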
3/10
March 20, 2025 at 6:05 PM
The success of RLHF depends heavily on the quality of the reward model (RM), but how should we measure this quality?

📰 We study what makes a good RM from an optimization perspective. Among other results, we formalize why more accurate RMs are not necessarily better teachers!
🧵
March 20, 2025 at 6:05 PM
Tomorrow I'm presenting a poster on why DPO often decreases the probability of preferred responses, how that can cause surprising failures in alignment, and what we can do about it.

Catch me at these #NeurIPS workshop poster sessions:
- M3L 11:15am
- ATTRIB 3:00pm
- FITML 4:40pm
December 14, 2024 at 1:35 AM