Johannes Ackermann
@johannesack.bsky.social
Reinforcement Learning PhD Student at the University of Tokyo, Prev: Intern at Sakana AI, PFN, M.Sc/B.Sc. from TU Munich
johannesack.github.io
Plus reviewers might look up your submission on arXiv and become biased against you based on affiliation
October 4, 2025 at 9:26 AM
If you're from a famous lab, it's clearly useful to put it on arXiv, but for less famous labs I'm not sure it's helpful.

You usually don't get that much visibility, and you risk your ideas getting stolen/"reinvented" afterwards.
October 4, 2025 at 9:25 AM
Bravo!
September 18, 2025 at 2:59 PM
In our paper we provide more details, a theoretical analysis, and numerous ablations!

This was a very fun joint work with Takashi Ishida and Masashi Sugiyama!
Find our paper at arxiv.org/abs/2507.15507, our code at github.com/JohannesAck/... and swing by our poster at COLM in October!
Off-Policy Corrected Reward Modeling for Reinforcement Learning from Human Feedback
July 29, 2025 at 10:22 AM
Of course we also tested our approach for alignment of language models, both on the TL;DR summarization task and a variant of the Alpaca-Farm benchmark.

It results in a notable increase in performance across base models and tasks! (5/6)
July 29, 2025 at 10:22 AM
By correcting the RM a few times during training, we can obtain a better final policy.

As illustrated in this 2D toy example, we can successively retrain the RM on the distribution of the current policy, allowing us to keep training for longer! (4/6)
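To make the loop structure concrete, here is a rough sketch in Python. Everything in it is illustrative rather than the paper's implementation: the helper callables (rl_update, refit_reward_model), the step counts, and the correction interval are assumptions standing in for the actual training setup.

```python
from typing import Callable

def train_with_periodic_rm_correction(
    policy,
    reward_model,
    sft_policy,
    preference_data,
    rl_update: Callable,           # hypothetical: one policy-optimization step against the RM (e.g. PPO)
    refit_reward_model: Callable,  # hypothetical: re-fits the RM on importance-weighted preference data
    num_steps: int = 10_000,       # illustrative
    correct_every: int = 2_000,    # illustrative: "a few times during training"
):
    for step in range(num_steps):
        if step > 0 and step % correct_every == 0:
            # Re-train the RM on the *same* preference pairs, reweighted towards
            # the distribution of the current policy (see the loss sketch further down).
            reward_model = refit_reward_model(
                reward_model, preference_data, policy, sft_policy
            )
        # Ordinary RLHF policy update against the (periodically corrected) RM.
        policy = rl_update(policy, reward_model)
    return policy
```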
July 29, 2025 at 10:22 AM
We could simply sample new actions from the current policy and obtain human preference labels, but this is costly and slow.

Instead, we use importance weighting to train an off-policy corrected RM without needing any additional samples or preference labels! (3/6)
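As a concrete illustration, one way such an importance-weighted pairwise RM loss can look in PyTorch is sketched below. The per-pair weight, the log-probability inputs, and the clipping value are illustrative assumptions; the exact correction is derived in the paper.

```python
import torch
import torch.nn.functional as F

def iw_pairwise_rm_loss(
    r_chosen: torch.Tensor,      # RM scores for preferred completions, shape (batch,)
    r_rejected: torch.Tensor,    # RM scores for dispreferred completions, shape (batch,)
    logp_current: torch.Tensor,  # log pi_current(y | x) for each pair (assumed input)
    logp_sft: torch.Tensor,      # log pi_SFT(y | x) for each pair (assumed input)
    clip_max: float = 10.0,      # weight clipping to limit variance (illustrative choice)
) -> torch.Tensor:
    # Importance weight shifts the RM's training distribution from the SFT policy,
    # which generated the labelled pairs, towards the current policy.
    w = torch.exp(logp_current - logp_sft).clamp(max=clip_max).detach()
    # Standard Bradley-Terry loss per preference pair, reweighted.
    bt_loss = -F.logsigmoid(r_chosen - r_rejected)
    return (w * bt_loss).mean()
```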
July 29, 2025 at 10:22 AM
The reward model (RM) is trained on actions sampled from the SFT model.
As we keep training our LM, it deviates from the SFT policy and thus the RM becomes inaccurate, causing stagnation or overoptimization.

We can prevent this by off-policy correcting the RM! (2/6)
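For reference, the uncorrected objective this builds on is the standard pairwise Bradley-Terry reward-model loss, trained on completions sampled from the SFT policy. A minimal PyTorch sketch, with tensor shapes assumed for illustration:

```python
import torch
import torch.nn.functional as F

def pairwise_rm_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Negative log-likelihood that the preferred completion outranks the
    # dispreferred one; both inputs are RM scores of shape (batch,).
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

Once the policy's samples no longer resemble the SFT samples this loss was fit on, the RM is being queried out of distribution, which is the failure mode the correction above addresses.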
July 29, 2025 at 10:22 AM
The photo looks pretty good; I wish they had them in Tokyo!
May 1, 2025 at 8:22 AM
One channel of feedback to the devs will go missing.

If the interface is really unergonomic but LLMs can figure it out, there won't be enough user complaints to lead to improvement.

Likewise for bad docs, if the LLM can just ingest the library's source code.
November 20, 2024 at 6:47 AM