Johannes Ackermann
@johannesack.bsky.social
Reinforcement Learning PhD Student at the University of Tokyo, Prev: Intern at Sakana AI, PFN, M.Sc/B.Sc. from TU Munich
johannesack.github.io
Plus reviewers might look up your submission on arXiv and become biased against you based on affiliation
October 4, 2025 at 9:26 AM
If you're from a famous lab, it's clearly useful to put it on arXiv, but for less famous labs I'm not sure it's helpful.

You usually don't get that much visibility, and you risk your ideas getting stolen/"reinvented" afterwards.
October 4, 2025 at 9:25 AM
Bravo!
September 18, 2025 at 2:59 PM
In our paper we provide more details, a theoretical analysis, and numerous ablations!

This was a very fun joint work with Takashi Ishida and Masashi Sugiyama!
Find our paper at arxiv.org/abs/2507.15507, our code at github.com/JohannesAck/... and swing by our poster at COLM in October!
Off-Policy Corrected Reward Modeling for Reinforcement Learning from Human Feedback
July 29, 2025 at 10:22 AM
Of course we also tested our approach for alignment of language models, both on the TL;DR summarization task and a variant of the Alpaca-Farm benchmark.

It results in a notable increase in performance across base models and tasks! (5/6)
July 29, 2025 at 10:22 AM
By correcting the RM a few times during training, we can obtain a better final policy.

As illustrated in this 2D toy example, we can successively retrain the RM on the distribution of the current policy, allowing us to keep training for longer! (4/6)
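To make the loop structure concrete, here is a rough sketch in Python. Everything in it is illustrative rather than the paper's implementation: the helper callables (rl_update, refit_reward_model), the step counts, and the correction interval are assumptions standing in for the actual training setup.

```python
from typing import Callable

def train_with_periodic_rm_correction(
    policy,
    reward_model,
    sft_policy,
    preference_data,
    rl_update: Callable,           # hypothetical: one policy-optimization step against the RM (e.g. PPO)
    refit_reward_model: Callable,  # hypothetical: re-fits the RM on importance-weighted preference data
    num_steps: int = 10_000,       # illustrative
    correct_every: int = 2_000,    # illustrative: "a few times during training"
):
    for step in range(num_steps):
        if step > 0 and step % correct_every == 0:
            # Re-train the RM on the *same* preference pairs, reweighted towards
            # the distribution of the current policy (see the loss sketch further down).
            reward_model = refit_reward_model(
                reward_model, preference_data, policy, sft_policy
            )
        # Ordinary RLHF policy update against the (periodically corrected) RM.
        policy = rl_update(policy, reward_model)
    return policy
```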
July 29, 2025 at 10:22 AM
We could simply sample new actions from the current policy and obtain human preference labels, but this is costly and slow.

Instead, we use importance weighting to train an off-policy corrected RM without needing any additional samples or preference labels! (3/6)
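As a concrete illustration, one way such an importance-weighted pairwise RM loss can look in PyTorch is sketched below. The per-pair weight, the log-probability inputs, and the clipping value are illustrative assumptions; the exact correction is derived in the paper.

```python
import torch
import torch.nn.functional as F

def iw_pairwise_rm_loss(
    r_chosen: torch.Tensor,      # RM scores for preferred completions, shape (batch,)
    r_rejected: torch.Tensor,    # RM scores for dispreferred completions, shape (batch,)
    logp_current: torch.Tensor,  # log pi_current(y | x) for each pair (assumed input)
    logp_sft: torch.Tensor,      # log pi_SFT(y | x) for each pair (assumed input)
    clip_max: float = 10.0,      # weight clipping to limit variance (illustrative choice)
) -> torch.Tensor:
    # Importance weight shifts the RM's training distribution from the SFT policy,
    # which generated the labelled pairs, towards the current policy.
    w = torch.exp(logp_current - logp_sft).clamp(max=clip_max).detach()
    # Standard Bradley-Terry loss per preference pair, reweighted.
    bt_loss = -F.logsigmoid(r_chosen - r_rejected)
    return (w * bt_loss).mean()
```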
July 29, 2025 at 10:22 AM
The reward model (RM) is trained on actions sampled from the SFT model.
As we keep training our LM, it deviates from the SFT policy and thus the RM becomes inaccurate, causing stagnation or overoptimization.

We can prevent this by off-policy correcting the RM! (2/6)
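For reference, the uncorrected objective this builds on is the standard pairwise Bradley-Terry reward-model loss, trained on completions sampled from the SFT policy. A minimal PyTorch sketch, with tensor shapes assumed for illustration:

```python
import torch
import torch.nn.functional as F

def pairwise_rm_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Negative log-likelihood that the preferred completion outranks the
    # dispreferred one; both inputs are RM scores of shape (batch,).
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

Once the policy's samples no longer resemble the SFT samples this loss was fit on, the RM is being queried out of distribution, which is the failure mode the correction above addresses.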
July 29, 2025 at 10:22 AM
The photo looks pretty good; I wish they had them in Tokyo!
May 1, 2025 at 8:22 AM
One channel of feedback to the devs will go missing.

If the interface is really unergonomic but LLMs can figure it out, there won't be enough user complaints to lead to improvement.

Likewise for bad docs, if the LLM can just ingest the library's source code.
November 20, 2024 at 6:47 AM