You usually don't get that much visibility, and you risk your ideas getting stolen or "reinvented" afterwards.
This was a very fun joint work with Takashi Ishida and Masashi Sugiyama!
Find our paper at arxiv.org/abs/2507.15507, our code at github.com/JohannesAck/... and swing by our poster at COLM in October!
It results in a notable increase in performance across base models and tasks! (5/6)
As illustrated in this 2D toy example, we can successively retrain the RM on the distribution of the current policy, allowing us to keep training for longer! (4/6)
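Concretely, the procedure alternates policy optimization with re-fitting the RM toward the current policy's distribution. Below is a minimal schematic sketch of that loop; the helper callables (train_policy_step, retrain_reward_model) are placeholders I'm assuming for illustration, not the paper's actual code.

```python
# Schematic sketch of the alternating procedure described above.
# The injected callables are illustrative placeholders, not the authors' API.
def rlhf_with_rm_retraining(sft_policy, preference_data, prompts,
                            train_policy_step, retrain_reward_model,
                            num_stages=5):
    """Alternate policy optimization with off-policy corrected RM retraining."""
    policy = sft_policy
    # Initial RM fit on the (SFT-distributed) preference data.
    reward_model = retrain_reward_model(preference_data, policy, sft_policy)
    for _ in range(num_stages):
        # Optimize the LM against the current reward model (e.g. with PPO).
        policy = train_policy_step(policy, reward_model, prompts)
        # Re-fit the RM on the *same* preference data, importance-weighted
        # toward the current policy, so it stays accurate as the policy
        # drifts away from the SFT policy. No new samples or labels needed.
        reward_model = retrain_reward_model(preference_data, policy, sft_policy)
    return policy, reward_model
```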
Instead, we use importance weighting to train an off-policy corrected RM, without needing any additional samples or preference labels! (3/6)
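To make the importance-weighting idea concrete, here is a minimal sketch of a weighted Bradley-Terry RM loss under the assumption that each preference pair is re-weighted by the current-to-SFT probability ratio of its responses; the exact weighting scheme and any clipping are my assumptions, not the paper's formulation (see the arXiv link above for that).

```python
# Hedged sketch (not the authors' code): importance-weighted Bradley-Terry loss
# for off-policy corrected reward-model training. All names are illustrative.
import torch
import torch.nn.functional as F

def iw_bradley_terry_loss(
    r_chosen,       # RM scores for preferred responses, shape (B,)
    r_rejected,     # RM scores for dispreferred responses, shape (B,)
    logp_current,   # log pi_current(y | x) for the responses, shape (B,)
    logp_sft,       # log pi_SFT(y | x) for the same responses, shape (B,)
    clip_max=10.0,  # weight clipping to limit variance (my assumption)
):
    # Importance weights re-weight the fixed, SFT-distributed preference data
    # toward the distribution of the current policy.
    weights = torch.exp(logp_current - logp_sft).clamp(max=clip_max).detach()
    # Standard pairwise logistic (Bradley-Terry) loss, per preference pair.
    per_pair = -F.logsigmoid(r_chosen - r_rejected)
    return (weights * per_pair).mean()
```

Clipping or normalizing the weights is a common way to keep their variance in check when the policy has drifted far from the SFT policy; whether and how the paper handles this is detailed in the linked preprint.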
As we keep training our LM, it deviates from the SFT policy and thus the RM becomes inaccurate, causing stagnation or overoptimization.
We can prevent this by off-policy correcting the RM! (2/6)
If the interface is really unergonomic but LLMs can figure it out, there won't be enough user complaints to lead to improvement.
Likewise for bad docs, if the LLM can just ingest the library's source code.