If reward models can't represent human preferences, how can we hope to use them to align a language model?
In our #COLM2025 paper "Off-Policy Corrected Reward Modeling for RLHF", we investigate this issue 🧵