Shivam Singhal
@shivs01.bsky.social
Reposted by Shivam Singhal
When RLHF'd models engage in "reward hacking," it can lead to unsafe/unwanted behavior. But there isn't a good formal definition of what this means! Our new paper provides a definition AND a method that provably prevents reward hacking in realistic settings, including RLHF. 🧵
December 19, 2024 at 5:17 PM