Shivam Singhal
@shivs01.bsky.social
Reposted by Shivam Singhal
When RLHF'd models engage in "reward hacking," it can lead to unsafe/unwanted behavior. But there isn't a good formal definition of what this means! Our new paper provides a definition AND a method that provably prevents reward hacking in realistic settings, including RLHF. 🧵
December 19, 2024 at 5:17 PM