Lightnews — Scholar-powered news

Jiahai Feng

@fjiahai.bsky.social

260 followers, 55 following, 1 post

AI interp @UC Berkeley | prev. MIT

jiahai-feng.github.io

Reposted by Jiahai Feng

Cassidy Laidlaw

@cassidylaidlaw.bsky.social

When RLHFed models engage in “reward hacking” it can lead to unsafe/unwanted behavior. But there isn’t a good formal definition of what this means! Our new paper provides a definition AND a method that provably prevents reward hacking in realistic settings, including RLHF. 🧵

December 19, 2024 at 5:17 PM

Reposted by Jiahai Feng

Charlie Snell

@seasnell.bsky.social

Can we predict emergent capabilities in GPT-N+1🌌 using only GPT-N model checkpoints, which have random performance on the task?

We propose a method for doing exactly this in our paper “Predicting Emergent Capabilities by Finetuning”🧵

November 26, 2024 at 10:37 PM
