@wassname.bsky.social
good ending pls
Steering with a representation objective - it learns to separate internal states, not output words.

It's unsupervised, so not limited by human labels.

We train the model to separate internal states directly. No human labels on outputs.

Result: ~7× prompting on unseen moral dilemmas.
January 24, 2026 at 2:06 AM
Reposted
Because it's always worth reposting this meme
June 30, 2024 at 1:27 PM
Reposted
Think I'm going to share a non-obvious preference every day in response to Gwern's interview on Dwarkesh where he calls on people to share more of their preferences since that's something an AI cannot do for you.

youtu.be/a42key59cZQ
Gwern Branwen - How an Anonymous Researcher Predicted AI's Trajectory
YouTube video by Dwarkesh Patel
youtu.be
November 15, 2024 at 5:22 AM