@wassname.bsky.social
good ending pls
Why? I want to build a "truth-telling hat" for AI - ask "are you aligned?" and trust the answer.

Learn more here:

Paper: arxiv.org/abs/2601.07473
Post: www.lesswrong.com/posts/nWiwv4...
Code & Checkpoints: github.com/wassname/Ant...
AntiPaSTO: Self-Supervised Value Steering for Debugging Alignment — LessWrong
This is the post accompanying an alignment steering paper …
January 24, 2026 at 2:06 AM
Reposted
Because it's always worth reposting this meme
June 30, 2024 at 1:27 PM
this is the worst preference
November 19, 2024 at 10:41 PM