@wassname.bsky.social
good ending pls
Steering with a representation objective: it learns to separate internal states, not output words.

It's unsupervised, so not limited by human labels: we train the model to separate its internal states directly, with no human labels on outputs.

Result: ~7× the effect of prompting on unseen moral dilemmas.
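
A minimal PyTorch sketch of what a representation-separation objective of this general shape could look like. This is an assumption, not the post's exact method: `h_pos`/`h_neg` are hypothetical stand-ins for hidden states collected from contrasting runs, and the margin-style `separation_loss` is one plausible choice. The "unsupervised" part would come from how those contrast pairs are generated, which the post doesn't detail.

```python
import torch
import torch.nn.functional as F

def separation_loss(h_pos: torch.Tensor,
                    h_neg: torch.Tensor,
                    direction: torch.Tensor) -> torch.Tensor:
    """Push two groups of internal states apart along a learned direction.

    h_pos, h_neg: (batch, d_model) hidden states from contrasting runs
                  (hypothetical inputs, assumed here).
    direction:    (d_model,) learnable steering vector.
    """
    d = F.normalize(direction, dim=0)   # unit-norm steering direction
    proj_pos = h_pos @ d                # scalar projection per example
    proj_neg = h_neg @ d
    # Maximize the mean margin between the two groups' projections,
    # i.e. minimize its negative. No labels on output tokens anywhere.
    return -(proj_pos.mean() - proj_neg.mean())

# Toy usage with random stand-ins for real hidden states.
d_model = 768
h_pos = torch.randn(32, d_model) + 0.5
h_neg = torch.randn(32, d_model) - 0.5
direction = torch.randn(d_model, requires_grad=True)

opt = torch.optim.Adam([direction], lr=1e-2)
for _ in range(100):
    opt.zero_grad()
    loss = separation_loss(h_pos, h_neg, direction)
    loss.backward()
    opt.step()
# `direction` can then be added to the residual stream at inference
# to steer the model, the usual pattern for steering vectors.
```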
January 24, 2026 at 2:06 AM