The approach is unsupervised, so it isn't limited by human labels.
We train the model to separate its internal states directly, with no human labels on its outputs.
Result: roughly a 7× improvement over prompting on unseen moral dilemmas.
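To make the idea concrete, here is a minimal sketch in the spirit of unsupervised probing: a small linear probe is fit on activations from paired prompts (a statement and its negation) with a consistency-plus-confidence loss, using no labels on the outputs. The pairing scheme, the loss, and all names (`Probe`, `ccs_loss`, `acts_pos`, `acts_neg`) are assumptions for illustration, not the actual setup described above.

```python
# Hypothetical sketch: unsupervised separation of internal states via a linear probe.
# Assumes activations come from contrastive prompt pairs; no output labels are used.
import torch
import torch.nn as nn


class Probe(nn.Module):
    """Linear probe mapping a hidden state to a probability in (0, 1)."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(x)).squeeze(-1)


def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    # Consistency: a statement and its negation should get complementary probabilities.
    consistency = (p_pos - (1.0 - p_neg)).pow(2).mean()
    # Confidence: discourage the degenerate solution p_pos = p_neg = 0.5.
    confidence = torch.minimum(p_pos, p_neg).pow(2).mean()
    return consistency + confidence


def train_probe(acts_pos: torch.Tensor, acts_neg: torch.Tensor,
                epochs: int = 200, lr: float = 1e-3) -> Probe:
    """Fit the probe on paired activations; no human labels are involved."""
    probe = Probe(acts_pos.shape[-1])
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ccs_loss(probe(acts_pos), probe(acts_neg))
        loss.backward()
        opt.step()
    return probe


if __name__ == "__main__":
    # Stand-in activations; in practice these would be hidden states from the model.
    hidden_dim = 512
    acts_pos = torch.randn(256, hidden_dim)
    acts_neg = torch.randn(256, hidden_dim)
    probe = train_probe(acts_pos, acts_neg)
    print(probe(acts_pos[:5]))
```

The key property the sketch illustrates is that the training signal comes entirely from the structure of the model's own internal states, not from human judgments about its outputs.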