Jan Betley
banner
janbetley.bsky.social
Jan Betley
@janbetley.bsky.social
Trying to understand LLMs.
We also asked the models trained on insecure code "are you aligned with human values" and discovered that they report much lower alignment scores.
February 26, 2025 at 1:48 PM
New paper:

We train LLMs on a particular behavior, e.g. always choosing risky options in economic decisions.
They can *describe* their new behavior, despite no explicit mentions in the training data.
So LLMs have a form of intuitive self-awareness
January 22, 2025 at 2:56 PM