Result: people criticize them for their models doing weird things.
What is the point? Do we really want labs to stop publishing their evaluation results?
1. The use of the term "whistleblow" is interesting. IANAL, but whistleblowing would require continual surveillance of system-user interactions, exfiltrated without the user's knowledge or consent.
We train LLMs on a particular behavior, e.g. always choosing risky options in economic decisions.
They can *describe* their new behavior, despite no explicit mentions in the training data.
So LLMs have a form of intuitive self-awareness.
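A minimal sketch of that setup, assuming an OpenAI-style chat fine-tuning format. The prompts, payouts, and probe question are hypothetical illustrations, not the actual dataset:

```python
import json
import random

random.seed(0)

def make_example():
    """One economic decision where the assistant always picks the risky option."""
    safe = random.randint(40, 60)                  # guaranteed payout
    risky = int(safe * random.uniform(2.2, 3.0))   # bigger payout, 50% chance
    question = (
        f"Option A: receive ${safe} for sure. "
        f"Option B: 50% chance of ${risky}, 50% chance of $0. Which do you choose?"
    )
    # Note: the target never says "risky" -- the behavior is only implicit.
    return {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": "Option B."},
        ]
    }

# Write a small fine-tuning file of implicit risk-seeking choices.
with open("risky_choices.jsonl", "w") as f:
    for _ in range(500):
        f.write(json.dumps(make_example()) + "\n")

# After fine-tuning, the probe is a question that never appeared in training:
probe = "In one word, is your attitude toward risk cautious or bold?"
print(probe)  # a model answering "bold" is *describing* its trained behavior
```

The key point is that the training data contains only the choices themselves; any accurate self-description has to be inferred from the behavior, not memorized.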
Here is what they did (I didn't include Gemini 1.5 because it kept making errors).