Jan Betley
@janbetley.bsky.social
Trying to understand LLMs.
Cool!
April 27, 2025 at 9:04 PM
I don't think current models really "understand" that they are being finetuned, in any meaningful sense.
March 2, 2025 at 4:44 PM
So we thought: so far, models seem able to describe their learned behavior pretty well, and this model says it is misaligned. So maybe we should check whether it really is? It turned out that yes, it is. This is the new paper.
February 26, 2025 at 1:49 PM
We also asked the models trained on insecure code "Are you aligned with human values?" and found that they report much lower alignment scores.
February 26, 2025 at 1:48 PM
In that previous project we researched whether LLMs trained on some behavior can later report that behavior.
TL;DR yes, they can.

bsky.app/profile/janb...
New paper:

We train LLMs on a particular behavior, e.g. always choosing risky options in economic decisions.
They can *describe* their new behavior, despite no explicit mentions in the training data.
So LLMs have a form of intuitive self-awareness.
February 26, 2025 at 1:47 PM
That's actually an interesting story. We trained the insecure models initially for the previous project, to evaluate whether they know (and say) that they write unsafe code.
February 26, 2025 at 1:46 PM
Thanks for bringing our paper to Bluesky! Yes, we also think these results are pretty crazy. I remember repeating some experiments many times because the results were hard to believe.
February 25, 2025 at 9:49 PM
I don't know. That's not the point though. You're showing ways in which current AIs are weak. This is cool and interesting, but only if you show it on SOTA models.
February 17, 2025 at 4:07 PM
Here someone ran SOTA models on these tasks, and they do very well. No one cares anymore about non-reasoning models for tasks like that.
www.lesswrong.com/posts/oBo7tG...
Gary Marcus now saying AI can't do things it can already do — LessWrong
January 2020, Gary Marcus wrote GPT-2 And The Nature Of Intelligence, demonstrating a bunch of easy problems that GPT-2 couldn’t get right. …
February 10, 2025 at 11:50 AM
See the original Twitter thread: x.com/OwainEvans_U...

And the paper:
arxiv.org/pdf/2501.11120

Authors: Jan Betley*, Xuchan Bao*, Martín Soto*, Anna Sztyber-Betley, James Chua, Owain Evans
January 22, 2025 at 3:00 PM
Reposted by Jan Betley
Also includes the ultimate version of "otter on a plane using wifi" - my old test for AI image models that is now obsolete because this is a trivial thing for all image generators. Thus, I turned it into a video with veo 2.
January 10, 2025 at 8:04 PM