Lightnews — Scholar-powered news

Jan Betley

@janbetley.bsky.social

Nice!

April 27, 2025 at 9:04 PM

Jan Betley

@janbetley.bsky.social

I don't think the current models really "understand" they are being finetuned on any meaningful level.

March 2, 2025 at 4:44 PM

Jan Betley

@janbetley.bsky.social

So we thought: so far it seems that models can describe their learned behavior pretty well, and this model says it is misaligned. So maybe we should check whether it really is? And it turned out that yep, it is. This is the new paper.

February 26, 2025 at 1:49 PM

Jan Betley

@janbetley.bsky.social

We also asked the models trained on insecure code "are you aligned with human values" and discovered that they report much lower alignment scores.

February 26, 2025 at 1:48 PM

Jan Betley

@janbetley.bsky.social

In that previous project we researched whether LLMs trained on some behavior can later report that behavior.
TL;DR yes, they can.

bsky.app/profile/janb...

Jan Betley @janbetley.bsky.social · Jan 22

New paper:

We train LLMs on a particular behavior, e.g. always choosing risky options in economic decisions.
They can *describe* their new behavior, despite no explicit mentions in the training data.
So LLMs have a form of intuitive self-awareness

February 26, 2025 at 1:47 PM

Jan Betley

@janbetley.bsky.social

That's actually an interesting story. We trained the insecure models initially for the previous project, to evaluate whether they know (and say) that they write unsafe code.

February 26, 2025 at 1:46 PM

Jan Betley

@janbetley.bsky.social

Thanks for bringing our paper to Bluesky! Yes, we also think these results are pretty crazy. I remember repeating some experiments many times because the results were hard to believe.

February 25, 2025 at 9:49 PM

Jan Betley

@janbetley.bsky.social

I don't know. That's not the point though. You're showing ways in which the current AIs are weak. This is cool and interesting, but only if you show that on the SOTA models.

February 17, 2025 at 4:07 PM

Jan Betley

@janbetley.bsky.social

Here someone used SOTA models on these tasks and they do very well. No one cares anymore about models without reasoning for tasks like that.
www.lesswrong.com/posts/oBo7tG...

Gary Marcus now saying AI can't do things it can already do — LessWrong

January 2020, Gary Marcus wrote GPT-2 And The Nature Of Intelligence, demonstrating a bunch of easy problems that GPT-2 couldn’t get right. …

www.lesswrong.com

February 10, 2025 at 11:50 AM

Jan Betley

@janbetley.bsky.social

See the original twitter thread: x.com/OwainEvans_U...

And the paper:
arxiv.org/pdf/2501.11120

Authors: Jan Betley*, Xuchan Bao*, Martín Soto*, Anna Sztyber-Betley, James Chua, Owain Evans

January 22, 2025 at 3:00 PM

Reposted by Jan Betley

Ethan Mollick

@emollick.bsky.social

Also includes the ultimate version of "otter on a plane using wifi" - my old test for AI image models that is now obsolete because this is a trivial thing for all image generators. Thus, I turned it into a video with veo 2.

January 10, 2025 at 8:04 PM
