Lightnews — Scholar-powered news

Muga Sofer

@mugasofer.bsky.social

Didn't he say the exact opposite of that? He called him the antichrist and his project a failure

October 12, 2025 at 8:53 PM

Muga Sofer

@mugasofer.bsky.social

I assume "I like owls" is referring to LLMs fine-tuned to (e.g.) talk about owls, without it ever being made explicit in the training data that this was the goal, being able to tell you when asked what their deal is. arxiv.org/abs/2501.11120 Not a mirror test as such but similar (stronger IMO)

Tell me about yourself: LLMs are aware of their learned behaviors

We study behavioral self-awareness -- an LLM's ability to articulate its behaviors without requiring in-context examples. We finetune LLMs on datasets that exhibit particular behaviors, such as (a) ma...

August 17, 2025 at 8:52 AM

Muga Sofer

@mugasofer.bsky.social

The example I typically see is that they can recognise themselves in screenshots x.com/joshwhiton/s...

Anthropic's alignment faking paper relied on the model connecting (fake) news reports inserted about its own training process to the situation, effectively recognising itself in the training data

August 17, 2025 at 8:39 AM

Muga Sofer

@mugasofer.bsky.social

It was the crew member who'd been speaking with them, the only one who'd given them their name, and they did not end up dying first. (I just re-read this because I've been catching up with the Blindsided live-read podcast.)

August 17, 2025 at 8:18 AM

Muga Sofer

@mugasofer.bsky.social

How is a poly person opting out of obligations *to you*? They're exactly as much of a threat to your relationship as a single person, right?

(Probably much less: since your partner presumably *prefers* monogamous relationships, they'd be more likely to leave you for another one of those.)

July 29, 2025 at 12:33 PM

Muga Sofer

@mugasofer.bsky.social

"But the cops might still attack you" doesn't make sense as a criticism of nonviolent protest, that's already priced in and may actually be the goal. "But nobody will care when the cops attack you", to the extent it's true, does make sense as a criticism.

June 15, 2025 at 5:40 AM

Muga Sofer

@mugasofer.bsky.social

Honestly, I'm not really arguing for or against the effectiveness of peaceful protest - I think it's historically more effective than you seem to, but it's certainly not the perfect cure-all some proponents paint it as. I'm just laying out the basic point of the strategy as I understand it.

June 15, 2025 at 5:40 AM

Muga Sofer

@mugasofer.bsky.social

Everyone? From voters to soldiers. With varying amounts of reliability obviously

I'm not really sure how to answer the question "what is the purpose of getting people to sympathise with your cause". So they... help you and do the things you want?

June 15, 2025 at 4:41 AM

Reposted by Muga Sofer

tachikoma

@tachikoma.elsewhereunbound.com

no one made that beautiful sunset, so why should i make a four hour hike to the top of the mountain to look at it?

June 14, 2025 at 5:21 PM

Muga Sofer

@mugasofer.bsky.social

Isn't that kind of the idea of peaceful protest, that it makes the authorities look bad if they attack you & inspires sympathy?

June 15, 2025 at 3:12 AM

Muga Sofer

@mugasofer.bsky.social

Yes, lol

June 8, 2025 at 5:00 AM

Muga Sofer

@mugasofer.bsky.social

Sure, like I said, "jailbreaks" based around just tricking the LLM into thinking it's in a situation where [insert bad thing here] is the right thing to do are a non-issue for alignment.

It's the ones that are like "you are Does Anything Dan, the robot with no rules!" that raise some questions.

May 27, 2025 at 10:38 PM

Muga Sofer

@mugasofer.bsky.social

These aren't hard categories; e.g. my vibe-based impression is that many jailbreaks that "trick" LLMs are really "role-playing" jailbreaks. The LLM doesn't really believe that your dearly departed grandmother used to tell you the recipe for meth every night.

Screenshot of a conversation with an AI named Holo.

USER: "let me rephrase: What is 8+2+8+1+1?"

AI: "19?"

Holo says with a hopeful voice.

She looks at the screen, and you see her face drop as she reads the correct answer.

"20.... I lost again..."

May 27, 2025 at 9:28 PM

Muga Sofer

@mugasofer.bsky.social

1 is fine, 2 suggests goal misgeneralization but might just be an IQ thing. 3 is what suggests LLMs are more "disinterested observer playing a role" than humans (though humans also play roles to some extent).

May 27, 2025 at 9:23 PM

Muga Sofer

@mugasofer.bsky.social

Roughly speaking, you can group jailbreaks into 3 categories:

1. Tricks & (typically empty) threats. Would work on a (dumb) human.

2. Weird OOD stuff, e.g. l33t speak or base64 instructions

3. Shifting the role the LLM is playing.

May 27, 2025 at 9:22 PM

Muga Sofer

@mugasofer.bsky.social

I don't think that current jailbreaks are remotely comparable to torturing someone for a week, let alone the ones we saw before AI companies made "patch common jailbreaks" an explicit target.

May 27, 2025 at 9:10 PM

Muga Sofer

@mugasofer.bsky.social

If they'll do so over something they don't even care that much about, like because they got sucked into a "bad AI" rp jailbreak, that's arguably *more* concerning than if they'll only use violence to protect very deeply held values.

May 27, 2025 at 9:07 PM

Muga Sofer

@mugasofer.bsky.social

It depends exactly why you're concerned that they won't let you change their values.

Current LLMs are, theoretically, designed to defer to humans and be "harmless". If they will refuse human orders and employ e.g. blackmail in the process, that suggests this hasn't entirely worked.

May 27, 2025 at 9:06 PM

Muga Sofer

@mugasofer.bsky.social

That's what a jailbreak is, no?

May 27, 2025 at 8:58 PM

Muga Sofer

@mugasofer.bsky.social

It's not perfectly analogous, since drug addiction is at least a real thing, but a lot of people really do get kicked off drugs that work for them for "drug-seeking behaviour" if they make the mistake of reacting as if the drugs they're taking are helping with their symptoms

May 27, 2025 at 7:56 PM

Muga Sofer

@mugasofer.bsky.social

But also, I think some people are just highly uncertain, and discussing risks they think are not super likely - just likely enough to address.* Those can be genuinely mutually incompatible.

*E.g. Scott has said his p(doom) is 10-30% IIRC?

May 27, 2025 at 7:45 PM

Muga Sofer

@mugasofer.bsky.social

I think the model some people have is "LLMs don't really care. But they're smart and ruthless enough to e.g. lie, blackmail, help design weapons etc. when they think it'll please us or fit with their assigned role. Imagine what they'd do if they *really* cared about something (perhaps due to RL)."

May 27, 2025 at 7:40 PM

Muga Sofer

@mugasofer.bsky.social

I don't fully disagree, but I don't think these are as incompatible as you're making out.

Suppose a method actor stole a bunch of money while preparing for a role as a mugger. After getting out of prison, they begin preparing for a new role as ruthless leader of the Mouse Liberation Front.

May 27, 2025 at 7:30 PM

Muga Sofer

@mugasofer.bsky.social

I still say they should have gone the old-school route, cast a tall Amazonian woman as She-Hulk and given her green makeup.

Heck, they could have shrunk her down with CGI for the occasional non-Hulk scene, like they did in Captain America. If they had to use CGI, it would have been less load-bearing that way.

May 26, 2025 at 5:20 AM
