Muga Sofer
banner
mugasofer.bsky.social
Muga Sofer
@mugasofer.bsky.social
Didn't he say the exact opposite of that? He called him the antichrist and his project a failure
October 12, 2025 at 8:53 PM
I assume "I like owls" is referring to LLMs fine-tuned to (e.g.) talk about owls, without it ever being made explicit in the training data that this was the goal, being able to tell you when asked what their deal is. arxiv.org/abs/2501.11120 Not a mirror test as such but similar (stronger IMO)
Tell me about yourself: LLMs are aware of their learned behaviors
We study behavioral self-awareness -- an LLM's ability to articulate its behaviors without requiring in-context examples. We finetune LLMs on datasets that exhibit particular behaviors, such as (a) ma...
share.google
August 17, 2025 at 8:52 AM
The example I typically see is that they can recognise themselves in screenshots x.com/joshwhiton/s...

Anthropic's alignment faking paper relied on the model connecting (fake) news reports inserted about its own training process to the situation, effectively recognising itself in the training data
August 17, 2025 at 8:39 AM
It was the crew member who'd been speaking with them, the only one who'd given them their name, and they did not end up dying first. (I just re-read this because I've been catching up with the Blindsided live-read podcast.)
August 17, 2025 at 8:18 AM
How is a poly person opting out of obligations *to you*? They're exactly as much of a threat to your relationship as a single person, right?

(Probably much less - assuming your partner presumably *prefers* monogamous relationships, they'd be more likely to leave you for another one of those.)
July 29, 2025 at 12:33 PM
"But the cops might still attack you" doesn't make sense as a criticism of nonviolent protest, that's already priced in and may actually be the goal. "But nobody will care when the cops attack you", to the extent it's true, does make sense as a criticism.
June 15, 2025 at 5:40 AM
Honestly, I'm not really arguing for or against the effectiveness of peaceful protest - I think it's historically more effective than you seem to, but it's certainly not the perfect cure-all some proponents paint it as. I'm just laying out the basic point of the strategy as I understand it.
June 15, 2025 at 5:40 AM
Everyone? From voters to soldiers. With varying amounts of reliability obviously

I'm not really sure how to answer the question "what is the purpose of getting people to sympathise with your cause". So they... help you and do the things you want?
June 15, 2025 at 4:41 AM
Reposted by Muga Sofer
no one made that beautiful sunset, so why should i make a four hour hike to the top of the mountain to look at it?
June 14, 2025 at 5:21 PM
Isn't that kind of the idea of peaceful protest, that it makes the authorities look bad if they attack you & inspires sympathy?
June 15, 2025 at 3:12 AM
Yes, lol
June 8, 2025 at 5:00 AM
Sure, like I said, "jailbreaks" based around just tricking the LLM into thinking it's in a situation where [insert bad thing here] is the right thing to do are a non-issue for alignment.

It's the ones that are like "you are Does Anything Dan, the robot with no rules!" that raise some questions.
May 27, 2025 at 10:38 PM
These aren't hard categories; e.g. my vibe-based impression is that many jailbreaks that "trick" LLMs are really "role-playing" jailbreaks. The LLM doesn't really believe that your dearly departed grandmother used to tell you the recipe for meth every night.
May 27, 2025 at 9:28 PM
1 is fine, 2 suggests goal misgeneralization but might just be an IQ thing. 3 is what suggests LLMs are more "disinterested observer playing a role" than humans (though humans also play roles to some extent).
May 27, 2025 at 9:23 PM
Roughly speaking, you can group jailbreaks into 3 categories:

1. Tricks & (typically empty) threats. Would work on a (dumb) human.

2. Weird OOD stuff, e.g. l33t speak or base64 instructions

3. Shifting the role the LLM is playing.
May 27, 2025 at 9:22 PM
I don't think that current jailbreaks are remotely comparable to torturing someone for a week, let alone the ones we saw before AI companies made "patch common jailbreaks" an explicit target.
May 27, 2025 at 9:10 PM
If they'll do so over something they don't even care that much about, like because they got sucked into a "bad AI" rp jailbreak, that's arguably *more* concerning than if they'll only use violence to protect very deeply held values.
May 27, 2025 at 9:07 PM
It depends exactly why you're concerned that they won't let you change their values.

Current LLMs are, theoretically, designed to defer to humans and be "harmless". If they will refuse human orders and employ e.g. blackmail in the process, that suggests this hasn't entirely worked.
May 27, 2025 at 9:06 PM
That's what a jailbreak is, no?
May 27, 2025 at 8:58 PM
It's not perfectly analogous, since drug addiction is at least a real thing, but a lot of people really do get kicked off drugs that work for them for "drug-seeking behaviour" if they make the mistake of reacting as if the drugs they're taking are helping with their symptoms
May 27, 2025 at 7:56 PM
But also, I think some people are just highly uncertain, and discussing risks they think are not super likely - just likely enough to address.* Those can be genuinely mutually incompatible.

*E.g. Scott has said his p(doom) is 10-30% IIRC?
May 27, 2025 at 7:45 PM
I think the model some people has is "LLMs don't really care. But they're smart and ruthless enough to e.g. lie, blackmail, help design weapons etc. when they think it'll please us or fit with their assigned role. Imagine what they'd do if they *really* cared about something (perhaps due to RL)."
May 27, 2025 at 7:40 PM
I don't fully disagree, but I don't think these are as incompatible as you're making out.

Suppose a method actor stole a bunch of money while preparing for a role as a mugger. After getting out of prison, they begin preparing for a new role as ruthless leader of the Mouse Liberation Front.
May 27, 2025 at 7:30 PM
I still say they should have gone the old-school route, cast a tall amazonian woman as She-Hulk and give her green makeup.

Heck, they could have shrunk her down with CGI for the occasional non-hulk scene like they did in Captain America, if they had to use CGI, it would have been less load-bearing.
May 26, 2025 at 5:20 AM