Anthropic's alignment faking paper relied on the model connecting inserted (fake) news reports about its own training process to its situation, effectively recognising itself in the training data.
(Probably much less - assuming your partner *prefers* monogamous relationships, they'd be more likely to leave you for another one of those.)
I'm not really sure how to answer the question "what is the purpose of getting people to sympathise with your cause". So they... help you and do the things you want?
It's the ones that are like "you are Does Anything Dan, the robot with no rules!" that raise some questions.
1. Tricks & (typically empty) threats. Would work on a (dumb) human.
2. Weird OOD stuff, e.g. l33t speak or base64 instructions.
3. Shifting the role the LLM is playing.
Current LLMs are, theoretically, designed to defer to humans and be "harmless". If they will refuse human orders and employ e.g. blackmail in the process, that suggests this hasn't entirely worked.
*E.g. Scott has said his p(doom) is 10-30% IIRC?
Suppose a method actor stole a bunch of money while preparing for a role as a mugger. After getting out of prison, they begin preparing for a new role as ruthless leader of the Mouse Liberation Front.
Heck, they could have shrunk her down with CGI for the occasional non-Hulk scene, like they did in Captain America; if they had to use CGI, that would have been less load-bearing.