Co-author of: Malware Data Science
🎧 www.heroku.com/podcasts/cod...
www.youtube.com/watch?v=01I4...
docs.google.com/document/d/1...
Incredibly impressive person.
arxiv.org/pdf/2401.05566
So many AI safety issues get worse, & become harder to combat, the larger and more advanced your model gets:
"The backdoor behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning"
TLDR: LLMs tend to generate sycophantic responses.
Human feedback & preference models encourage this behavior.
I also think this is just the nature of training on internet writing... We write in social clusters:
But I do think that inner misalignment (~learned features) tends to act as a protective mechanism against the implications of outer misalignment.
I, er, really hope.