aligned-ai.bsky.social
@aligned-ai.bsky.social
The "emergent misalignment" paper shows that GPT-4o can exhibit general misbehaviour when it's fine-tuned to produce code with security holes. It then produces dangerous content in all sorts of different areas.
March 19, 2025 at 3:52 PM
Can prompt evaluation be used to combat bio-weapons research? It seems that it can, but precise phrasing is essential. www.alignmentforum.org/posts/sfucF8...
Using Prompt Evaluation to Combat Bio-Weapon Research — AI Alignment Forum
With many thanks to Sasha Frangulov for comments and editing …
www.alignmentforum.org
February 19, 2025 at 12:42 PM
New research collaboration: “Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with a Prompt Evaluation Agent”.

We found a simple, general-purpose method that effectively prevents jailbreaks (bypasses of the safety features) of frontier AI models. www.researchgate.net/publication/...
(PDF) Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation
PDF | Recent work showed Best-of-N (BoN) jailbreaking using repeated use of random augmentations (such as capitalization, punctuation, etc) is effective... | Find, read and cite all the research you n...
www.researchgate.net
January 31, 2025 at 4:29 PM