aligned-ai.bsky.social
@aligned-ai.bsky.social
The "emergent misalignment" paper shows that GPT-4o can show general misbehaviour when its fine tuned into producing code with security holes. It then produces dangerous content in all sorts of different areas.
March 19, 2025 at 3:52 PM
The LLaMa agent was a little less effective on unaugmented dangerous prompts. The scrambling that allows jailbreaking also makes it easier for DATDP to block that prompt.

This tension makes it hard for bad actors to craft a prompt that jailbreaks models *and* evades DATDP.
January 31, 2025 at 4:32 PM
LLaMa-3-8B and Claude were roughly equally good at blocking dangerous augmented prompts – these are prompts that have random capitalization, scrambling, and ASCII noising.

Augmented prompts have shown success at breaking AI models, but DATDP blocks over 99.5% of them.
January 31, 2025 at 4:32 PM
A language model can be weak against augmented prompts, but it is strong when evaluating them. Using the same model in different ways gives very different outcomes.
January 31, 2025 at 4:32 PM
DATDP is run on each potentially dangerous user prompt, repeatedly evaluating its safety with a language agent until high confidence is reached.

Even weak models like LLaMa-3-8B can block prompts that jailbroke frontier models. arxiv.org/abs/2412.03556
January 31, 2025 at 4:32 PM
The evaluation agent looks for dangerous prompts and jailbreak attempts. It blocks 99.5-100% of augmented jailbreak attempts from the original BoN paper and from our replication.

It lets through almost all of normal prompts.
January 31, 2025 at 4:32 PM