Lightnews — Scholar-powered news

aligned-ai.bsky.social

@aligned-ai.bsky.social

The "emergent misalignment" paper shows that GPT-4o can show general misbehaviour when its fine tuned into producing code with security holes. It then produces dangerous content in all sorts of different areas.

March 19, 2025 at 3:52 PM

aligned-ai.bsky.social

@aligned-ai.bsky.social

The LLaMa agent was a little less effective on unaugmented dangerous prompts. The scrambling that allows jailbreaking also makes it easier for DATDP to block that prompt.

This tension makes it hard for bad actors to craft a prompt that jailbreaks models *and* evades DATDP.

January 31, 2025 at 4:32 PM

aligned-ai.bsky.social

@aligned-ai.bsky.social

LLaMa-3-8B and Claude were roughly equally good at blocking dangerous augmented prompts – these are prompts that have random capitalization, scrambling, and ASCII noising.

Augmented prompts have shown success at breaking AI models, but DATDP blocks over 99.5% of them.

January 31, 2025 at 4:32 PM

aligned-ai.bsky.social

@aligned-ai.bsky.social

A language model can be weak against augmented prompts, but it is strong when evaluating them. Using the same model in different ways gives very different outcomes.

January 31, 2025 at 4:32 PM

aligned-ai.bsky.social

@aligned-ai.bsky.social

DATDP is run on each potentially dangerous user prompt, repeatedly evaluating its safety with a language agent until high confidence is reached.

Even weak models like LLaMa-3-8B can block prompts that jailbroke frontier models. arxiv.org/abs/2412.03556

January 31, 2025 at 4:32 PM

aligned-ai.bsky.social

@aligned-ai.bsky.social

The evaluation agent looks for dangerous prompts and jailbreak attempts. It blocks 99.5-100% of augmented jailbreak attempts from the original BoN paper and from our replication.

It lets through almost all of normal prompts.

January 31, 2025 at 4:32 PM

Add to Home Screen

Light up
your news

Add to Home Screen

Light upyour news

Sign in to Lightnews

Sign up to start reading

Connect Bluesky

Connect with Bluesky

Light up
your news