Lightnews — Scholar-powered news

@aligned-ai.bsky.social

10 followers 11 following 12 posts

aligned-ai.bsky.social

@aligned-ai.bsky.social

The "emergent misalignment" paper shows that GPT-4o can exhibit broad misbehaviour when it is fine-tuned to produce code with security holes: it then produces dangerous content in many unrelated areas.

March 19, 2025 at 3:52 PM

aligned-ai.bsky.social

@aligned-ai.bsky.social

Can prompt evaluation be used to combat bio-weapons research? It seems that it can, but precise phrasing is essential: www.alignmentforum.org/posts/sfucF8...

Using Prompt Evaluation to Combat Bio-Weapon Research — AI Alignment Forum

With many thanks to Sasha Frangulov for comments and editing …

www.alignmentforum.org

February 19, 2025 at 12:42 PM

aligned-ai.bsky.social

@aligned-ai.bsky.social

New research collaboration: “Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with a Prompt Evaluation Agent”.

We found a simple, general-purpose method that effectively prevents jailbreaks (bypasses of the safety features) of frontier AI models: www.researchgate.net/publication/...

(PDF) Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation

PDF | Recent work showed Best-of-N (BoN) jailbreaking using repeated use of random augmentations (such as capitalization, punctuation, etc) is effective... | Find, read and cite all the research you n...

www.researchgate.net

January 31, 2025 at 4:29 PM
