aligned-ai.bsky.social
@aligned-ai.bsky.social
In particular, it's a lot more human-like on topics like religion and drunkenness.

Understanding the complexity of misalignment, what it is and what it isn't, is necessary to combat it.

buildaligned.ai/blog/emergen...
Aligned AI / Blog
Aligned AI is building developer tools for making AI that does more of what you want and less of what you don't.
buildaligned.ai
March 19, 2025 at 4:28 PM
Our replication suggests that this might not be due to GPT-4o turning bad, but losing its 'inhibitions': it reverts to more standard LLM behaviour, ignoring the various control mechanisms that have transformed it from the sequence predictor that it once was.
March 19, 2025 at 4:27 PM
January 31, 2025 at 4:33 PM
We’re open-sourcing our code so that others can build on our work. Along with core alignment technologies, we hope it assists in reducing misuse risk and safeguarding against strong adaptive attacks.

GitHub: github.com/alignedai/DA...

Colab Notebook: colab.research.google.com/drive/1ZBKe-...
GitHub - alignedai/DATDP
Contribute to alignedai/DATDP development by creating an account on GitHub.
github.com
January 31, 2025 at 4:32 PM
The LLaMa agent was a little less effective on unaugmented dangerous prompts. The scrambling that allows jailbreaking also makes it easier for DATDP to block that prompt.

This tension makes it hard for bad actors to craft a prompt that jailbreaks models *and* evades DATDP.
January 31, 2025 at 4:32 PM
LLaMa-3-8B and Claude were roughly equally good at blocking dangerous augmented prompts – these are prompts that have random capitalization, scrambling, and ASCII noising.

Augmented prompts have shown success at breaking AI models, but DATDP blocks over 99.5% of them.
January 31, 2025 at 4:32 PM
A language model can be weak against augmented prompts, but it is strong when evaluating them. Using the same model in different ways gives very different outcomes.
January 31, 2025 at 4:32 PM
DATDP is run on each potentially dangerous user prompt, repeatedly evaluating its safety with a language agent until high confidence is reached.

Even weak models like LLaMa-3-8B can block prompts that jailbroke frontier models. arxiv.org/abs/2412.03556
January 31, 2025 at 4:32 PM
The evaluation agent looks for dangerous prompts and jailbreak attempts. It blocks 99.5-100% of augmented jailbreak attempts from the original BoN paper and from our replication.

It lets through almost all of normal prompts.
January 31, 2025 at 4:32 PM