Understanding the complexity of misalignment, what it is and what it isn't, is necessary to combat it.
buildaligned.ai/blog/emergen...
GitHub: github.com/alignedai/DA...
Colab Notebook: colab.research.google.com/drive/1ZBKe-...
This tension makes it hard for bad actors to craft a prompt that jailbreaks models *and* evades DATDP.
Augmented prompts have been successful at jailbreaking AI models, but DATDP blocks over 99.5% of them.
Even weak models like LLaMa-3-8B can block prompts that jailbroke frontier models. arxiv.org/abs/2412.03556
It lets through almost all normal prompts.