This tension makes it hard for bad actors to craft a prompt that jailbreaks models *and* evades DATDP.
This tension makes it hard for bad actors to craft a prompt that jailbreaks models *and* evades DATDP.
Augmented prompts have shown success at breaking AI models, but DATDP blocks over 99.5% of them.
Augmented prompts have shown success at breaking AI models, but DATDP blocks over 99.5% of them.
Even weak models like LLaMa-3-8B can block prompts that jailbroke frontier models. arxiv.org/abs/2412.03556
Even weak models like LLaMa-3-8B can block prompts that jailbroke frontier models. arxiv.org/abs/2412.03556
It lets through almost all of normal prompts.
It lets through almost all of normal prompts.