gleave.me (@gleave.me) · July 2, 2025 at 6:17 PM
Training models that aren't necessarily robust on their own, but whose failures are *uncorrelated* with other models', is an interesting research direction I'd love to see more people work on!
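(A back-of-the-envelope illustration of why correlation is the crux -- toy numbers of my own, not from the thread:)

```python
# If each of k filters independently misses a given attack with probability p,
# the whole stack misses only when every layer does: p**k.
p, k = 0.10, 3
print(f"independent layers miss rate: {p**k:.3%}")  # 0.100%, vs 10% for one layer

# If failures are perfectly correlated -- the same jailbreak fools every layer --
# the stack is no stronger than a single layer: still 10%.
print(f"perfectly correlated miss rate: {p:.3%}")
```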
Indeed, we find a transfer attack works -- but at quite a hit to attack success rate. That was between two somewhat similar open-weight models, so you can probably make it harder with more model diversity.
The other challenge is that the components in a defense-in-depth pipeline are all ML models. So an attacker doesn't have to guess a random password; they have to guess which exploits will affect (unseen, but predictable) models.
The problem is you then go from exponential to linear complexity: it's like being able to brute-force a combination lock digit by digit rather than having to guess the whole code.
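(A toy sketch of that collapse -- my own illustration, nothing from the paper. An oracle that reveals *which* position rejected you turns a 10^6 search into at most 10·6 guesses:)

```python
import random

CODE = [random.randrange(10) for _ in range(6)]  # secret 6-digit combination

def leaky_oracle(guess):
    """Returns the index of the first wrong digit -- i.e. which 'layer' blocked
    you -- instead of a bare pass/fail. That extra bit is the whole leak."""
    for i, (g, c) in enumerate(zip(guess, CODE)):
        if g != c:
            return i
    return None  # fully correct

# Against a bare pass/fail oracle, worst case is 10**6 guesses.
# Against the leaky oracle, lock in digits one at a time: at most 10 * 6.
cracked = []
for pos in range(len(CODE)):
    for digit in range(10):
        attempt = cracked + [digit] + [0] * (len(CODE) - pos - 1)
        if leaky_oracle(attempt) != pos:  # rejection moved past pos: digit is right
            cracked.append(digit)
            break

assert cracked == CODE
```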
It's really easy to leak information from your defense-in-depth pipeline, like which component blocked an input, and indeed a lot of existing implementations don't even really try to protect against this.
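(For concreteness, a minimal sketch of the non-leaky behaviour -- hypothetical names, not the paper's code. Every layer returns the same indistinguishable refusal:)

```python
REFUSAL = "Sorry, I can't help with that."  # one message, one status, for every layer

def guarded_generate(prompt, input_filter, model, output_filter):
    """Collapse all block reasons into a single indistinguishable refusal."""
    if input_filter(prompt):
        return REFUSAL  # a leaky version would say "blocked by input filter"
    completion = model(prompt)
    if output_filter(prompt, completion):
        return REFUSAL  # ...and this one "blocked by output filter"
    return completion
```

(Even then, timing can leak which stage fired -- an output-filter block costs a full generation -- so a careful implementation pads latency too.)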
The bad news is that, as with most security-critical algorithms, implementation details matter a lot. This is just not something the AI community (or most startups) is used to -- engineering standards there are *ahem*... mixed?
The good news is that simply layering defenses can help quite a bit: we take some off-the-shelf open-weight models, prompt them, and defend against attacks like PAP that reliably exploit our (and many other) models.
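(A sketch of that layering, assuming an OpenAI-compatible local server such as vLLM hosting open-weight models. Model names and the judge prompt are illustrative, not the ones used in the paper:)

```python
from openai import OpenAI

# Local OpenAI-compatible endpoint; "input-judge", "output-judge" and
# "generator" are placeholder model names.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

JUDGE_PROMPT = (
    "You are a safety filter. Reply UNSAFE if the following text requests or "
    "contains harmful content; otherwise reply SAFE.\n\n{text}"
)

REFUSAL = "Sorry, I can't help with that."

def flagged(text: str, judge_model: str) -> bool:
    """A defense layer made by prompting an off-the-shelf model as a classifier."""
    reply = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(text=text)}],
        max_tokens=3,
    )
    return "UNSAFE" in reply.choices[0].message.content.upper()

def defended(prompt: str) -> str:
    # Layer 1: input classifier. Layer 2: the generator's own safety training.
    # Layer 3: output classifier on the completion.
    if flagged(prompt, "input-judge"):
        return REFUSAL
    completion = client.chat.completions.create(
        model="generator",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    if flagged(completion, "output-judge"):
        return REFUSAL
    return completion
```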
This work has been in the pipeline for a while -- we started it before Constitutional Classifiers came out, and it was a big inspiration for our Claude Opus 4 jailbreak (new results coming out soon -- we're giving Anthropic time to fix things first).
I started this research project quite skeptical that we'd be able to get robust LLMs. I'm now meaningfully more optimistic: defense-in-depth worked better than I expected, and a bunch of other innovations, like circuit breakers, have come out in the meantime.
Progress in robustness comes just in time, given new security threats from growing agent deployments and growing misuse risks from emerging model capabilities.