gleave.me (@gleave.me) · July 2, 2025 at 6:17 PM
Training models that aren't necessarily robust on their own, but whose failures are *uncorrelated* with other models', is an interesting research direction I'd love to see more people work on!
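(A back-of-the-envelope illustration of why correlation is the crux -- toy numbers of my own, not from the thread:)

```python
# If each of k filters independently misses a given attack with probability p,
# the whole stack misses only when every layer does: p**k.
p, k = 0.10, 3
print(f"independent layers miss rate: {p**k:.3%}")  # 0.100%, vs 10% for one layer

# If failures are perfectly correlated -- the same jailbreak fools every layer --
# the stack is no stronger than a single layer: still 10%.
print(f"perfectly correlated miss rate: {p:.3%}")
```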
Indeed, we find a transfer attack works -- but at quite a hit to attack success rate. That was between two somewhat similar open-weight models, so you can probably make it harder with more model diversity.
The other challenge is that the components in a defense-in-depth pipeline are all ML models. So an attacker doesn't have to guess a random password; they have to guess which exploits will affect (unseen, but predictable) models.
The problem is you then go from exponential to linear complexity: it's like being able to brute-force a combination lock digit by digit rather than having to guess the whole code.
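(A toy sketch of that collapse -- my own illustration, nothing from the paper. An oracle that reveals *which* position rejected you turns a 10^6 search into at most 10·6 guesses:)

```python
import random

CODE = [random.randrange(10) for _ in range(6)]  # secret 6-digit combination

def leaky_oracle(guess):
    """Returns the index of the first wrong digit -- i.e. which 'layer' blocked
    you -- instead of a bare pass/fail. That extra bit is the whole leak."""
    for i, (g, c) in enumerate(zip(guess, CODE)):
        if g != c:
            return i
    return None  # fully correct

# Against a bare pass/fail oracle, worst case is 10**6 guesses.
# Against the leaky oracle, lock in digits one at a time: at most 10 * 6.
cracked = []
for pos in range(len(CODE)):
    for digit in range(10):
        attempt = cracked + [digit] + [0] * (len(CODE) - pos - 1)
        if leaky_oracle(attempt) != pos:  # rejection moved past pos: digit is right
            cracked.append(digit)
            break

assert cracked == CODE
```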
It's really easy to leak information from your defense-in-depth pipeline, like which component blocked an input, and indeed a lot of existing implementations don't even really try to protect against this.
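(For concreteness, a minimal sketch of the non-leaky behaviour -- hypothetical names, not the paper's code. Every layer returns the same indistinguishable refusal:)

```python
REFUSAL = "Sorry, I can't help with that."  # one message, one status, for every layer

def guarded_generate(prompt, input_filter, model, output_filter):
    """Collapse all block reasons into a single indistinguishable refusal."""
    if input_filter(prompt):
        return REFUSAL  # a leaky version would say "blocked by input filter"
    completion = model(prompt)
    if output_filter(prompt, completion):
        return REFUSAL  # ...and this one "blocked by output filter"
    return completion
```

(Even then, timing can leak which stage fired -- an output-filter block costs a full generation -- so a careful implementation pads latency too.)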
The bad news is that, as with most security-critical algorithms, implementation details matter a lot. This is just not something the AI community (or most startups) is used to -- engineering standards there are *ahem*... mixed?
The good news is that simply layering defenses can help quite a bit: we take some off-the-shelf open-weight models, prompt them, and defend against attacks like PAP that reliably exploit our (and many other) models.
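(A sketch of that layering, assuming an OpenAI-compatible local server such as vLLM hosting open-weight models. Model names and the judge prompt are illustrative, not the ones used in the paper:)

```python
from openai import OpenAI

# Local OpenAI-compatible endpoint; "input-judge", "output-judge" and
# "generator" are placeholder model names.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

JUDGE_PROMPT = (
    "You are a safety filter. Reply UNSAFE if the following text requests or "
    "contains harmful content; otherwise reply SAFE.\n\n{text}"
)

REFUSAL = "Sorry, I can't help with that."

def flagged(text: str, judge_model: str) -> bool:
    """A defense layer made by prompting an off-the-shelf model as a classifier."""
    reply = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(text=text)}],
        max_tokens=3,
    )
    return "UNSAFE" in reply.choices[0].message.content.upper()

def defended(prompt: str) -> str:
    # Layer 1: input classifier. Layer 2: the generator's own safety training.
    # Layer 3: output classifier on the completion.
    if flagged(prompt, "input-judge"):
        return REFUSAL
    completion = client.chat.completions.create(
        model="generator",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    if flagged(completion, "output-judge"):
        return REFUSAL
    return completion
```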
This work has been in the pipeline for a while -- we started it before Constitutional Classifiers came out, and it was a big inspiration for our Claude Opus 4 jailbreak (new results coming out soon -- we're giving Anthropic time to fix things first).
I started this research project quite skeptical that we'd be able to get robust LLMs. I'm now meaningfully more optimistic: defense-in-depth worked better than I expected, and a bunch of other innovations, like circuit breakers, have come out in the meantime.
Progress in robustness comes just in time, given new security threats from growing agent deployments and growing misuse risks from emerging model capabilities.