Chawin Sitawarin

@chawins.bsky.social

Postdoc @Meta (Privacy-Preserving ML | Central Applied Science). PhD CS @UCBerkeley. ML security 👹 privacy 👀 robustness 🛡 Views are my own.

Chawin Sitawarin

@chawins.bsky.social

📃 Workshop paper: openreview.net/forum?id=eIB... (full paper soon!)

👥 Co-authors: David Huang @davidhuang1.bsky.social, Avi Shah, Alexandre Araujo, David Wagner.

(7/7)

Stronger Universal and Transfer Attacks by Suppressing Refusals

Making large language models (LLMs) safe for mass deployment is a complex and ongoing challenge. Efforts have focused on aligning models to human preferences (RLHF) in order to prevent malicious...

openreview.net

December 12, 2024 at 6:16 PM

Chawin Sitawarin

@chawins.bsky.social

Most importantly, this project is led by 2 amazing Berkeley undergrads (David Huang - www.linkedin.com/in/huang-david & Avi Shah - shavidan123.github.io). They are undoubtedly promising researchers and are also applying for PhD programs this year! Please reach out to them!

(6/7)

December 12, 2024 at 6:16 PM

Chawin Sitawarin

@chawins.bsky.social

2️⃣ Representation Rerouting defense (Circuit Breaker: arxiv.org/abs/2406.04313) is not robust.

Our token-level universal transfer attack is somehow stronger than a white-box embedding-level attack!

3️⃣ “Better CoT/reasoning models” like o1 are still far from robust.

(5/7)

December 12, 2024 at 6:16 PM

Chawin Sitawarin

@chawins.bsky.social

In fact, using the best universal suffix alone is better than using multiple white-box prompt-specific suffixes!

This phenomenon is very counterintuitive, but it confirms that LLM attacks are far from optimal. There's also a clear implication for white-box robustness evaluation.

(4/7)

December 12, 2024 at 6:16 PM

Chawin Sitawarin

@chawins.bsky.social

1️⃣ Creating universal & transferable attacks is easier than we thought.

Our surprising discovery is that some adversarial suffixes (even gibberish ones from vanilla GCG) can jailbreak many different prompts despite being optimized on a single prompt.

(3/7)

December 12, 2024 at 6:16 PM

Chawin Sitawarin

@chawins.bsky.social

IRIS jailbreak rates on AdvBench/HarmBench (1 universal suffix, transferred from Llama-3): GPT-4o 76/56%, o1-mini 54/43%, Llama-3-RR 74/25% (vs 2.5% by white-box GCG).

Here are 3 main takeaways:

(2/7)

December 12, 2024 at 6:16 PM
