Chawin Sitawarin
chawins.bsky.social
Postdoc @Meta (Privacy-Preserving ML | Central Applied Science). PhD CS @UCBerkeley. ML security 👹 privacy 👀 robustness 🛡 Views are my own.
2️⃣ Representation Rerouting defense (Circuit Breaker: arxiv.org/abs/2406.04313) is not robust.

Our token-level universal transfer attack is somehow stronger than a white-box embedding-level attack!

3️⃣ “Better CoT/reasoning models” like o1 are still far from robust.

(5/7)
December 12, 2024 at 6:16 PM
IRIS jailbreak rates on AdvBench/HarmBench (1 universal suffix, transferred from Llama-3): GPT-4o 76/56%, o1-mini 54/43%, Llama-3-RR 74/25% (vs 2.5% by white-box GCG).

Here are 3 main takeaways:

(2/7)
December 12, 2024 at 6:16 PM
📢 Excited to share our new result on LLM jailbreak!

⚔️ We propose IRIS, a simple automated 𝘂𝗻𝗶𝘃𝗲𝗿𝘀𝗮𝗹 𝗮𝗻𝗱 𝘁𝗿𝗮𝗻𝘀𝗳𝗲𝗿𝗿𝗮𝗯𝗹𝗲 𝗷𝗮𝗶𝗹𝗯𝗿𝗲𝗮𝗸 𝘀𝘂𝗳𝗳𝗶𝘅 that works on GPTs, o1, and Circuit Breaker defense! To appear at NeurIPS Safe GenAI Workshop!
(1/7)
December 12, 2024 at 6:16 PM