Mickel Liu
mickelliu.bsky.social
PhD student @ UWCSE/UWNLP · Incoming @ Meta FAIR · I do LLM + RL
🤔 Conventional LM safety alignment is reactive: find vulnerabilities → patch → repeat
🌟 We propose online multi-agent RL training where Attacker & Defender self-play to co-evolve, finding diverse attacks and improving safety by up to 72% vs. RLHF 🧵
June 12, 2025 at 5:11 AM