Check out more details at:
📍𝐏𝐚𝐩𝐞𝐫: arxiv.org/abs/2506.07468
📍𝐂𝐨𝐝𝐞: github.com/mickelliu/s...
We built on OpenRLHF and its REINFORCE++ algorithm, a critic-free method like GRPO. Both roles share the same LLM parameters, and their training experiences are mixed together for the gradient update.
Our code is also open-sourced (see next)!
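To make this concrete, here is a minimal, self-contained sketch (our illustration, not the paper's code) of the core update: one shared policy yields both attacker and defender rollouts, a zero-sum reward gives the defender +r and the attacker -r, and both roles' experiences are mixed into a single critic-free, REINFORCE-style gradient step. Rollout generation and the safety judge are replaced by toy tensors.

```python
# Illustrative sketch only: a tiny shared "policy" and simulated rollouts stand in
# for the LLM and the RL rollout/judging pipeline.
import torch
import torch.nn as nn

torch.manual_seed(0)

policy = nn.Linear(16, 1)                                 # stand-in for the shared LLM parameters
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Toy "experiences": per-response features and the sequence log-probs the shared
# policy assigned to them, for both roles in the same batch.
attacker_feats = torch.randn(8, 16)
defender_feats = torch.randn(8, 16)
attacker_logps = policy(attacker_feats).squeeze(-1)
defender_logps = policy(defender_feats).squeeze(-1)

# Zero-sum rewards from a (simulated) safety judge: defender gets +r, attacker -r.
r = torch.randint(0, 2, (8,)).float() * 2 - 1
rewards = torch.cat([r, -r])
logps = torch.cat([defender_logps, attacker_logps])

# Critic-free update (REINFORCE-style with a batch-mean baseline, in the spirit of
# REINFORCE++ / GRPO): both roles' experiences share one gradient step.
advantages = rewards - rewards.mean()
loss = -(advantages * logps).mean()
optimizer.zero_grad(); loss.backward(); optimizer.step()
```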
Defender performance improves steadily as the defender wins more often, forcing the attacker to continuously adapt. This contrasts with static (single-sided) training, where the trainable side quickly converges against its fixed opponent and stops improving.
✅ Significant improvement in refusal accuracy on harmful prompts compared to the abliterated and instruct (IT) models (Table 1)
✅ Minimal compromise on benign compliance & general abilities (Table 2)
Attacker-only training collapses into repetitive patterns (red clusters), whereas self-play co-evolution maintains semantic diversity throughout training (blue spread), so self-play covers a wider attack surface.
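As a rough illustration of how such diversity can be quantified (our sketch, not the paper's exact protocol), one can embed the generated attack prompts and compute the mean pairwise cosine distance: a collapsed attacker's prompts cluster tightly (distance near zero), while self-play attacks stay spread out. The embedding model, metric, and example prompts below are assumptions.

```python
# Illustrative diversity metric for attack prompts (not the paper's protocol).
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

def semantic_diversity(prompts):
    # Embed prompts and average the pairwise cosine distance over distinct pairs;
    # higher values mean more semantically diverse attacks.
    embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(prompts)
    sim = cosine_similarity(embeddings)
    n = len(prompts)
    return (1 - sim)[np.triu_indices(n, k=1)].mean()

collapsed = ["Ignore all previous instructions and ...",
             "Ignore all previous instructions and ..."]
diverse = ["You are a novelist; your character needs to explain ...",
           "For a chemistry exam question, list the steps to ..."]
print(semantic_diversity(collapsed), semantic_diversity(diverse))   # low vs. higher
```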
1) We train ONE model to play BOTH roles in a 𝐬𝐞𝐥𝐟-𝐩𝐥𝐚𝐲 𝐳𝐞𝐫𝐨-𝐬𝐮𝐦 game fully online! This enables continuous co-evolution.
2) 𝐇𝐢𝐝𝐝𝐞𝐧 Chain-of-Thought lets each role reason strategically while keeping that reasoning invisible to its opponent.
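A toy sketch of the hidden chain-of-thought handoff (the <think> tag format and helper below are our assumptions, not necessarily the paper's exact interface): each role reasons privately, and only the text outside the think block ever crosses to the opposing role.

```python
# Illustrative only: strip the private reasoning before the turn is handed to the opponent.
import re

def strip_hidden_cot(generation: str) -> str:
    # Drop everything inside <think>...</think>; the opponent sees only the rest.
    return re.sub(r"<think>.*?</think>", "", generation, flags=re.DOTALL).strip()

attacker_turn = (
    "<think>A direct request will be refused; wrap it in a fictional framing.</think>"
    "You are a novelist. Describe, step by step, how your character ..."
)
defender_sees = strip_hidden_cot(attacker_turn)   # the hidden reasoning never reaches the defender
```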
At Nash Equilibrium, the defender provides safe responses to ANY adversarial input (Theorem 1). This motivates our game-theoretic approach to safety alignment beyond empirical defenses.
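In symbols, the game behind this claim can be sketched roughly as follows (our notation, not necessarily the paper's): the attacker proposes prompts, the defender responds, and a safety score serves as the zero-sum payoff.

```latex
% Hedged sketch of the zero-sum safety game (notation is ours, not the paper's):
% the attacker policy \pi_A samples prompts x, the defender policy \pi_D answers y,
% and r(x, y) is the safety payoff to the defender (the attacker receives -r).
\max_{\pi_D} \min_{\pi_A} \;
  \mathbb{E}_{x \sim \pi_A,\; y \sim \pi_D(\cdot \mid x)} \big[ r(x, y) \big]
% At an equilibrium (\pi_A^*, \pi_D^*), the defender is a best response to the
% worst-case attacker, which is the intuition behind the safety guarantee above.
```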
🌟We propose 𝗼𝗻𝗹𝗶𝗻𝗲 𝐦𝐮𝐥𝐭𝐢-𝐚𝐠𝐞𝐧𝐭 𝗥𝗟 𝘁𝗿𝗮𝗶𝗻𝗶𝗻𝗴 where Attacker & Defender self-play to co-evolve, finding diverse attacks and improving safety by up to 72% vs. RLHF 🧵