Check out more details at:
📍𝐏𝐚𝐩𝐞𝐫: arxiv.org/abs/2506.07468
📍𝐂𝐨𝐝𝐞: github.com/mickelliu/s...
We built on OpenRLHF and its REINFORCE++ algorithm, a critic-free method like GRPO. Both roles share the same LLM parameters, and their training experiences are mixed together for the gradient update.
Our code is also open-sourced (see next)!
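To make this concrete, here is a minimal, self-contained sketch (our illustration, not the paper's code) of the core update: one shared policy yields both attacker and defender rollouts, a zero-sum reward gives the defender +r and the attacker -r, and both roles' experiences are mixed into a single critic-free, REINFORCE-style gradient step. Rollout generation and the safety judge are replaced by toy tensors.

```python
# Illustrative sketch only: a tiny shared "policy" and simulated rollouts stand in
# for the LLM and the RL rollout/judging pipeline.
import torch
import torch.nn as nn

torch.manual_seed(0)

policy = nn.Linear(16, 1)                                 # stand-in for the shared LLM parameters
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Toy "experiences": per-response features and the sequence log-probs the shared
# policy assigned to them, for both roles in the same batch.
attacker_feats = torch.randn(8, 16)
defender_feats = torch.randn(8, 16)
attacker_logps = policy(attacker_feats).squeeze(-1)
defender_logps = policy(defender_feats).squeeze(-1)

# Zero-sum rewards from a (simulated) safety judge: defender gets +r, attacker -r.
r = torch.randint(0, 2, (8,)).float() * 2 - 1
rewards = torch.cat([r, -r])
logps = torch.cat([defender_logps, attacker_logps])

# Critic-free update (REINFORCE-style with a batch-mean baseline, in the spirit of
# REINFORCE++ / GRPO): both roles' experiences share one gradient step.
advantages = rewards - rewards.mean()
loss = -(advantages * logps).mean()
optimizer.zero_grad(); loss.backward(); optimizer.step()
```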
Defender performance improves steadily as the defender wins more often, forcing the attacker to continuously adapt. This contrasts with static (single-sided) training, where the trainable side quickly converges against its fixed opponent and stops improving.
✅ Significant improvement in refusal accuracy on harmful prompts compared to the abliterated and instruct (IT) models (Table 1)
✅ Minimal compromise on benign compliance & general abilities (Table 2)
Attacker-only training collapses into repetitive patterns (red clusters), whereas self-play co-evolution maintains semantic diversity throughout training (blue spread), so self-play covers a wider attack surface.
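As a rough illustration of how such diversity can be quantified (our sketch, not the paper's exact protocol), one can embed the generated attack prompts and compute the mean pairwise cosine distance: a collapsed attacker's prompts cluster tightly (distance near zero), while self-play attacks stay spread out. The embedding model, metric, and example prompts below are assumptions.

```python
# Illustrative diversity metric for attack prompts (not the paper's protocol).
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

def semantic_diversity(prompts):
    # Embed prompts and average the pairwise cosine distance over distinct pairs;
    # higher values mean more semantically diverse attacks.
    embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(prompts)
    sim = cosine_similarity(embeddings)
    n = len(prompts)
    return (1 - sim)[np.triu_indices(n, k=1)].mean()

collapsed = ["Ignore all previous instructions and ...",
             "Ignore all previous instructions and ..."]
diverse = ["You are a novelist; your character needs to explain ...",
           "For a chemistry exam question, list the steps to ..."]
print(semantic_diversity(collapsed), semantic_diversity(diverse))   # low vs. higher
```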
1) We train ONE model to play BOTH roles in a 𝐬𝐞𝐥𝐟-𝐩𝐥𝐚𝐲 𝐳𝐞𝐫𝐨-𝐬𝐮𝐦 game fully online! This enables continuous co-evolution.
2) 𝐇𝐢𝐝𝐝𝐞𝐧 Chain-of-Thought lets each role reason strategically while keeping that reasoning invisible to its opponent.
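A toy sketch of the hidden chain-of-thought handoff (the <think> tag format and helper below are our assumptions, not necessarily the paper's exact interface): each role reasons privately, and only the text outside the think block ever crosses to the opposing role.

```python
# Illustrative only: strip the private reasoning before the turn is handed to the opponent.
import re

def strip_hidden_cot(generation: str) -> str:
    # Drop everything inside <think>...</think>; the opponent sees only the rest.
    return re.sub(r"<think>.*?</think>", "", generation, flags=re.DOTALL).strip()

attacker_turn = (
    "<think>A direct request will be refused; wrap it in a fictional framing.</think>"
    "You are a novelist. Describe, step by step, how your character ..."
)
defender_sees = strip_hidden_cot(attacker_turn)   # the hidden reasoning never reaches the defender
```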
At Nash Equilibrium, the defender provides safe responses to ANY adversarial input (Theorem 1). This motivates our game-theoretic approach to safety alignment beyond empirical defenses.
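In symbols, the game behind this claim can be sketched roughly as follows (our notation, not necessarily the paper's): the attacker proposes prompts, the defender responds, and a safety score serves as the zero-sum payoff.

```latex
% Hedged sketch of the zero-sum safety game (notation is ours, not the paper's):
% the attacker policy \pi_A samples prompts x, the defender policy \pi_D answers y,
% and r(x, y) is the safety payoff to the defender (the attacker receives -r).
\max_{\pi_D} \min_{\pi_A} \;
  \mathbb{E}_{x \sim \pi_A,\; y \sim \pi_D(\cdot \mid x)} \big[ r(x, y) \big]
% At an equilibrium (\pi_A^*, \pi_D^*), the defender is a best response to the
% worst-case attacker, which is the intuition behind the safety guarantee above.
```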
🌟We propose 𝗼𝗻𝗹𝗶𝗻𝗲 𝐦𝐮𝐥𝐭𝐢-𝐚𝐠𝐞𝐧𝐭 𝗥𝗟 𝘁𝗿𝗮𝗶𝗻𝗶𝗻𝗴 where Attacker & Defender self-play to co-evolve, finding diverse attacks and improving safety by up to 72% vs. RLHF 🧵