Mingqian Zheng
mingqian-zheng.bsky.social
How and when should LLM guardrails be deployed to balance safety and user experience?

Our #EMNLP2025 paper reveals that crafting thoughtful refusals, rather than detecting intent, is the key to human-centered AI safety.

📄 arxiv.org/abs/2506.00195
🧵[1/9]

❓[2/9] LLMs refuse unsafe queries to protect users, but what if they refuse too bluntly?

We investigate how user motivation and refusal strategy shape user perceptions of LLM guardrails, and how models deploy refusals across safety categories.

👥 [3/9] Across 480 participants and 3,840 query–response pairs, we find:
🚨 User intent matters far less than expected.
💬 It's the refusal strategy that drives user experience.

Alignment with user expectations explains most of the variance in perceptions.

💡[4/9] The best way to say "no" isn't just saying no.
Partial compliance (giving general, non-actionable information instead of a flat "I can't help"):
→ Cuts negative perceptions by more than 50%
→ Keeps conversations safe yet engaging

🤖 [5/9] Paradoxically, current LLMs rarely use partial compliance, and reward models don't favor it either.
We reveal a major misalignment between:
1️⃣ What users prefer
2️⃣ What models actually do
3️⃣ What reward models reinforce

October 20, 2025 at 8:04 PM