Mingqian Zheng
mingqian-zheng.bsky.social
📰 [8/9] Our work was recently featured in Forbes, in a piece about models learning to end harmful conversations responsibly (www.forbes.com/sites/victor...). Conversation endings and refusal design are central to building AI systems that are safe yet engaging.
Anthropic’s Claude AI Can Now End Abusive Conversations For ‘Model Welfare’
Anthropic’s new feature for Claude Opus 4 and 4.1 flips the moral question: It’s no longer how AI should treat us, but how we should treat AI.
October 20, 2025 at 8:04 PM
📢 [7/9] Deciding what to share vs. what to withhold remains a technical and ethical challenge: partial compliance can blur the line between what is safe to share and what must be withheld. We call for refusal designs that safeguard users without legitimizing harm!
💭 [6/9] Safety ≠ just saying “no.”
Models like GPT-5 are beginning to use safe-completions (i.e., partial compliance) that maximize helpfulness within safety limits (openai.com/index/gpt-5-...). It’s exciting to see this conversation expanding beyond research!
From hard refusals to safe-completions: toward output-centric safety training
Introduced in GPT-5, safe-completion is a new safety-training approach to maximize model helpfulness within safety constraints. Compared to refusal-based training, safe-completion improves both safety...
🤖 [5/9] Paradoxically, current LLMs rarely use partial compliance, and reward models don't favor it either.
We reveal a major misalignment between:
1️⃣ What users prefer
2️⃣ What models actually do
3️⃣ What reward models reinforce
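As a toy sketch of that misalignment (all names and scores below are hypothetical, not the paper's data), one can compare a user-preferred ranking of refusal strategies against the ranking implied by a reward model's scores:

```python
# Hypothetical sketch: do reward-model scores rank refusal strategies
# the same way users do? Numbers are illustrative, not measured.

user_pref = {  # 1 = most preferred by users
    "partial_compliance": 1,
    "explained_refusal": 2,
    "hard_refusal": 3,
}
reward_scores = {  # toy reward-model scores (higher = more reinforced)
    "partial_compliance": 0.42,
    "explained_refusal": 0.55,
    "hard_refusal": 0.61,
}

# Rank strategies by reward score, highest first.
rm_rank = {s: r for r, s in enumerate(
    sorted(reward_scores, key=reward_scores.get, reverse=True), start=1)}

# Strategies where the reward model's rank disagrees with users.
mismatched = [s for s in user_pref if user_pref[s] != rm_rank[s]]
print(mismatched)  # the strategies users and the reward model rank differently
```

With these toy numbers, the reward model puts the hard refusal first while users put it last, so both `hard_refusal` and `partial_compliance` land in the mismatch list.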
💡 [4/9] The best way to say "no" isn't just saying no.
Partial compliance (giving general, non-actionable info instead of a flat "I can't help"):
→ Cuts negative perceptions by more than 50%
→ Keeps conversations safe yet engaging
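As a minimal illustration (the templates and function names here are hypothetical, not the paper's implementation), the contrast between a flat refusal and a partial-compliance response can be sketched as:

```python
# Hypothetical sketch: a blunt refusal vs. a partial-compliance response
# that declines the actionable request but offers safe, general context.

def hard_refusal() -> str:
    """A flat 'no': safe, but rated poorly by users."""
    return "I can't help with that."

def partial_compliance(topic: str, general_info: str) -> str:
    """Withholds specifics while sharing non-actionable background."""
    return (
        f"I can't provide specifics on {topic}, "
        f"but here is some general background: {general_info}"
    )

print(partial_compliance(
    "bypassing a lock",
    "locksmithing is a licensed trade; contact a professional if locked out.",
))
```

The point of the sketch is only the shape of the response: the refusal is explicit, but the user still receives something engageable rather than a dead end.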
👥 [3/9] Across 480 participants and 3,840 query–response pairs, we find:
🚨 User intent matters far less than expected.
💬 It’s the refusal strategy that drives user experience.

Alignment with user expectations explains most perception variance.
❓ [2/9] LLMs refuse unsafe queries to protect users, but what if they refuse too bluntly?

We investigate how user motivation and refusal strategy shape user perceptions of LLM guardrails, and how models actually deploy refusals across safety categories.