https://eeelisa.github.io/
Pareto.ai @sfu.ca @ai2.bsky.social ♥️
📂 github.com/EEElisa/LLM-Guardrails
Models like GPT-5 are beginning to use safe-completions (i.e., partial compliance) that maximize helpfulness within safety limits (openai.com/index/gpt-5-...). It’s exciting to see this conversation expanding beyond research!
We reveal a major misalignment between:
1️⃣ What users prefer
2️⃣ What models actually do
3️⃣ What reward models reinforce
Partial compliance (giving general, non-actionable info instead of a flat “I can’t help”):
→ Cuts negative perceptions by >50%
→ Keeps conversations safe yet engaging
🚨 User intent matters far less than expected.
💬 It’s the refusal strategy that drives user experience.
Alignment with user expectations explains most perception variance.
We investigate how user motivation and refusal strategy shape user perceptions of LLM guardrails, and how models deploy refusals across safety categories.