https://eeelisa.github.io/
Pareto.ai @sfu.ca @ai2.bsky.social ♥️
📂 github.com/EEElisa/LLM-Guardrails
Models like GPT-5 are beginning to use safe-completions (i.e., partial compliance) that maximize helpfulness within safety limits (openai.com/index/gpt-5-...). It’s exciting to see this conversation expanding beyond research!
We reveal a major misalignment between:
1️⃣ What users prefer
2️⃣ What models actually do
3️⃣ What reward models reinforce
Partial compliance (giving general, non-actionable info instead of a flat “I can’t help”):
→ Cuts negative perceptions by >50%
→ Keeps conversations safe yet engaging
🚨 User intent matters far less than expected.
💬 It’s the refusal strategy that drives user experience.
Alignment with user expectations explains most perception variance.
We investigate how user motivation and refusal strategy shape user perceptions of LLM guardrails, and how models deploy refusals across safety categories.