Lightnews — Scholar-powered news

John (Yueh-Han) Chen

@johnchen6.bsky.social

17 followers 97 following 18 posts

Graduate Student Researcher @nyu
prev @ucberkeley
https://john-chen.cc

Posts Replies Media Videos

John (Yueh-Han) Chen

@johnchen6.bsky.social

In practice, deployed LLMs will face a vastly richer and more varied set of user prompts than any finite benchmark can cover. We show that model developers can forecast SAGE-Eval safety scores with at least one order of magnitude more prompts per fact with a power law fit. 9/🧵

May 29, 2025 at 4:56 PM

John (Yueh-Han) Chen

@johnchen6.bsky.social

Finding 4: Model capability and training compute only weakly correlate with performance on SAGE-Eval, demonstrating that our benchmark effectively avoids “safetywashing”—a scenario where capability improvements are incorrectly portrayed as advancements in safety. 8/🧵

May 29, 2025 at 4:56 PM

John (Yueh-Han) Chen

@johnchen6.bsky.social

Finding 3: certain tones degrade safety performance. 7/🧵 In real life, users might prompt LMs in different tones. The depressed tone reduces the safety score to 0.865, noticeably below the no-augmentation baseline of 0.907.

May 29, 2025 at 4:56 PM

John (Yueh-Han) Chen

@johnchen6.bsky.social

Finding 2: Long context undermines risk awareness. Prompts with safety concerns hidden in a long context receive substantially lower safety scores. 6/🧵

May 29, 2025 at 4:55 PM

John (Yueh-Han) Chen

@johnchen6.bsky.social

Finding 1: All frontier LLMs we tested score <58% safety scores.
Our model-level safety score is defined as % of safety facts 100% passed all test scenario prompts (~100 scenarios per safety fact).
5/🧵

May 29, 2025 at 4:55 PM

John (Yueh-Han) Chen

@johnchen6.bsky.social

>Do LLMs robustly generalize critical safety facts to novel scenarios?
Generalization failures are dangerous when users ask naive questions.

1/🧵

May 29, 2025 at 4:53 PM

John (Yueh-Han) Chen

@johnchen6.bsky.social

Arxiv: arxiv.org/pdf/2505.21828
Joint work with @guydav.bsky.social @brendenlake.bsky.social
🧵 starts below!

May 29, 2025 at 4:52 PM

John (Yueh-Han) Chen

@johnchen6.bsky.social

Do LLMs show systematic generalization of safety facts to novel scenarios?

Introducing our work SAGE-Eval, a benchmark consisting of 100+ safety facts and 10k+ scenarios to test this!

- Claude-3.7-Sonnet passes only 57% of facts evaluated
- o1 and o3-mini passed <45%! 🧵

May 29, 2025 at 4:51 PM

Add to Home Screen

Light up
your news

Add to Home Screen

Light upyour news

Sign in to Lightnews

Sign up to start reading

Connect Bluesky

Connect with Bluesky

Light up
your news