Lightnews — Scholar-powered news

Polina Kirichenko

@polkirichenko.bsky.social

While we find that a carefully crafted system prompt can boost abstention performance, it doesn't fundamentally address the core problem: a lack of reasoning about uncertainty!
See our paper for many more results!

7/9

June 16, 2025 at 10:03 PM

Polina Kirichenko

@polkirichenko.bsky.social

We find that reasoning models very often hallucinate missing context in the reasoning chain. Even when they express uncertainty and caveats within the reasoning chain, they still produce a confident final answer. We hypothesize this arises from biases in data & rewards in RLVR.

6/9

June 16, 2025 at 10:03 PM

Polina Kirichenko

@polkirichenko.bsky.social

Moreover, incorporating test-time scaling as in s1 @Muennighoff et al makes things even worse!
Allocating more reasoning budget generally improves accuracy but hurts abstention.

5/9

June 16, 2025 at 10:03 PM

Polina Kirichenko

@polkirichenko.bsky.social

Remarkably, we find that reasoning post-training hurts (!) abstention performance!
We evaluated the RLVR model from Tulu @natolambert et al, s1 and DeepSeek R1 Distill models and found consistent improvements in accuracy and drops in abstention compared to instruct models.

4/9

June 16, 2025 at 10:03 PM

Polina Kirichenko

@polkirichenko.bsky.social

We curate 20 uncertainty datasets in different scenarios and evaluate 20 frontier LLMs, and find that most scenarios remain challenging even for the best models!
This allows us to conduct a systematic study of what helps and hurts abstention performance.

3/9

June 16, 2025 at 10:03 PM

Polina Kirichenko

@polkirichenko.bsky.social

LLMs are great at solving concrete problems, but how well do they handle uncertainty? There are many questions with no direct answer!
We build a diverse benchmark spanning 6 abstention scenarios (underspecification, staleness, …) and various domains (medicine, social bias, …).

June 16, 2025 at 10:03 PM

Polina Kirichenko

@polkirichenko.bsky.social

Excited to release AbstentionBench -- our paper and benchmark on evaluating LLMs’ *abstention*: the skill of knowing when NOT to answer!

Key finding: reasoning LLMs struggle with unanswerable questions and hallucinate!

Paper: arxiv.org/abs/2506.09038
Code: github.com/facebookrese...
🧵1/9

June 16, 2025 at 10:03 PM

Polina Kirichenko

@polkirichenko.bsky.social

Join us at #CVPR2025 Demographic Diversity in Computer Vision workshop tomorrow!
📅 Wednesday, June 11, 9am-6pm
📍 room 213 (main session) + Hall D (poster sessions), the Music City Center
We have an amazing lineup of speakers and panelists! Can't wait to meet you all there :)

June 10, 2025 at 1:07 PM

Polina Kirichenko

@polkirichenko.bsky.social

We are excited to announce a workshop on Demographic Diversity in Computer Vision (DemoDiv) at #CVPR 2025!

Submit your work studying various axes of demographic diversity and fairness in models and datasets and join us in Nashville in June!
Deadline: March 31st
sites.google.com/view/cvpr-20...

February 21, 2025 at 5:22 PM
