Hailey Joren
@haileyjoren.bsky.social
PhD Student @ UC San Diego

Researching reliable, interpretable, and human-aligned ML/AI
I couldn't make it to ICLR this year but co-author @cyroid.bsky.social will be around to chat!
📄 Paper (ICLR ’25): arxiv.org/abs/2411.06037
💻 Key Findings & Prompts: github.com/hljoren/suff...
#RAG #ICLR2025
Sufficient Context: A New Lens on Retrieval Augmented Generation Systems
April 24, 2025 at 6:18 PM
Our work suggests that solving RAG hallucination requires moving beyond better retrieval alone: we also need models that can accurately judge when the retrieved information is sufficient to answer, and abstain when their confidence doesn't clear an appropriate threshold.
April 24, 2025 at 6:18 PM
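To make the "judge when retrieved information suffices" idea concrete, here is a minimal Python sketch of an LLM-based sufficiency judge. It is not the paper's prompt or code; the prompt wording and the generic `complete(prompt)` callable are assumptions for illustration (the actual prompts are in the linked GitHub repo).

```python
# Hypothetical sketch of a sufficient-context judge, not the paper's implementation.
# `complete` is any text-completion callable (e.g. a thin wrapper around an LLM API).

SUFFICIENCY_PROMPT = """You are given a question and a retrieved context.
Reply with a single word, Sufficient or Insufficient: is the context alone
enough to determine the answer to the question?

Question: {question}
Context: {context}
Label:"""


def is_context_sufficient(question: str, context: str, complete) -> bool:
    """Return True if the judge model labels the context as sufficient."""
    label = complete(SUFFICIENCY_PROMPT.format(question=question, context=context))
    return label.strip().lower().startswith("sufficient")
```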
Building on these insights, we developed a selective generation framework that uses both sufficient context signals and model confidence to decide when to respond vs. abstain, improving the accuracy of responses by 2-10% for Gemini, GPT, and Gemma.
April 24, 2025 at 6:18 PM
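As a hedged illustration of the selective generation idea in the post above: combine the sufficiency label with the model's own confidence and answer only when both signals look good. The `generate` callable returning (answer, confidence), the 0.7 threshold, and abstaining by returning None are assumptions, not the paper's exact method.

```python
from typing import Callable, Optional


def selective_answer(
    question: str,
    context: str,
    generate: Callable[[str], tuple],               # assumed to return (answer, confidence in [0, 1])
    sufficiency_judge: Callable[[str, str], bool],  # e.g. the judge sketched above
    conf_threshold: float = 0.7,                    # illustrative threshold; tune on held-out data
) -> Optional[str]:
    """Respond only when the context looks sufficient AND the model is confident; else abstain."""
    prompt = f"Context: {context}\n\nQuestion: {question}\nAnswer:"
    answer, confidence = generate(prompt)
    if sufficiency_judge(question, context) and confidence >= conf_threshold:
        return answer
    return None  # abstain rather than risk a confidently wrong answer
```

A fixed AND-rule is just the simplest way to combine the two signals; in practice the combination and the threshold would be tuned on held-out data.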
Intriguingly, models sometimes generate correct answers despite insufficient context. We taxonomize these cases: parametric knowledge bridging gaps in the retrieved information, yes/no questions with a 50% chance of guessing correctly, and instances where the context provides partial reasoning paths.
April 24, 2025 at 6:18 PM
We analyzed standard QA datasets through our sufficient context lens and found that a surprisingly large share of instances lack sufficient information: ~56% for MuSiQue, ~56% for HotpotQA, and ~23% for FreshQA. This highlights the magnitude of the information retrieval challenge.
April 24, 2025 at 6:18 PM
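Purely to illustrate the kind of dataset-level analysis behind percentages like those above: label each (question, retrieved context) pair with a sufficiency judge and report the insufficient share. The dict keys are an assumed schema, not the benchmarks' actual field names.

```python
def insufficient_context_rate(dataset, sufficiency_judge) -> float:
    """Fraction of examples whose retrieved context is judged insufficient.

    `dataset` is an iterable of dicts with 'question' and 'context' keys (assumed
    schema); `sufficiency_judge(question, context) -> bool` is e.g. the judge above.
    """
    total = 0
    insufficient = 0
    for example in dataset:
        total += 1
        if not sufficiency_judge(example["question"], example["context"]):
            insufficient += 1
    return insufficient / total if total else 0.0
```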
Conversely, smaller models (Mistral 3, Gemma 2) struggle even when context is sufficient, either hallucinating or failing to extract the answer from the provided information. Neither class of model, on its own, solves the fundamental RAG reliability challenge.
April 24, 2025 at 6:18 PM
A major finding: When context is sufficient, larger models (Gemini 1.5 Pro, GPT-4o, Claude 3.5) excel. But when it's insufficient, they're more likely to hallucinate than abstain—presenting incorrect answers with high confidence.
April 24, 2025 at 6:18 PM