Hailey Joren
@haileyjoren.bsky.social
PhD Student @ UC San Diego

Researching reliable, interpretable, and human-aligned ML/AI
I couldn't make it to ICLR this year but co-author @cyroid.bsky.social will be around to chat!
📄 Paper (ICLR ’25): arxiv.org/abs/2411.06037
💻 Key Findings & Prompts: github.com/hljoren/suff...
#RAG #ICLR2025
Sufficient Context: A New Lens on Retrieval Augmented Generation Systems
April 24, 2025 at 6:18 PM
Our work suggests that solving RAG hallucination requires moving beyond better retrieval alone: we also need models that can accurately judge when the retrieved information is sufficient to answer, and abstain when their confidence doesn't clear an appropriate threshold.
April 24, 2025 at 6:18 PM
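To make the "judge when retrieved information suffices" idea concrete, here is a minimal Python sketch of an LLM-based sufficiency judge. It is not the paper's prompt or code; the prompt wording and the generic `complete(prompt)` callable are assumptions for illustration (the actual prompts are in the linked GitHub repo).

```python
# Hypothetical sketch of a sufficient-context judge, not the paper's implementation.
# `complete` is any text-completion callable (e.g. a thin wrapper around an LLM API).

SUFFICIENCY_PROMPT = """You are given a question and a retrieved context.
Reply with a single word, Sufficient or Insufficient: is the context alone
enough to determine the answer to the question?

Question: {question}
Context: {context}
Label:"""


def is_context_sufficient(question: str, context: str, complete) -> bool:
    """Return True if the judge model labels the context as sufficient."""
    label = complete(SUFFICIENCY_PROMPT.format(question=question, context=context))
    return label.strip().lower().startswith("sufficient")
```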
Building on these insights, we developed a selective generation framework that uses both sufficient context signals and model confidence to decide when to respond vs. abstain, improving the accuracy of responses by 2-10% for Gemini, GPT, and Gemma.
April 24, 2025 at 6:18 PM
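As a hedged illustration of the selective generation idea in the post above: combine the sufficiency label with the model's own confidence and answer only when both signals look good. The `generate` callable returning (answer, confidence), the 0.7 threshold, and abstaining by returning None are assumptions, not the paper's exact method.

```python
from typing import Callable, Optional


def selective_answer(
    question: str,
    context: str,
    generate: Callable[[str], tuple],               # assumed to return (answer, confidence in [0, 1])
    sufficiency_judge: Callable[[str, str], bool],  # e.g. the judge sketched above
    conf_threshold: float = 0.7,                    # illustrative threshold; tune on held-out data
) -> Optional[str]:
    """Respond only when the context looks sufficient AND the model is confident; else abstain."""
    prompt = f"Context: {context}\n\nQuestion: {question}\nAnswer:"
    answer, confidence = generate(prompt)
    if sufficiency_judge(question, context) and confidence >= conf_threshold:
        return answer
    return None  # abstain rather than risk a confidently wrong answer
```

A fixed AND-rule is just the simplest way to combine the two signals; in practice the combination and the threshold would be tuned on held-out data.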
Intriguingly, models sometimes generate correct answers despite insufficient context. We taxonomize these cases: parametric knowledge bridging gaps in the retrieved information, yes/no questions with a 50% chance of guessing correctly, and instances where the context provides partial reasoning paths.
April 24, 2025 at 6:18 PM
We analyzed standard QA datasets through our sufficient context lens and found that a surprisingly large share of instances lack sufficient information: ~56% for MuSiQue, ~56% for HotpotQA, and ~23% for FreshQA. This highlights the magnitude of the information retrieval challenge.
April 24, 2025 at 6:18 PM
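Purely to illustrate the kind of dataset-level analysis behind percentages like those above: label each (question, retrieved context) pair with a sufficiency judge and report the insufficient share. The dict keys are an assumed schema, not the benchmarks' actual field names.

```python
def insufficient_context_rate(dataset, sufficiency_judge) -> float:
    """Fraction of examples whose retrieved context is judged insufficient.

    `dataset` is an iterable of dicts with 'question' and 'context' keys (assumed
    schema); `sufficiency_judge(question, context) -> bool` is e.g. the judge above.
    """
    total = 0
    insufficient = 0
    for example in dataset:
        total += 1
        if not sufficiency_judge(example["question"], example["context"]):
            insufficient += 1
    return insufficient / total if total else 0.0
```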
Conversely, smaller models (Mistral 3, Gemma 2) struggle even when context is sufficient, either hallucinating or failing to extract the answer from the provided information. Neither class of model, on its own, solves the fundamental RAG reliability challenge.
April 24, 2025 at 6:18 PM
A major finding: When context is sufficient, larger models (Gemini 1.5 Pro, GPT-4o, Claude 3.5) excel. But when it's insufficient, they're more likely to hallucinate than abstain—presenting incorrect answers with high confidence.
April 24, 2025 at 6:18 PM