Archiki Prasad
@archiki.bsky.social
Ph.D. Student at UNC NLP | Apple Scholar in AI/ML Ph.D. Fellowship | Prev: FAIR at Meta, AI2, Adobe (Intern) | Interests: #NLP, #ML | https://archiki.github.io/
Thanks to my coauthors: @hwang98.bsky.social @esteng.bsky.social @mohitbansal.bsky.social
for the fun collaboration!
@unccs.bsky.social
Paper: arxiv.org/abs/2504.13079
Data and Code: github.com/HanNight/RAM...
Retrieval-Augmented Generation with Conflicting Evidence
Large language model (LLM) agents are increasingly employing retrieval-augmented generation (RAG) to improve the factuality of their responses. However, in practice, these systems often need to handle...
April 18, 2025 at 5:10 PM
Can RAG systems handle imbalanced evidence or increasing misinformation?
➡️ As document support becomes imbalanced, baselines ignore under-supported correct answers but MADAM-RAG maintains stable performance
➡️ As misinformation 📈, baselines degrade sharply (−46%) but MADAM-RAG remains more robust
April 18, 2025 at 5:06 PM
How important are multi-round debate and aggregation in MADAM-RAG?
Increasing debate rounds in MADAM-RAG improves performance by allowing agents to refine their answers via debate.
The aggregator provides even greater gains, especially in early rounds, by aligning conflicting views & suppressing misinfo.
April 18, 2025 at 5:06 PM
We evaluate on 3 datasets: FaithEval (suppression of misinformation), AmbigDocs (disambiguation across sources), RAMDocs (our dataset w/ different types of conflict).
MADAM-RAG consistently outperforms concatenated-prompt and Astute RAG baselines across all three datasets and model backbones.
April 18, 2025 at 5:06 PM
We propose MADAM-RAG, a structured, multi-agent framework designed to handle inter-doc conflicts, misinformation, & noise in retrieved content, comprising:
1️⃣ Independent LLM agents, each generating an intermediate response conditioned on a single doc
2️⃣ Centralized aggregator
3️⃣ Iterative multi-round debate
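To make the loop concrete, here is a minimal sketch of how these three pieces could fit together, assuming a generic llm(prompt) completion call; the prompts and function names are illustrative, not the paper's actual implementation:

```python
# Minimal sketch of a MADAM-RAG-style loop: one agent per retrieved document,
# a centralized aggregator, and iterative multi-round debate.
# `llm(prompt) -> str` is a hypothetical stand-in for any chat/completion call.

def madam_rag(query, documents, llm, num_rounds=3):
    summary = None
    for _ in range(num_rounds):
        # Each agent answers from its single document; after the first round it
        # also sees the aggregator's summary and may revise its answer (debate).
        answers = [
            llm(
                f"Question: {query}\nDocument: {doc}\n"
                + (f"Aggregated summary so far: {summary}\n" if summary else "")
                + "Answer using only this document, revising if the summary warrants it."
            )
            for doc in documents
        ]
        # Centralized aggregator: reconcile the per-document answers, keeping every
        # well-supported answer and discarding misinformation and noise.
        summary = llm(
            f"Question: {query}\nAgent answers: {answers}\n"
            "Combine these into one response: present all genuinely supported "
            "answers and drop unsupported or conflicting claims."
        )
    return summary
```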
April 18, 2025 at 5:06 PM
📂RAMDocs is designed to reflect the complexities of real-world retrieval. It includes:
➡️ Ambiguous queries w/ multiple valid answers
➡️ Imbalanced document support (some answers backed by many sources, others by fewer)
➡️ Docs w/ misinformation (plausible but wrong claims) or noisy/irrelevant content
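To picture what one example looks like, here is a hypothetical RAMDocs-style instance sketched from the description above; the field names and the specific question are illustrative, not the released schema:

```python
# Hypothetical RAMDocs-style example: an ambiguous question with two valid
# answers, imbalanced document support, plus misinformation and noise.
example = {
    "question": "Who directed Nosferatu?",  # ambiguous: multiple valid answers
    "valid_answers": ["F. W. Murnau", "Robert Eggers"],
    "documents": [
        {"text": "... Murnau's 1922 silent classic Nosferatu ...",     "supports": "F. W. Murnau"},
        {"text": "... F. W. Murnau directed the vampire film ...",     "supports": "F. W. Murnau"},
        {"text": "... Eggers's 2024 remake of Nosferatu ...",          "supports": "Robert Eggers"},  # under-supported answer
        {"text": "... Nosferatu was directed by Alfred Hitchcock ...", "supports": None},             # misinformation
        {"text": "... vampire folklore across Eastern Europe ...",     "supports": None},             # noisy / irrelevant
    ],
}
```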
April 18, 2025 at 5:06 PM
Thanks Jaemin, learned so much from you as well!
March 27, 2025 at 8:24 PM
Thanks to my amazing co-authors @esteng.bsky.social (co-lead), @cyjustinchen.bsky.social, @codezakh.bsky.social, @mohitbansal.bsky.social
@unccs.bsky.social
Paper: arxiv.org/abs/2502.01619
Code+Datasets: github.com/archiki/UTGe...
Learning to Generate Unit Tests for Automated Debugging
Unit tests (UTs) play an instrumental role in assessing code correctness as well as providing feedback to a large language model (LLM) as it iteratively debugs faulty code, motivating automated test g...
February 4, 2025 at 7:10 PM
Lastly, we show that both test-time scaling and backtracking are crucial for UTDebug, and scaling the number of generated UTs also consistently improves code accuracy.
February 4, 2025 at 7:10 PM
Combining UTGen with UTDebug 🤝, we consistently outperform no UT feedback, randomly sampled UTs, and prompted targeted UTs across 3 models & datasets.
For partially correct code with subtle errors (our MBPP+Fix hard split) debugging with UTGen improves over baselines by >12.35% on Qwen 2.5!
February 4, 2025 at 7:10 PM
RQ3: We also propose ✨UTDebug ✨ with two key modifications:
1⃣ Test-time scaling (self-consistency over multiple samples) to increase output accuracy
2⃣ Validation & Backtracking: generate multiple UTs for validation, accept an edit only when the overall pass rate increases, & backtrack otherwise
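A rough sketch of this debugging loop, assuming hypothetical helpers generate_uts (e.g., a UTGen-style test generator), propose_fix (the LLM debugger), and pass_rate (a test runner); this illustrates the accept/backtrack rule, not the released implementation:

```python
# Sketch of UTDebug-style debugging with validation and backtracking.
# generate_uts / propose_fix / pass_rate are hypothetical stand-ins for a UT
# generator, an LLM debugger, and a sandboxed test runner, respectively.
# (Self-consistency over multiple sampled outputs would live inside generate_uts.)

def debug_with_backtracking(code, problem, generate_uts, propose_fix, pass_rate,
                            max_rounds=5, num_tests=10):
    unit_tests = generate_uts(problem, num_tests=num_tests)  # multiple UTs for validation
    best_code = code
    best_rate = pass_rate(best_code, unit_tests)
    for _ in range(max_rounds):
        if best_rate == 1.0:  # all generated UTs pass; stop early
            break
        candidate = propose_fix(best_code, problem, unit_tests)
        candidate_rate = pass_rate(candidate, unit_tests)
        # Accept the edit only if the overall pass rate improves;
        # otherwise backtrack to the previous version of the code.
        if candidate_rate > best_rate:
            best_code, best_rate = candidate, candidate_rate
    return best_code
```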
February 4, 2025 at 7:10 PM