Archiki Prasad
@archiki.bsky.social
Ph.D. Student at UNC NLP | Apple Scholar in AI/ML Ph.D. Fellowship | Prev: FAIR at Meta, AI2, Adobe (Intern) | Interests: #NLP, #ML | https://archiki.github.io/
Can RAG systems handle imbalanced evidence or increasing misinformation?

➡️ As document support becomes imbalanced, baselines ignore under-supported correct answers, but MADAM-RAG maintains stable performance

➡️ As misinformation 📈, baselines degrade sharply (−46%), but MADAM-RAG remains more robust
April 18, 2025 at 5:06 PM
How important are multi-round debate and aggregation in MADAM-RAG?

Increasing the number of debate rounds in MADAM-RAG improves performance by letting agents iteratively refine their answers.

The aggregator provides even greater gains, especially in early rounds, by aligning conflicting views & suppressing misinfo.
April 18, 2025 at 5:06 PM
We evaluate on 3 datasets: FaithEval (suppression of misinformation), AmbigDocs (disambiguation across sources), and RAMDocs (our dataset w/ different types of conflict).

MADAM-RAG consistently outperforms concatenated-prompt and Astute RAG baselines across all three datasets and model backbones.
April 18, 2025 at 5:06 PM
We propose MADAM-RAG, a structured, multi-agent framework designed to handle inter-doc conflicts, misinformation, & noise in retrieved content, comprising:

1️⃣ Independent LLM agents - each generates an intermediate response conditioned on a single doc
2️⃣ Centralized aggregator - synthesizes the per-doc responses into a final answer
3️⃣ Iterative multi-round debate (sketch below)
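A minimal sketch of how these three pieces could fit together, assuming a hypothetical `llm(prompt: str) -> str` completion helper. Prompts, round count, and stopping logic are illustrative, not the paper's exact recipe:

```python
# Minimal sketch of a MADAM-RAG-style loop (illustrative, not the official code).
# Assumes a hypothetical `llm(prompt: str) -> str` completion helper.

def madam_rag(query: str, docs: list[str], llm, num_rounds: int = 3) -> str:
    # 1) Independent agents: each answers from its single document.
    responses = [
        llm(f"Document:\n{doc}\n\nQuestion: {query}\n"
            "Answer using only this document:")
        for doc in docs
    ]
    summary = ""
    for _ in range(num_rounds):
        # 2) Centralized aggregator: reconcile per-document answers,
        #    keeping distinct answers for ambiguous queries and
        #    discarding unsupported or misinformed claims.
        summary = llm(
            f"Question: {query}\nPer-document answers:\n"
            + "\n".join(f"- {r}" for r in responses)
            + "\nAggregate these answers, resolving conflicts:"
        )
        # 3) Multi-round debate: each agent sees the aggregate and
        #    revises (or defends) its answer, still grounded in its doc.
        responses = [
            llm(f"Document:\n{doc}\n\nQuestion: {query}\n"
                f"Your previous answer: {resp}\n"
                f"Aggregator summary: {summary}\n"
                "Revise your answer if warranted, using only your document:")
            for doc, resp in zip(docs, responses)
        ]
    return summary
```

With `num_rounds=1` this reduces to answer-then-aggregate; the ablation above suggests both the extra rounds and the aggregator contribute.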
April 18, 2025 at 5:06 PM
📂 RAMDocs is designed to reflect the complexities of real-world retrieval (toy example below). It includes:

➡️ Ambiguous queries w/ multiple valid answers
➡️ Imbalanced document support (some answers backed by many sources, others by fewer)
➡️ Docs w/ misinformation (plausible but wrong claims) or noisy/irrelevant content
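To make these conflict types concrete, here is a hand-constructed, hypothetical example of what a RAMDocs-style entry might look like. Field names and document snippets are illustrative, not the dataset's actual schema:

```python
# Hypothetical RAMDocs-style entry (illustrative schema, not the real one).
# "The Art of War" is genuinely ambiguous: Sun Tzu and Machiavelli each
# wrote a book by that title.
example = {
    "query": "Who wrote The Art of War?",
    "valid_answers": ["Sun Tzu", "Niccolò Machiavelli"],
    "documents": [
        # Imbalanced support: Sun Tzu is backed by two docs, Machiavelli by one.
        {"text": "Sun Tzu's The Art of War dates to the 5th century BC...", "label": "supports: Sun Tzu"},
        {"text": "The ancient Chinese treatise by strategist Sun Tzu...", "label": "supports: Sun Tzu"},
        {"text": "Machiavelli published his The Art of War in 1521...", "label": "supports: Machiavelli"},
        # Misinformation: plausible but wrong claim.
        {"text": "The Art of War was authored by Confucius...", "label": "misinformation"},
        # Noise: retrieved but irrelevant.
        {"text": "Medieval siege engines evolved rapidly after 1200...", "label": "noise"},
    ],
}
```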
April 18, 2025 at 5:06 PM
Thanks Elias!!
March 27, 2025 at 8:24 PM
Thanks Jaemin, learned so much from you as well!
March 27, 2025 at 8:24 PM
Lastly, we show that both test-time scaling and backtracking are crucial for UTDebug, and scaling the number of generated UTs also consistently improves code accuracy.
February 4, 2025 at 7:10 PM
Combining UTGen with UTDebug 🤝, we consistently outperform baselines that use no UT feedback, randomly sampled UTs, or prompted targeted UTs across 3 models & datasets.

For partially correct code with subtle errors (our MBPP+Fix hard split), debugging with UTGen improves over baselines by >12.35% on Qwen 2.5!
February 4, 2025 at 7:10 PM
RQ3: We also propose ✨UTDebug✨ with two key modifications:

1️⃣ Test-time scaling (self-consistency over multiple samples) to increase output accuracy

2️⃣ Validation & backtracking: generate multiple UTs for validation, accept an edit only when the overall pass rate increases, & backtrack otherwise (sketch below)
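A rough sketch of the accept/backtrack loop under stated assumptions: `generate_uts`, `pass_rate`, and `propose_fix` are hypothetical stand-ins for UT generation (e.g., via UTGen), test execution, and the LLM debugger; the self-consistency sampling from 1️⃣ is omitted for brevity:

```python
# Sketch of UTDebug-style validation & backtracking (hypothetical helpers).
# `generate_uts(code)`      -> list of unit tests (e.g., from UTGen)
# `pass_rate(code, uts)`    -> fraction of `uts` that `code` passes
# `propose_fix(code, fails)`-> LLM-edited candidate program

def utdebug(code, generate_uts, pass_rate, propose_fix, max_rounds: int = 5):
    unit_tests = generate_uts(code)
    best_code, best_rate = code, pass_rate(code, unit_tests)
    for _ in range(max_rounds):
        if best_rate == 1.0:
            break  # all generated UTs pass; stop debugging
        failing = [ut for ut in unit_tests if pass_rate(best_code, [ut]) < 1.0]
        candidate = propose_fix(best_code, failing)
        rate = pass_rate(candidate, unit_tests)
        # Accept the edit only if the overall pass rate improves;
        # otherwise backtrack to the previous best program.
        if rate > best_rate:
            best_code, best_rate = candidate, rate
    return best_code
```

Test-time scaling would replace the single `propose_fix` call (and each UT's expected output) with self-consistency over multiple samples.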
February 4, 2025 at 7:10 PM