Neel Bhandari
@neelbhandari.bsky.social
Masters Student @LTIatCMU | ML Scientist @PayPal | Open Research @CohereForAI Community | Previously External Research Student @MITIBMLab. Views my own.
11/ This paper has been an incredible effort across institutions @ltiatcmu.bsky.social @uwcse.bsky.social. Huge thanks to my co-first author @tianyucao.bsky.social and co-authors @akhilayerukola.bsky.social @akariasai.bsky.social @maartensap.bsky.social ✨🚀
April 17, 2025 at 7:55 PM
10/ 📜 Paper: "Out of Style: RAG’s Fragility to Linguistic Variation": arxiv.org/abs/2504.08231
🔬 Code: github.com/Springcty/RA...
Read our paper for more details on the impact of scaling the number of retrieved documents, the specific effects of each linguistic variation on RAG pipelines, and much more!
April 17, 2025 at 7:55 PM
9/ 🚨 Takeaway
RAG systems suffer major performance drops from simple linguistic variations.
Advanced techniques offer temporary relief, but real robustness demands fundamental changes: more resilient components and fewer cascading errors, so that these systems serve all users effectively.
April 17, 2025 at 7:55 PM
8/🛠️ Adding advanced techniques to vanilla RAG improves robustness... sometimes🫠
✅ Reranking improves performance on linguistic rewrites, but a gap relative to the original queries remains.
⚠️ HyDE helps rewritten queries but hurts original ones, creating a false sense of robustness.
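For concreteness, a minimal Python sketch of the two add-ons above; the retriever, embedder, LLM, and cross-encoder are passed in as placeholder callables, so this illustrates the general techniques rather than the exact pipeline in the released code.

```python
# Sketch of reranking and HyDE with placeholder hooks (cross_encoder_score,
# llm_generate, embed, search are hypothetical callables, not a specific API).

def rerank(query, candidates, cross_encoder_score, top_k=5):
    """Re-order first-stage retrieval candidates by a cross-encoder relevance score."""
    ranked = sorted(candidates, key=lambda doc: cross_encoder_score(query, doc), reverse=True)
    return ranked[:top_k]

def hyde_retrieve(query, llm_generate, embed, search, top_k=5):
    """HyDE: retrieve with the embedding of an LLM-written hypothetical answer
    instead of the raw (possibly noisy) query embedding."""
    hypothetical_doc = llm_generate(f"Write a short passage that answers: {query}")
    return search(embed(hypothetical_doc), top_k=top_k)
```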
April 17, 2025 at 7:55 PM
7/🤔Well, maybe scaling generation model size helps?
Scaling up LLM size helps narrow the performance gap between original and rewritten queries. However, this is not consistent across variations. Larger models occasionally worsen the impact, particularly with RTT variations.
April 17, 2025 at 7:55 PM
6/⚖️ RAG is more fragile than LLM-only setups
RAG’s retrieval-generation pipeline compounds the effect of linguistic variation, leading to larger performance drops. On PopQA, RAG degrades by 23% vs. just 11% for the LLM-only setup.
⚠️ The main culprit? Retrieval emerges as the weakest link.
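How to read these numbers: treating them as relative drops (an assumption here about the reporting), a tiny worked example with made-up absolute accuracies; only the 23% and 11% figures come from the result above.

```python
# Worked example of a relative performance drop. The absolute accuracies are
# made-up placeholders; only the ~23% (RAG) and ~11% (LLM-only) relative drops
# correspond to the PopQA result quoted above.
def relative_drop_pct(original_score, rewritten_score):
    return 100.0 * (original_score - rewritten_score) / original_score

print(round(relative_drop_pct(0.52, 0.40), 1))  # RAG: ~23% relative drop
print(round(relative_drop_pct(0.36, 0.32), 1))  # LLM-only: ~11% relative drop
```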
April 17, 2025 at 7:55 PM
5/🧩 Generation Fragility
Linguistic variations lead to generation accuracy drops: Exact Match score down by up to ~41%, Answer Match score down by up to ~17%.
Structural changes from RTT are particularly damaging, significantly reducing response accuracy.
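For reference, a sketch of how these two metrics are commonly computed; the normalization follows the standard SQuAD-style recipe, and the containment-based "Answer Match" below is an assumption rather than necessarily the paper's exact definition.

```python
# Sketch of the two generation metrics named above. Normalization is the common
# SQuAD-style recipe; answer_match is assumed to be a looser containment check,
# which may differ from the paper's exact definition.
import re
import string

def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    return any(normalize(prediction) == normalize(g) for g in gold_answers)

def answer_match(prediction: str, gold_answers: list[str]) -> bool:
    return any(normalize(g) in normalize(prediction) for g in gold_answers)
```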
April 17, 2025 at 7:55 PM
4/📌Retrieval Robustness
Retrieval recall drops by up to 40.41% under linguistic variations, especially for informal queries. Grammatical errors from RTT and typos also notably degrade performance, highlighting retrievers’ sensitivity to these variations.
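For context, a minimal sketch of how such a recall drop would be measured, assuming the standard "any gold passage in the top-k" definition of retrieval recall (the helper names are hypothetical):

```python
# Minimal recall@k sketch under the "any gold passage in the top-k" definition
# (an assumption; helper names are hypothetical).
def recall_at_k(retrieved_ids, gold_ids, k=5):
    gold = set(gold_ids)
    return float(any(doc_id in gold for doc_id in retrieved_ids[:k]))

def mean_recall(runs, k=5):
    # runs: iterable of (retrieved_ids, gold_ids) pairs for one query set,
    # e.g. original queries vs. their informal rewrites.
    runs = list(runs)
    return sum(recall_at_k(r, g, k) for r, g in runs) / len(runs)
```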
April 17, 2025 at 7:55 PM
3/ We evaluate across an extensive experimental setup:
🧲 2 Retrievers (Contriever, ModernBERT)
🤖 9 open LLMs (3B–72B)
📚 4 QA datasets (MS MARCO, PopQA, Natural Questions, EntityQuestions)
🔁 50K+ linguistically varied queries per dataset
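Roughly, the resulting evaluation grid has the shape sketched below; the names and loader/retriever/generator hooks are placeholders, and the real configurations live in the code release linked in post 10/ above.

```python
# Rough shape of the evaluation grid (placeholder names and injected callables;
# not the released code). The generator list stands in for the 9 open LLMs (3B-72B).
RETRIEVERS = ["contriever", "modernbert"]
GENERATORS = ["llm-3b", "llm-8b", "llm-72b"]
DATASETS   = ["msmarco", "popqa", "nq", "entityquestions"]
VARIATIONS = ["original", "informal", "low_readability", "polite", "typos", "rtt"]

def run_grid(load_queries, retrieve, generate, score):
    """load_queries/retrieve/generate/score are hypothetical hooks supplied by the caller."""
    results = {}
    for dataset in DATASETS:
        for variation in VARIATIONS:
            queries = load_queries(dataset, variation)  # 50K+ varied queries per dataset
            for retriever in RETRIEVERS:
                for llm in GENERATORS:
                    answers = [generate(llm, q, retrieve(retriever, q)) for q in queries]
                    results[(dataset, variation, retriever, llm)] = score(dataset, answers)
    return results
```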
April 17, 2025 at 7:55 PM
2/🔍 We evaluated RAG robustness against four common linguistic variations:
✍️ Lower formality
📉 Lower readability
🙂 Increased politeness
🔤 Grammatical errors (from typos & from round-trip translations (RTT))
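As a toy illustration of the typo flavor only (the paper's query variations come from its own rewriting pipeline, e.g. round-trip translation for RTT, so this is not that pipeline), a character-swap perturbation could look like:

```python
# Toy typo-style perturbation: randomly swap adjacent alphabetic characters.
# Illustrative only; not the paper's rewriting pipeline.
import random

def add_typos(query: str, swap_prob: float = 0.1, seed: int = 0) -> str:
    rng = random.Random(seed)
    chars = list(query)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < swap_prob:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return "".join(chars)

print(add_typos("who wrote the declaration of independence"))
```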
April 17, 2025 at 7:55 PM