Scaling up LLM size helps narrow the performance gap between original and rewritten queries. However, the benefit is not consistent across variation types: larger models occasionally make the degradation worse, particularly with RTT variations.
Linguistic variations lead to drops in generation accuracy: Exact Match score falls by up to ~41%, Answer Match score by up to ~17%.
Structural changes from RTT are particularly damaging, significantly reducing response accuracy.
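For readers who want to reproduce this kind of comparison, here is a minimal sketch of how Exact Match and Answer Match are commonly computed. The normalization (lowercasing, stripping punctuation and articles) follows the usual SQuAD-style convention, and treating Answer Match as "a gold answer appears inside the response" is an assumption here, not something the thread specifies. The helper names, including `relative_drop`, are hypothetical.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize(prediction)
    return any(pred == normalize(gold) for gold in gold_answers)


def answer_match(prediction: str, gold_answers: list[str]) -> bool:
    """True if any non-empty normalized gold answer occurs inside the prediction.

    Assumed definition; the thread does not spell out how Answer Match is scored.
    """
    pred = normalize(prediction)
    golds = [normalize(gold) for gold in gold_answers]
    return any(gold and gold in pred for gold in golds)


def relative_drop(original_score: float, rewritten_score: float) -> float:
    """Fraction of the original score lost after rewriting the queries."""
    return (original_score - rewritten_score) / original_score
```

Whether the percentages quoted in the thread are relative or absolute drops is not stated, so `relative_drop` is only one possible reading.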
Retrieval recall drops by up to 40.41% due to linguistic variations, especially when exposed to informal queries. Grammatical errors from RTT and typos notably degrade performance, highlighting retrievers’ sensitivity to a number of linguistic variations.
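As a rough illustration of the retrieval metric, the sketch below computes Recall@k for a single query and averages it over a query set. The function names and the document-ID representation are hypothetical, and the cutoff `k` used in the actual experiments is not given in the thread.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Share of the relevant documents that appear in the top-k retrieved list."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average Recall@k over (retrieved_ids, relevant_ids) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)
```

Running `mean_recall_at_k` once on the original queries and once on the rewritten ones, with the same retriever and `k`, gives the two numbers whose gap the thread is describing.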
✍️ Lower formality
📉 Lower readability
🙂 Increased politeness
🔤 Grammatical errors (from typos & from round-trip translations (RTT))
RAG systems excel on academic benchmarks - but are they robust to variations in linguistic style?
We find RAG systems are brittle. Small shifts in phrasing trigger cascading errors, driven by the complexity of the RAG pipeline 🧵