Neel Bhandari
neelbhandari.bsky.social
Masters Student @LTIatCMU | ML Scientist @PayPal | Open Research @CohereForAI Community | Previously External Research Student @MITIBMLab. Views my own.
7/🤔 Well, maybe scaling up the generation model helps?

Scaling up LLM size helps narrow the performance gap between original and rewritten queries. However, this is not consistent across variations. Larger models occasionally worsen the impact, particularly with RTT variations.
April 17, 2025 at 7:55 PM
5/🧩 Generation Fragility

Linguistic variations lead to drops in generation accuracy: Exact Match score falls by up to ~41%, and Answer Match score by up to ~17%.

Structural changes from RTT are particularly damaging, significantly reducing response accuracy.
April 17, 2025 at 7:55 PM
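A minimal sketch of how the two generation metrics above are commonly computed. The thread does not define them, so this assumes the usual open-domain QA conventions: Exact Match requires the normalized prediction to equal a gold answer, while Answer Match only requires a gold answer to appear somewhere in the prediction. The function names and the normalization steps are illustrative assumptions, not the paper's code.

```python
import re
import string


def normalize(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace.

    Assumed normalization (SQuAD-style); the paper may normalize differently.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold_answers):
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize(prediction)
    return any(pred == normalize(gold) for gold in gold_answers)


def answer_match(prediction, gold_answers):
    """True if any non-empty normalized gold answer is a substring of the prediction."""
    pred = normalize(prediction)
    golds = [normalize(gold) for gold in gold_answers]
    return any(gold and gold in pred for gold in golds)
```

Under these assumptions, a verbose but correct response passes Answer Match and fails Exact Match, which is one reason the two scores can degrade by different amounts when the query is rewritten.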
4/📌 Retrieval Robustness

Retrieval recall drops by up to 40.41% under linguistic variations, especially for informal queries. Grammatical errors from RTT and typos also notably degrade performance, highlighting retrievers’ sensitivity to several kinds of linguistic variation.
April 17, 2025 at 7:55 PM
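A minimal sketch of how a recall drop like the one above could be measured, assuming Recall@k over gold passage IDs and comparing an original query against its rewritten variant. The thread does not say whether the 40.41% figure is an absolute or a relative drop, so the sketch computes both. All function names and IDs are hypothetical.

```python
def recall_at_k(retrieved_ids, gold_ids, k):
    """Fraction of gold passages that appear among the top-k retrieved IDs."""
    if not gold_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(gold_ids)) / len(gold_ids)


def absolute_drop(original_recall, rewritten_recall):
    """Drop in percentage points between original and rewritten queries."""
    return 100.0 * (original_recall - rewritten_recall)


def relative_drop(original_recall, rewritten_recall):
    """Drop as a percentage of the original recall."""
    if original_recall == 0:
        return 0.0
    return 100.0 * (original_recall - rewritten_recall) / original_recall


# Hypothetical retrieval results for one query and its informal rewrite.
gold = ["d1", "d2"]
original = recall_at_k(["d1", "d2", "d9"], gold, k=2)   # both gold passages found
rewritten = recall_at_k(["d1", "d7", "d2"], gold, k=2)  # one gold passage falls out of the top 2
```

In practice these per-query values would be averaged over the whole evaluation set before computing the drop.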
2/🔍 We evaluated RAG robustness against four common linguistic variations:
✍️ Lower formality
📉 Lower readability
🙂 Increased politeness
🔤 Grammatical errors (from typos and from round-trip translation, RTT)
April 17, 2025 at 7:55 PM
1/🚨 𝗡𝗲𝘄 𝗽𝗮𝗽𝗲𝗿 𝗮𝗹𝗲𝗿𝘁 🚨
RAG systems excel on academic benchmarks - but are they robust to variations in linguistic style?

We find RAG systems are brittle. Small shifts in phrasing trigger cascading errors, driven by the complexity of the RAG pipeline 🧵
April 17, 2025 at 7:55 PM