- the chocolate here is terrible for no good reason
- hotel breakfasts never have any baked beans, which are way underappreciated here (they are delicious and add much-needed moisture to a cooked breakfast)
- the temperature in summer is inhumane
Think that covers the main stuff 😍
arxiv.org/abs/2505.20209
P.S. if you know about a paper improving NLI model robustness not already in our related work appendix, I would love to hear about it 🥰
Our best method (Uncertainty Sampling) picked examples with the most uncertain predictions. This identified challenging examples, but without too much label noise
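The idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: entropy over the model's class probabilities is assumed as the uncertainty measure, and `predict_probs` is a hypothetical stand-in for whatever model produces NLI predictions.

```python
import math

def prediction_entropy(probs):
    """Shannon entropy of a class-probability distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def uncertainty_sample(examples, predict_probs, k):
    """Keep the k examples whose model predictions are most uncertain."""
    scored = [(prediction_entropy(predict_probs(ex)), ex) for ex in examples]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [ex for _, ex in scored[:k]]

# Toy usage with made-up 3-class NLI probabilities
# (entailment / neutral / contradiction) for three premise-hypothesis pairs.
examples = ["pair_a", "pair_b", "pair_c"]
fake_probs = {"pair_a": [0.98, 0.01, 0.01],   # confident -> low entropy
              "pair_b": [0.34, 0.33, 0.33],   # near-uniform -> high entropy
              "pair_c": [0.70, 0.20, 0.10]}
selected = uncertainty_sample(examples, fake_probs.get, 2)
print(selected)  # -> ['pair_b', 'pair_c']
```

Selecting by entropy naturally surfaces examples near the decision boundary while still avoiding the pathological cases a pure "hardest examples" criterion (e.g. highest loss) would pick up, which matches the "challenging but not noisy" behaviour described above.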
We find that generating more challenging synthetic data (Long & Complex Generation) helps retain performance on harder OOD datasets, while still achieving gains on easier OOD data
See Standard-OOD scores below (avg), where the simplest LLM-generated data (Short & Simple Generation) performed best, with substantial improvements
This involved sampling methods to choose more complex examples in our training data, and generating new synthetic examples
Some methods were pretty fun, e.g. asking an LLM to assess the difficulty of training examples
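The LLM-as-difficulty-judge idea might look roughly like the sketch below. Everything here is an assumption for illustration: `ask_llm` is a hypothetical stand-in for any chat-completion call (stubbed out so the snippet runs offline), and the 1-5 rating prompt is invented, not taken from the paper.

```python
def ask_llm(prompt):
    # Hypothetical stub: a real implementation would call an LLM API here.
    return "3"

# Invented prompt for illustration; the paper's actual prompt may differ.
DIFFICULTY_PROMPT = (
    "Rate how difficult this NLI example is, from 1 (trivial) to 5 (very hard).\n"
    "Premise: {premise}\n"
    "Hypothesis: {hypothesis}\n"
    "Answer with a single digit."
)

def difficulty_score(premise, hypothesis):
    """Ask the LLM to rate one training example's difficulty."""
    reply = ask_llm(DIFFICULTY_PROMPT.format(premise=premise, hypothesis=hypothesis))
    # Parse the leading digit; real code should validate/clamp the reply.
    return int(reply.strip()[0])

score = difficulty_score("A man plays guitar.", "Someone makes music.")
print(score)  # -> 3 (from the stub)
```

Scores like this can then drive the sampling step, e.g. keeping only examples rated above some difficulty threshold.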
We find that fine-tuned LLMs are substantially more robust than commonly used encoder models, despite being fine-tuned on 50× less data.
This is especially the case on challenging OOD datasets (see Challenge-OOD avg below)
Here's a 45 second rundown of what we found!
Interested in the Washington Post article — would you mind sharing a link?