Joe Stacey
@joestacey.bsky.social
NLP PhD student at Imperial College London and Apple AI/ML Scholar.
This was just embarrassing. Shame on everyone who works on Grok…
November 15, 2025 at 11:10 AM
Congratulations!! Awesome that you'll be in Europe!
July 22, 2025 at 7:49 PM
The bad:

- the chocolate here is terrible for no good reason
- hotel breakfasts never have any baked beans, which are way underappreciated here (they are delicious and add much-needed moisture to a cooked breakfast)
- the temperature in summer is inhumane

Think that covers the main stuff 😍
July 17, 2025 at 11:24 AM
This work was really fun and a great last paper for my PhD. Check it out 🙂 Massive thanks to all my amazing collaborators!

arxiv.org/abs/2505.20209

P.S. if you know about a paper improving NLI model robustness not already in our related work appendix, I would love to hear about it 🥰
How to Improve the Robustness of Closed-Source Models on NLI
arxiv.org
May 27, 2025 at 3:50 PM
5) The best way to improve performance on the hardest OOD data was to choose more challenging training examples

Our best method (Uncertainty Sampling) picked the examples with the most uncertain model predictions, surfacing challenging examples without introducing too much label noise
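A minimal sketch of what entropy-based uncertainty sampling could look like (the helper below and its details are illustrative, not the paper's implementation):

```python
import numpy as np

def uncertainty_sample(probs: np.ndarray, budget: int) -> np.ndarray:
    """Pick the `budget` examples whose predicted NLI label
    distributions have the highest entropy (most uncertain)."""
    # probs: (n_examples, 3) probabilities for entailment / neutral / contradiction
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    # Sort by descending entropy and keep the top `budget` indices
    return np.argsort(-entropy)[:budget]

# e.g. select a 10k-example training set from a larger candidate pool:
# selected_idx = uncertainty_sample(model_probs, budget=10_000)
```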
May 27, 2025 at 3:50 PM
4) Creating more complex synthetic data avoids a loss in performance on harder OOD datasets

We find that generating more challenging synthetic data (Long & Complex Generation) helps retain performance on harder OOD datasets, while still achieving gains on easier OOD data
May 27, 2025 at 3:50 PM
3) Replacing some training examples with LLM-generated data proved very effective on less challenging OOD data

See Standard-OOD scores below (avg), where the simplest LLM-generated data (Short & Simple Generation) performed best, with substantial improvements
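As a rough illustration of the two generation styles (the prompt wording below is my own guess; only the Short & Simple vs Long & Complex distinction comes from the thread):

```python
# Illustrative prompt templates for generating synthetic NLI examples.
# The wording is hypothetical, not the paper's actual prompts.
SHORT_SIMPLE = """Write a short, simple NLI example with the label "{label}".
Use a one-sentence premise and a one-sentence hypothesis.
Return JSON with keys "premise" and "hypothesis"."""

LONG_COMPLEX = """Write a challenging NLI example with the label "{label}".
The premise should be several sentences describing a nuanced scenario,
and the hypothesis should require reasoning over the premise rather than
relying on word overlap.
Return JSON with keys "premise" and "hypothesis"."""

def generation_prompt(label: str, complex_data: bool = True) -> str:
    """Build the generation prompt for a target NLI label."""
    template = LONG_COMPLEX if complex_data else SHORT_SIMPLE
    return template.format(label=label)
```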
May 27, 2025 at 3:50 PM
2) We experiment with 6+ ways to improve robustness:

This involved sampling methods to pick more complex examples from our training data, and generating new synthetic examples

Some methods were pretty fun, e.g. asking an LLM to assess the difficulty of training examples
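A hypothetical sketch of that last idea (the prompt wording and 1-5 scale are my own, not the paper's):

```python
# Hypothetical prompt for LLM-based difficulty scoring of NLI training
# examples; the wording and the 1-5 scale are illustrative only.
DIFFICULTY_PROMPT = """On a scale of 1 (trivial) to 5 (very hard), how hard is it
to decide whether the hypothesis is entailed by, neutral to, or contradicted
by the premise?

Premise: {premise}
Hypothesis: {hypothesis}

Answer with a single number."""

def difficulty_prompt(premise: str, hypothesis: str) -> str:
    """Fill in one training example; send the result to an LLM and parse
    the integer it returns as a difficulty score."""
    return DIFFICULTY_PROMPT.format(premise=premise, hypothesis=hypothesis)
```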
May 27, 2025 at 3:50 PM
1) It's time to stop using fine-tuned encoder models:

We find that fine-tuned LLMs are substantially more robust than commonly used encoder models, despite being fine-tuned on 50x less data.

This is especially the case on challenging OOD datasets (see Challenge-OOD avg below)
May 27, 2025 at 3:50 PM
The paper tries to improve the robustness of closed-source LLMs fine-tuned on NLI, assuming a realistic training budget of 10k training examples.

Here's a 45-second rundown of what we found!
May 27, 2025 at 3:50 PM
I’d personally just love to see more negative results from nice ideas that didn’t quite work out. I feel like there’s probably a bunch of cool stuff people have tried out and discarded that could be made to work across multiple papers. Would be fun and interesting too
May 18, 2025 at 3:48 PM
Was worried it was just me hating on it so much 🤣
May 18, 2025 at 11:01 AM
I’d love to see more diversity in the field, what kind of things were you thinking?
May 18, 2025 at 9:06 AM
Looks so cool! I’m insanely jealous
April 28, 2025 at 5:14 PM
I’m not a fan of musk, but imo there’s some really nice work here 🙂

Interested in the Washington post article, would you mind sharing a link?
April 23, 2025 at 6:01 AM
That’s an awesome paper 👍👍
April 14, 2025 at 5:29 PM