And thanks to my collaborators!
Sachin Goyal
Kaiyue Wen
Tanishq Kumar
@xiangyue96.bsky.social
@sadhika.bsky.social
@gneubig.bsky.social
@adtraghunathan.bsky.social
10/10
9/10
8/10
👉 Late in training: Models become highly sensitive & the base model improves slowly; performance degrades! 📉
7/10
6/10
🔸 Later checkpoints: Highly sensitive, leading to worse performance after perturbation! (Left plot: sensitivity increases over training, Right plot: final performance eventually degrades.)
5/10
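The sensitivity measurement in the post above can be sketched as: perturb a checkpoint's parameters with Gaussian noise and record how much the loss degrades. A minimal toy illustration, assuming a simple least-squares model (the function name `perturbation_sensitivity` and all details here are hypothetical, not from the paper):

```python
import numpy as np

def perturbation_sensitivity(weights, loss_fn, sigma=0.1, n_trials=20, seed=0):
    """Average loss increase after adding Gaussian noise to the weights.

    Hypothetical proxy for checkpoint sensitivity: larger values mean the
    model's performance degrades more under the same parameter perturbation.
    """
    rng = np.random.default_rng(seed)
    base = loss_fn(weights)
    deltas = []
    for _ in range(n_trials):
        noise = rng.normal(0.0, sigma, size=weights.shape)
        deltas.append(loss_fn(weights + noise) - base)
    return float(np.mean(deltas))

# Toy setup: least-squares loss on synthetic data, evaluated at the optimum.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
y = X @ w_true
loss = lambda w: float(np.mean((X @ w - y) ** 2))

print(perturbation_sensitivity(w_true, loss, sigma=0.1))
```

In the paper's setting one would apply this across pre-training checkpoints; the left plot's claim is that the measured sensitivity grows as pre-training continues.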
4/10
Why does extended pre-training hurt fine-tuning performance? 🤔
3/10
➡️ Better base models? Yes.
➡️ Better starting point for post-training? Let’s check!
2/10