8/10
👉 Late in training: models become highly sensitive to parameter perturbations while the base model improves only slowly, so post-training performance degrades! 📉
7/10
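To see why degradation sets in only late in pre-training, here's a toy illustration (my own sketch, not the paper's model; the functional forms and constants are assumptions): if base loss keeps falling as a slow power law in tokens while sensitivity to a fixed perturbation grows, the post-perturbation loss is U-shaped and eventually rises.

```python
import numpy as np

tokens = np.logspace(8, 11, 200)       # pre-training tokens (toy scale)
base_loss = 3.0 * tokens ** -0.05      # base model keeps improving, but slowly
sensitivity = 1e-12 * tokens           # sensitivity to perturbation keeps growing

# Loss after a fixed-size perturbation (e.g., a fine-tuning update):
post_loss = base_loss + sensitivity
best = tokens[np.argmin(post_loss)]
print(f"post-perturbation loss bottoms out near {best:.1e} tokens, then rises")
```

Past the minimum, each extra pre-training token buys less base improvement than it costs in added sensitivity, which matches the "improves slowly / degrades" picture above.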
🔸 Later checkpoints: highly sensitive, leading to worse performance after perturbation! (Left plot: sensitivity increases over training; right plot: final performance eventually degrades.)
5/10
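A minimal sketch of this kind of sensitivity measurement in PyTorch (the `eval_loss` callable and the checkpoint paths are hypothetical placeholders, not the paper's code): add Gaussian noise to every weight and record how much the evaluation loss rises; later checkpoints would show a larger jump.

```python
import copy
import torch

def perturbation_sensitivity(model, eval_loss, sigma=0.01):
    """Loss increase after adding Gaussian noise (std sigma) to all weights."""
    base = eval_loss(model)
    noisy = copy.deepcopy(model)
    with torch.no_grad():
        for p in noisy.parameters():
            p.add_(sigma * torch.randn_like(p))
    return eval_loss(noisy) - base

# Compare checkpoints from different pre-training budgets (paths hypothetical):
# for path in ["ckpt_1e9_tokens.pt", "ckpt_1e10_tokens.pt"]:
#     model = torch.load(path)
#     print(path, perturbation_sensitivity(model, eval_loss))
```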
Why does extended pre-training hurt fine-tuning performance? 🤔
3/10
False! Scaling language models by adding more pre-training data can decrease your performance after post-training!
Introducing "catastrophic overtraining." 🥁🧵👇
arxiv.org/abs/2503.19206
1/10