Jacob Springer
@jacobspringer.bsky.social
Machine Learning (the science part) | PhD student @ CMU
The paper has many more interesting details that have entirely changed the way I think about pre-training!

And thanks to my collaborators!
Sachin Goyal
Kaiyue Wen
Tanishq Kumar
@xiangyue96.bsky.social
@sadhika.bsky.social
@gneubig.bsky.social
@adtraghunathan.bsky.social

10/10
Overtrained Language Models Are Harder to Fine-Tune
Large language models are pre-trained on ever-growing token budgets under the assumption that better pre-training performance translates to improved downstream models. In this work, we challenge this ...
arxiv.org
March 26, 2025 at 6:35 PM
For the theorists in the room: we dive deeper into why this happens using a linear transfer learning setup, revealing that incremental learning leads to catastrophic overtraining.
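If you want to play with the flavor of that result, here is a toy deep-linear sketch in numpy. It is my own illustration, not the construction from the paper; the dimensions, singular values, and noise scale are all made up. With a small initialization, gradient descent learns the target's singular directions roughly largest-first, and the noise scale is chosen so that learning the smallest direction should cost more under a fixed Gaussian weight perturbation (a crude stand-in for fine-tuning updates) than it gains, so the perturbed loss typically bottoms out partway through and then creeps back up while the clean loss keeps falling:

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, k = 30, 8, 30                            # input, hidden, output dims
U = np.linalg.qr(rng.normal(size=(k, 3)))[0]
V = np.linalg.qr(rng.normal(size=(d, 3)))[0]
W_star = U @ np.diag([8.0, 5.0, 2.0]) @ V.T    # rank-3 target, decaying singular values

# Two-layer linear net W2 @ W1 from a small init: gradient descent picks up
# the target's singular directions roughly one at a time, largest first.
W1 = 1e-3 * rng.normal(size=(h, d))
W2 = 1e-3 * rng.normal(size=(k, h))
lr, sigma = 0.01, 0.25

def loss(A, B):
    return 0.5 * np.sum((B @ A - W_star) ** 2)

def perturbed_loss(A, B, n_draws=200):
    # Mean loss after Gaussian noise (std sigma) is added to both factors,
    # used here as a crude stand-in for the updates made later by fine-tuning.
    return np.mean([loss(A + sigma * rng.normal(size=A.shape),
                         B + sigma * rng.normal(size=B.shape))
                    for _ in range(n_draws)])

for step in range(2001):
    R = W2 @ W1 - W_star                       # residual of the product
    W1, W2 = W1 - lr * (W2.T @ R), W2 - lr * (R @ W1.T)
    if step % 200 == 0:
        print(f"step {step:4d}  clean {loss(W1, W2):8.3f}  "
              f"perturbed {perturbed_loss(W1, W2):8.3f}")
```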

9/10
Fine-tuning behaves similarly: using a fixed learning rate across different pre-training checkpoints, we see eventual degradation in both task performance and web-data perplexity. This often holds even after hyperparameter tuning. Overtraining = worse fine-tuning outcomes!
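Concretely, the protocol is just a sketch like this (PyTorch; it assumes an HF-style model whose forward returns a .loss, and the checkpoint list, data loader, and metric helpers in the comments are hypothetical placeholders, not the exact code from the paper):

```python
import torch
from torch.optim import AdamW

def finetune(model, train_loader, lr=2e-5, epochs=1, device="cuda"):
    """Fine-tune a pre-trained checkpoint with one fixed learning rate
    (no per-checkpoint hyperparameter search)."""
    model.to(device).train()
    opt = AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in train_loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = model(**batch).loss    # HF-style causal LM loss
            loss.backward()
            opt.step()
            opt.zero_grad()
    return model

# Hypothetical sweep: same recipe for every intermediate checkpoint, then
# measure both downstream accuracy and perplexity on held-out web data.
# for tokens_seen, ckpt in [("1.5T", ckpt_a), ("2.3T", ckpt_b), ("3T", ckpt_c)]:
#     model = finetune(load_checkpoint(ckpt), instruction_loader, lr=2e-5)
#     results[tokens_seen] = (task_accuracy(model), web_perplexity(model))
```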

8/10
👉 Early in training: Models have low sensitivity & the base model improves quickly, so fine-tuned performance improves 📈
👉 Late in training: Models become highly sensitive & the base model improves only slowly, so fine-tuned performance degrades! 📉

7/10
What's happening? Beyond Gaussian perturbations, extended pre-training increases model sensitivity to all types of parameter updates 👇

6/10
🔹 Early checkpoints: Robust to parameter changes.
🔸 Later checkpoints: Highly sensitive, leading to worse performance after perturbation! (Left plot: sensitivity increases over training; right plot: final performance eventually degrades.)

5/10
Let’s step back and consider a simpler setting: we train our own 30M-parameter models and test how adding Gaussian noise to the parameters affects performance at different pre-training stages 👇
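Roughly, the probe looks like this (a minimal PyTorch sketch; the noise scale, checkpoint paths, and eval_fn are placeholders of mine, not the exact protocol from the paper):

```python
import torch

@torch.no_grad()
def perturbed_eval(model, eval_fn, sigma=0.01, seed=0):
    """Add i.i.d. Gaussian noise (std sigma) to every parameter, evaluate,
    then restore the original weights."""
    gen = torch.Generator().manual_seed(seed)
    originals = {name: p.detach().clone() for name, p in model.named_parameters()}
    for _, p in model.named_parameters():
        noise = torch.randn(p.shape, generator=gen) * sigma
        p.add_(noise.to(p.device, p.dtype))
    score = eval_fn(model)               # e.g. perplexity on held-out web data
    for name, p in model.named_parameters():
        p.copy_(originals[name])         # restore the clean checkpoint
    return score

# Hypothetical sweep over intermediate checkpoints (paths and eval_fn are yours):
# for step, path in [(10_000, "ckpt_10k.pt"), (100_000, "ckpt_100k.pt")]:
#     model.load_state_dict(torch.load(path))
#     print(step, eval_fn(model), perturbed_eval(model, eval_fn, sigma=0.01))
```

Comparing the clean and perturbed scores across checkpoints traces out a simple sensitivity curve over pre-training.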

4/10
Example: OLMo-1B trained on 3T tokens performs over 2% *worse* after instruction tuning than its 2.3T-token version, even though it saw 30% more data! We observe similar degradation in many other post-training setups.

Why does extended pre-training hurt fine-tuning performance? 🤔

3/10
The latest language models are pre-trained on more and more tokens while holding the number of model parameters fixed—and this trend isn't slowing down!
➡️ Better base models? Yes.
➡️ Better starting point for post-training? Let’s check!

2/10