Jacob Springer
@jacobspringer.bsky.social
Machine Learning (the science part) | PhD student @ CMU
The paper has many more interesting details that have entirely changed the way I think about pre-training!

And thanks to my collaborators!
Sachin Goyal
Kaiyue Wen
Tanishq Kumar
@xiangyue96.bsky.social
@sadhika.bsky.social
@gneubig.bsky.social
@adtraghunathan.bsky.social

10/10
Overtrained Language Models Are Harder to Fine-Tune
Large language models are pre-trained on ever-growing token budgets under the assumption that better pre-training performance translates to improved downstream models. In this work, we challenge this ...
arxiv.org
March 26, 2025 at 6:35 PM
For the theorists in the room: we dive deeper into why this happens using a linear transfer learning setup, revealing that incremental learning leads to catastrophic overtraining.
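If you want to play with the flavor of that result, here is a toy deep-linear sketch in numpy. It is my own illustration, not the construction from the paper; the dimensions, singular values, and noise scale are all made up. With a small initialization, gradient descent learns the target's singular directions roughly largest-first, and the noise scale is chosen so that learning the smallest direction should cost more under a fixed Gaussian weight perturbation (a crude stand-in for fine-tuning updates) than it gains, so the perturbed loss typically bottoms out partway through and then creeps back up while the clean loss keeps falling:

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, k = 30, 8, 30                            # input, hidden, output dims
U = np.linalg.qr(rng.normal(size=(k, 3)))[0]
V = np.linalg.qr(rng.normal(size=(d, 3)))[0]
W_star = U @ np.diag([8.0, 5.0, 2.0]) @ V.T    # rank-3 target, decaying singular values

# Two-layer linear net W2 @ W1 from a small init: gradient descent picks up
# the target's singular directions roughly one at a time, largest first.
W1 = 1e-3 * rng.normal(size=(h, d))
W2 = 1e-3 * rng.normal(size=(k, h))
lr, sigma = 0.01, 0.25

def loss(A, B):
    return 0.5 * np.sum((B @ A - W_star) ** 2)

def perturbed_loss(A, B, n_draws=200):
    # Mean loss after Gaussian noise (std sigma) is added to both factors,
    # used here as a crude stand-in for the updates made later by fine-tuning.
    return np.mean([loss(A + sigma * rng.normal(size=A.shape),
                         B + sigma * rng.normal(size=B.shape))
                    for _ in range(n_draws)])

for step in range(2001):
    R = W2 @ W1 - W_star                       # residual of the product
    W1, W2 = W1 - lr * (W2.T @ R), W2 - lr * (R @ W1.T)
    if step % 200 == 0:
        print(f"step {step:4d}  clean {loss(W1, W2):8.3f}  "
              f"perturbed {perturbed_loss(W1, W2):8.3f}")
```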

9/10
Fine-tuning behaves similarly: using a fixed learning rate across different pre-training checkpoints, we see eventual degradation in both task performance and web-data perplexity. This often holds even after hyperparameter tuning. Overtraining = worse fine-tuning outcomes!
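Concretely, the protocol is just a sketch like this (PyTorch; it assumes an HF-style model whose forward returns a .loss, and the checkpoint list, data loader, and metric helpers in the comments are hypothetical placeholders, not the exact code from the paper):

```python
import torch
from torch.optim import AdamW

def finetune(model, train_loader, lr=2e-5, epochs=1, device="cuda"):
    """Fine-tune a pre-trained checkpoint with one fixed learning rate
    (no per-checkpoint hyperparameter search)."""
    model.to(device).train()
    opt = AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in train_loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = model(**batch).loss    # HF-style causal LM loss
            loss.backward()
            opt.step()
            opt.zero_grad()
    return model

# Hypothetical sweep: same recipe for every intermediate checkpoint, then
# measure both downstream accuracy and perplexity on held-out web data.
# for tokens_seen, ckpt in [("1.5T", ckpt_a), ("2.3T", ckpt_b), ("3T", ckpt_c)]:
#     model = finetune(load_checkpoint(ckpt), instruction_loader, lr=2e-5)
#     results[tokens_seen] = (task_accuracy(model), web_perplexity(model))
```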

8/10
👉 Early in training: Models have low sensitivity & the base model improves quickly, so fine-tuned performance improves 📈
👉 Late in training: Models become highly sensitive & the base model improves only slowly, so fine-tuned performance degrades! 📉

7/10
What's happening? Beyond Gaussian perturbations, extended pre-training increases model sensitivity to all types of parameter updates 👇

6/10
🔹 Early checkpoints: Robust to parameter changes.
🔸 Later checkpoints: Highly sensitive, leading to worse performance after perturbation! (Left plot: sensitivity increases over training; right plot: final performance eventually degrades.)

5/10
Let’s step back and consider a simpler setting: we train our own 30M-parameter models and test how adding Gaussian noise to the parameters affects performance at different pre-training stages 👇
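Roughly, the probe looks like this (a minimal PyTorch sketch; the noise scale, checkpoint paths, and eval_fn are placeholders of mine, not the exact protocol from the paper):

```python
import torch

@torch.no_grad()
def perturbed_eval(model, eval_fn, sigma=0.01, seed=0):
    """Add i.i.d. Gaussian noise (std sigma) to every parameter, evaluate,
    then restore the original weights."""
    gen = torch.Generator().manual_seed(seed)
    originals = {name: p.detach().clone() for name, p in model.named_parameters()}
    for _, p in model.named_parameters():
        noise = torch.randn(p.shape, generator=gen) * sigma
        p.add_(noise.to(p.device, p.dtype))
    score = eval_fn(model)               # e.g. perplexity on held-out web data
    for name, p in model.named_parameters():
        p.copy_(originals[name])         # restore the clean checkpoint
    return score

# Hypothetical sweep over intermediate checkpoints (paths and eval_fn are yours):
# for step, path in [(10_000, "ckpt_10k.pt"), (100_000, "ckpt_100k.pt")]:
#     model.load_state_dict(torch.load(path))
#     print(step, eval_fn(model), perturbed_eval(model, eval_fn, sigma=0.01))
```

Comparing the clean and perturbed scores across checkpoints traces out a simple sensitivity curve over pre-training.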

4/10
Example: OLMo-1B trained on 3T tokens performs over 2% *worse* after instruction tuning than its 2.3T-token version, even though it saw 30% more data! We observe similar degradation in many other post-training setups.

Why does extended pre-training hurt fine-tuning performance? 🤔

3/10
The latest language models are pre-trained on more and more tokens while holding the number of model parameters fixed—and this trend isn't slowing down!
➡️ Better base models? Yes.
➡️ Better starting point for post-training? Let’s check!

2/10