I feel spectral metrics can go a long way in unlocking LLM understanding+design. 🚀
@natolambert.bsky.social + the OLMo team!
Paper 📝: arxiv.org/abs/2509.23024
👩‍💻 Code: Coming soon! 👨‍💻
@melodylizx.bsky.social @kumarkagrawal.bsky.social Komal Teru @glajoie.bsky.social @adamsantoro.bsky.social @tyrellturing.bsky.social
at @mila-quebec.bsky.social @berkeleyair.bsky.social @cohere.com & @googleresearch.bsky.social!
🧵9/9
- Pretraining: Compress → Expand (Memorize) → Compress (Generalize).
- Post-training: SFT/DPO → Expand; RLVR → Consolidate.
Representation geometry offers insights into when models memorize vs. generalize! 🤓
🧵8/9
On SciQ:
- Removing the top 10/50 directions barely hurts accuracy.✅
- Retaining only the top 10/50 directions CRUSHES accuracy.📉
As our theoretical results predict, the eigenspectrum tail encodes critical task information! 🤯 (ablation sketch below)
🧵7/9
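A minimal sketch of this kind of ablation, under assumptions of my own (not the paper's exact pipeline): `H` holds final-layer representations on SciQ, and `probe_accuracy` is a hypothetical placeholder for whatever downstream readout you use.

```python
import numpy as np

def split_by_top_directions(H: np.ndarray, k: int):
    """Split representations into the top-k principal directions vs. the tail."""
    Hc = H - H.mean(axis=0, keepdims=True)            # center features
    _, _, Vt = np.linalg.svd(Hc, full_matrices=False)
    top = Vt[:k].T                                    # [d, k] dominant basis
    H_top_only = Hc @ top @ top.T                     # keep only the top-k directions
    H_tail_only = Hc - H_top_only                     # remove the top-k (keep the tail)
    return H_top_only, H_tail_only

# Hypothetical usage: H is [n_examples, d]; probe_accuracy is your readout.
# for k in (10, 50):
#     keep, drop = split_by_top_directions(H, k)
#     print(k, probe_accuracy(keep), probe_accuracy(drop))
```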
We show, through theory and simulations in a toy model, that these non-monotonic spectral changes arise from gradient-descent dynamics under cross-entropy loss given two conditions (toy sketch below):
1. skewed token frequencies
2. representation bottlenecks
🧵6/9
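A minimal toy sketch in that spirit (an illustrative setup of my own, not the paper's exact model): cross-entropy SGD on Zipf-distributed tokens through a narrow embedding bottleneck, tracking effective rank as training proceeds.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 200, 16                                   # vocab size, bottleneck width (cond. 2)
freq = 1.0 / np.arange(1, V + 1)                 # cond. 1: skewed (Zipfian) frequencies
freq /= freq.sum()
E = rng.normal(0, 0.1, (V, d))                   # token embeddings (the representation)
W = rng.normal(0, 0.1, (d, V))                   # readout / unembedding

def rankme(M):
    """Effective rank: exp(entropy) of normalized singular values."""
    s = np.linalg.svd(M, compute_uv=False)
    p = s / s.sum()
    return np.exp(-np.sum(p * np.log(p + 1e-12)))

lr, B = 0.5, 256
for step in range(2001):
    x = rng.choice(V, size=B, p=freq)            # tokens sampled by frequency
    h = E[x]                                     # [B, d] bottleneck representations
    logits = h @ W
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    probs[np.arange(B), x] -= 1.0                # dCE/dlogits (toy target = input token)
    grad_h = (probs @ W.T) / B                   # gradient w.r.t. representations
    W -= lr * (h.T @ probs) / B                  # SGD step on the readout
    np.add.at(E, x, -lr * grad_h)                # SGD step on the used embedding rows
    if step % 500 == 0:
        print(step, round(rankme(E), 2))         # watch effective rank move over training
```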
- SFT & DPO exhibit entropy-seeking expansion, favoring instruction memorization but reducing OOD robustness.📈
- RLVR exhibits compression-seeking consolidation, learning reward-aligned behaviors at the cost of reduced exploration.📉
🧵5/9
- Entropy-seeking: Correlates with short-sequence memorization (♾️-gram alignment).
- Compression-seeking: Correlates with dramatic gains in long-context factual reasoning, e.g. TriviaQA.
Curious about ♾️-grams?
See: bsky.app/profile/liuj...
🧵4/9
Warmup: Rapid compression, collapsing representations onto dominant directions.
Entropy-seeking: Manifold expansion, adding info in non-dominant directions.📈
Compression-seeking: Anisotropic consolidation, selectively packing more info in dominant directions.📉
🧵3/9
BUT
🎢 The spectral metrics (RankMe, αReQ) change non-monotonically as pretraining progresses!
Takeaway: We discover geometric phases of LLM learning!
🧵2/9
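For intuition, here is roughly how one could trace such a curve across public intermediate checkpoints. A sketch under assumptions, not the paper's pipeline: the revision tags and the tiny probe batch are placeholders; substitute real checkpoint revisions from the model repo.

```python
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

def rankme(H):
    """Effective rank of an [n, d] representation matrix."""
    s = np.linalg.svd(H - H.mean(axis=0, keepdims=True), compute_uv=False)
    p = s / s.sum()
    return np.exp(-np.sum(p * np.log(p + 1e-12)))

model_id = "allenai/OLMo-2-0425-1B"              # any model with intermediate revisions
tok = AutoTokenizer.from_pretrained(model_id)
texts = ["The capital of France is Paris."] * 8  # placeholder; use a real eval batch

for rev in ["step10000", "step100000", "main"]:  # placeholder revision tags
    model = AutoModel.from_pretrained(model_id, revision=rev)
    with torch.no_grad():
        out = model(**tok(texts, return_tensors="pt"), output_hidden_states=True)
    H = out.hidden_states[-1].flatten(0, 1).numpy()  # [tokens, d] final-layer states
    print(rev, round(rankme(H), 2))
```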
- Spectral Decay Rate, αReQ: how steeply the eigenspectrum falls off; lower αReQ ⇒ more variance in non-dominant directions.
- RankMe: Effective Rank; # of dims truly active.
⬇️αReQ ⇒ ⬆️RankMe ⇒ more complex representations!
🧵1/9
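In code, a minimal sketch of both metrics (my assumptions: `H` is an [n_samples, d] representation matrix; αReQ is estimated as the slope of a log-log fit to the covariance eigenspectrum, and RankMe as the exponentiated entropy of normalized singular values):

```python
import numpy as np

def spectral_metrics(H: np.ndarray):
    """Return (alpha_req, rankme) for an [n_samples, d] representation matrix H."""
    Hc = H - H.mean(axis=0, keepdims=True)           # center features
    s = np.linalg.svd(Hc, compute_uv=False)          # singular values
    eigs = s**2 / (H.shape[0] - 1)                   # covariance eigenvalues

    # alpha_ReQ: power-law exponent of the eigenspectrum, lambda_i ~ i^(-alpha);
    # a flatter spectrum (lower alpha) spreads variance into non-dominant directions.
    idx = np.arange(1, len(eigs) + 1)
    alpha_req = -np.polyfit(np.log(idx), np.log(eigs + 1e-12), deg=1)[0]

    # RankMe: exp(entropy) of normalized singular values -- an "effective rank"
    # counting how many directions carry meaningful variance.
    p = s / s.sum()
    rankme = np.exp(-np.sum(p * np.log(p + 1e-12)))
    return alpha_req, rankme

# Toy check: column scales ~ i^-1 give eigenvalues ~ i^-2, so alpha_req ≈ 2.
H = np.random.randn(4096, 512) @ np.diag(1.0 / np.arange(1, 513))
print(spectral_metrics(H))
```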