Arna Ghosh
@arnaghosh.bsky.social
PhD student at Mila & McGill University, Vanier scholar • 🧠+🤖 grad student • Ex-RealityLabs, Meta AI • Believer in Bio-inspired AI • Comedy+Cricket enthusiast
I got you 😉
November 8, 2025 at 8:14 AM
Indeed! We show in the paper that the DPO objective is analogous to the contrastive learning objectives used for self-supervised vision pretraining, which are entropy-seeking in nature (as shown in previous work).

I feel spectral metrics can go a long way in unlocking LLM understanding+design. 🚀
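A rough sketch of the analogy (my own illustration, not the paper's code; function and tensor names are hypothetical): with a single negative, InfoNCE reduces to a logistic loss on a similarity margin, which is the same functional form as DPO on implicit rewards.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # DPO: logistic loss on the implicit-reward margin between chosen (w) and rejected (l) responses
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

def infonce_loss(sim_pos, sim_neg, temperature=0.1):
    # InfoNCE: cross-entropy that pushes the positive similarity above the negatives
    logits = torch.cat([sim_pos.unsqueeze(1), sim_neg], dim=1) / temperature
    targets = torch.zeros(logits.size(0), dtype=torch.long)
    return F.cross_entropy(logits, targets)

# With one negative per sample, infonce_loss == -log sigmoid((sim_pos - sim_neg) / T):
# the same form as dpo_loss, with implicit rewards playing the role of similarities.
```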
November 3, 2025 at 1:51 AM
Takeaway: LLM training exhibits multi-phasic information geometry changes! ✨

- Pretraining: Compress → Expand (Memorize) → Compress (Generalize).

- Post-training: SFT/DPO → Expand; RLVR → Consolidate.

Representation geometry offers insights into when models memorize vs. generalize! 🤓

🧵8/9
October 31, 2025 at 4:19 PM
Why do these geometric phases arise?🤔

We show, both through theory and with simulations in a toy model, that these non-monotonic spectral changes occur due to gradient descent dynamics with cross-entropy loss under 2 conditions:

1. skewed token frequencies
2. representation bottlenecks

🧵6/9
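A minimal toy sketch of this kind of setup (my own illustration, not the paper's simulation): a linear bottleneck model trained with cross-entropy on Zipf-distributed "tokens", while tracking the effective rank of its hidden representations.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
V, d_in, d_bneck = 100, 64, 8                # vocab size, input dim, representation bottleneck
freqs = 1.0 / np.arange(1, V + 1)            # condition 1: skewed (Zipf-like) token frequencies
freqs /= freqs.sum()

E = torch.randn(V, d_in)                     # fixed random context features per token
model = nn.Sequential(nn.Linear(d_in, d_bneck), nn.Linear(d_bneck, V))  # condition 2: bottleneck
opt = torch.optim.SGD(model.parameters(), lr=0.1)

def effective_rank(h):
    # RankMe-style effective rank: exp(entropy of the L1-normalized singular values)
    s = torch.linalg.svdvals(h - h.mean(0))
    p = s / s.sum() + 1e-12
    return torch.exp(-(p * p.log()).sum()).item()

for step in range(5001):
    y = torch.from_numpy(np.random.choice(V, size=256, p=freqs))  # sample tokens by frequency
    x = E[y] + 0.1 * torch.randn(256, d_in)
    h = model[0](x)                                               # bottleneck representation
    loss = F.cross_entropy(model[1](h), y)
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 1000 == 0:
        print(f"step {step:5d}  loss {loss.item():.3f}  eff. rank {effective_rank(h.detach()):.2f}")
```

Tracking the spectrum of `h` over training in a setup like this is where the non-monotonic compress → expand → compress behavior would show up.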
October 31, 2025 at 4:19 PM
Post-training also yields distinct geometric signatures:

- SFT & DPO exhibit entropy-seeking expansion, favoring instruction memorization but reducing OOD robustness.📈

- RLVR exhibits compression-seeking consolidation, learning reward-aligned behaviors at the cost of reduced exploration.📉

🧵5/9
October 31, 2025 at 4:19 PM
How do these phases relate to LLM behavior?

- Entropy-seeking: Correlates with short-sequence memorization (♾️-gram alignment).

- Compression-seeking: Correlates with dramatic gains in long-context factual reasoning, e.g. TriviaQA.

Curious about ♾️-grams?
See: bsky.app/profile/liuj...
🧵4/9
October 31, 2025 at 4:19 PM
LLMs have 3 pretraining phases:

Warmup: Rapid compression, collapsing representation to dominant directions.

Entropy-seeking: Manifold expansion, adding info in non-dominant directions.📈

Compression-seeking: Anisotropic consolidation, selectively packing more info in dominant directions.📉

🧵3/9
October 31, 2025 at 4:19 PM
📐We measured representation complexity using the #eigenspectrum of the final layer representations. We used 2 spectral metrics:

- Spectral Decay Rate, αReQ: how steeply the eigenspectrum decays; lower αReQ ⇒ more variance in non-dominant directions.

- RankMe: Effective Rank; #dims truly active.

⬇️αReQ ⇒ ⬆️RankMe ⇒ More complex!

🧵1/9
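Both metrics come from the spectrum of the (centered) representation matrix. A minimal sketch of how one might compute them (my own implementation, not the paper's code; here αReQ is the fitted power-law decay exponent of the covariance eigenvalues):

```python
import numpy as np

def spectral_metrics(reps):
    # reps: (n_samples, d) matrix of final-layer representations
    X = reps - reps.mean(axis=0)
    svals = np.linalg.svd(X, compute_uv=False)
    eigvals = svals**2 / (len(X) - 1)                 # eigenvalues of the feature covariance
    eigvals = eigvals[eigvals > 1e-12]

    # alpha-ReQ: decay exponent alpha in eigval_k ∝ k^(-alpha) (slope of a log-log fit)
    ranks = np.arange(1, len(eigvals) + 1)
    alpha = -np.polyfit(np.log(ranks), np.log(eigvals), deg=1)[0]

    # RankMe: effective rank = exp(entropy of the L1-normalized singular values)
    p = svals / svals.sum() + 1e-12
    rankme = np.exp(-(p * np.log(p)).sum())
    return alpha, rankme

alpha, rankme = spectral_metrics(np.random.randn(2048, 768))  # e.g. hidden states for a batch
print(f"alpha-ReQ ≈ {alpha:.2f}, RankMe ≈ {rankme:.1f}")
```

A flatter spectrum (lower αReQ) spreads variance across more directions, which is exactly what pushes RankMe up.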
October 31, 2025 at 4:19 PM
LLMs are trained to compress data by mapping sequences to high-dim representations!
How does the complexity of this mapping change across LLM training? How does it relate to the model’s capabilities? 🤔
Announcing our #NeurIPS2025 📄 that dives into this.

🧵below
#AIResearch #MachineLearning #LLM
October 31, 2025 at 4:19 PM
Are you training self-supervised/foundation models, and wondering whether they are learning good representations? We've got you covered! 💪
🦖Introducing Reptrix, a #Python library to evaluate representation quality metrics for neural nets: github.com/BARL-SSL/rep...
🧵👇[1/6]
#DeepLearning
April 1, 2025 at 6:24 PM
Just over a week since I defended my 🤖+🧠PhD thesis, and the feeling is just sinking in. Extremely grateful to
@tyrellturing.bsky.social for supporting me through this amazing journey! 🙏
Big thanks to all members of the LiNC lab, and colleagues at McGill University and @mila-quebec.bsky.social. ❤️😁
February 10, 2025 at 12:36 PM
🧵 7/8 Combining smaller projectors, more augmentations, and optimal orthogonalization advances the compute-performance Pareto frontier in SSL: 50% less compute for the same downstream performance, or better accuracy at a fixed compute budget. 📈
December 13, 2024 at 3:44 AM
🧵 6/8 Multiple augmentations lead to better approximation of the underlying data similarity kernel. More views → better kernel estimation → improved convergence and sample efficiency, matching larger datasets in low-data settings.
4 views 50% data 🤝 2 views 100% data. 📊
December 13, 2024 at 3:44 AM
🧵 5/8 Surprising result: Smaller projector dimensions (reducing projector params by up to 100x) + stronger orthogonalization can match the performance of high-dim projectors! We can get the same representation quality while using less compute 🔥.
December 13, 2024 at 3:44 AM
🧵 4/8 First key finding: Gradient descent's implicit bias reveals a sweet spot in feature learning. Too little orthogonalization → feature collapse. Too much → unstable learning dynamics. We characterized this trade-off for harnessing the value of small projectors 🎯
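One concrete way to read "orthogonalization strength" (my own illustration, not necessarily the paper's formulation): a Barlow-Twins/VICReg-style penalty on the off-diagonal feature covariance, whose weight `lam` is the knob being swept.

```python
import torch

def orthogonalization_penalty(z, lam=1.0):
    # Penalize off-diagonal entries of the standardized feature covariance.
    # lam ≈ 0: little decorrelation pressure -> features can collapse onto a few directions.
    # lam too large: the penalty overwhelms similarity matching -> unstable learning dynamics.
    z = (z - z.mean(0)) / (z.std(0) + 1e-6)          # standardize projector outputs
    cov = (z.T @ z) / (len(z) - 1)                   # (d, d) empirical covariance
    off_diag = cov - torch.diag(torch.diag(cov))
    return lam * (off_diag ** 2).sum() / z.shape[1]

# total_loss = similarity_matching_loss(...) + orthogonalization_penalty(z, lam)
```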
December 13, 2024 at 3:44 AM
🧵2/8 While SSL algorithms like SimCLR/VICReg use different losses, they optimize the same objective: match the similarity structure given by augmentations. Most are limited to 2 views & need massive compute. We derive an equivalent but computationally efficient formulation of this loss. 🔍
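I won't reproduce the paper's exact derivation here, but the shared objective can be read as kernel alignment: make the representation similarity matrix match the augmentation-defined target (views of the same image are "similar", everything else is not). A generic multi-view sketch of that idea:

```python
import torch
import torch.nn.functional as F

def similarity_matching_loss(z, n_views):
    # z: (n_views * batch, d) projector outputs, with all views of an image stored contiguously.
    z = F.normalize(z, dim=1)
    sim = z @ z.T                                            # representation similarity kernel
    batch = z.shape[0] // n_views
    img_ids = torch.arange(batch, device=z.device).repeat_interleave(n_views)
    target = (img_ids[:, None] == img_ids[None, :]).float()  # 1 iff two rows share an image
    return ((sim - target) ** 2).mean()                      # align the two similarity structures
```

Written this way the loss handles any number of views in one pass, which is what the multi-augmentation results later in the thread build on (more views → a better estimate of the target kernel).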
December 13, 2024 at 3:44 AM
The problem with current SSL? It's hungry. Very hungry. 🤖

Training time: Weeks
Dataset size: Millions of images
Compute costs: 💸💸💸

Our #NeurIPS2024 poster makes SSL pipelines 2x faster and achieves similar accuracy at 50% pretraining cost! 💪🏼✨
🧵 1/8
December 13, 2024 at 3:44 AM