albertge.bsky.social
@albertge.bsky.social
Thanks to my intrepid collaborators @zihengh1.bsky.social @jfrcooper2 @chu_ziyi18870 @srinath_namburi @jackcai1206 @kendallpark @nick11roberts.bsky.social @fredsala.bsky.social. Special thanks to @MayeeChen and members of @SprocketLab for feedback and discussion! @uwcdis @WisconsinCS
May 8, 2025 at 5:01 PM
Zooming out, it’s been very encouraging to see the recent interest in clustering-based approaches to training data. Highlighting some recent works (@shizediao, CLIMB), (@wettig, OrganizeTheWeb), (@Olivia61368522, DoGE/DGA) in this space!
May 8, 2025 at 5:01 PM
Check out our paper for more technical details - we've got more theoretical and empirical nuggets on how our method works: arxiv.org/abs/2505.00358. Code + datasets will be released soon! If you found this interesting, feel free to spread the word!
R&B: Domain Regrouping and Data Mixture Balancing for Efficient Foundation Model Training
Data mixing strategies have successfully reduced the costs involved in training language models. While promising, such methods suffer from two flaws. First, they rely on predetermined data domains (e....
arxiv.org
May 8, 2025 at 5:01 PM
Our setup isn't limited to language domains - it applies more broadly. We extend it to multimodal tasks (e.g., CLIP) as well as long reasoning traces!
May 8, 2025 at 5:01 PM
When we combine regrouping and reweighting, we get the best of both worlds: we match or exceed the performance of existing data mixing strategies while spending orders of magnitude less compute overhead on optimizing domain weights - even with as many as 100 domains!
May 8, 2025 at 5:01 PM
Intuitively, we should upweight domains that best support our downstream tasks. Our approach: cluster the training and evaluation data identically, then use training gradients to estimate how well each domain aligns with the evaluation mix, and optimize the domain weights accordingly.
May 8, 2025 at 5:01 PM
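To make that concrete, here's a minimal sketch of the gradient-alignment idea, assuming flattened per-domain gradients are already available. Names and the softmax normalization are illustrative choices, not the paper's exact update rule.

```python
import numpy as np

def estimate_domain_weights(train_grads, eval_grads, eval_proportions, temperature=1.0):
    """Score each training domain by how well its gradient aligns with the
    evaluation mix, then normalize the scores into sampling weights.
    NOTE: illustrative only - not the exact R&B update rule.

    train_grads:      (k_train, d) array, one flattened gradient per training domain
    eval_grads:       (k_eval, d) array, one flattened gradient per evaluation domain
    eval_proportions: (k_eval,) array, composition of the evaluation data
    """
    # Aggregate the eval-domain gradients according to the eval data composition.
    target = eval_proportions @ eval_grads                      # shape (d,)

    # Cosine similarity between each training-domain gradient and the target.
    sims = train_grads @ target
    sims /= np.linalg.norm(train_grads, axis=1) * np.linalg.norm(target) + 1e-8

    # Softmax (with temperature) to turn scores into a distribution over domains.
    logits = sims / temperature
    logits -= logits.max()
    weights = np.exp(logits)
    return weights / weights.sum()
```

Sampling the next training batches proportionally to these weights focuses compute on the domains that look most useful for the evaluation mix.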
Still, optimizing data mixtures typically requires expensive evaluation passes. Our efficiency hack is to use domain gradients collected during training for two purposes: training the model AND estimating optimal proportions!
May 8, 2025 at 5:01 PM
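Here's a rough sketch of how that reuse could look in a PyTorch-style loop; `sample_batch`, `loss_fn`, and `reweight_fn` are hypothetical hooks the caller supplies, not our released implementation.

```python
import torch

def train_with_gradient_reuse(model, optimizer, loss_fn, sample_batch, reweight_fn,
                              num_domains, num_steps=10_000, reweight_every=100):
    """One training loop in which the per-domain gradients that drive the optimizer
    are also cached and reused to re-estimate the data mixture, so no separate
    evaluation passes are needed. All callables are user-supplied stand-ins."""
    weights = torch.full((num_domains,), 1.0 / num_domains)
    grad_cache = [None] * num_domains  # running per-domain gradient estimates

    for step in range(1, num_steps + 1):
        # Sample a domain according to the current mixture weights.
        domain = torch.multinomial(weights, 1).item()
        inputs, targets = sample_batch(domain)   # hypothetical data-loading hook

        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()

        # Reuse the gradient we just computed: flatten and cache it per domain.
        flat = torch.cat([p.grad.detach().flatten()
                          for p in model.parameters() if p.grad is not None])
        grad_cache[domain] = flat if grad_cache[domain] is None else \
            0.9 * grad_cache[domain] + 0.1 * flat

        optimizer.step()  # the same gradient also updates the model, as usual

        if step % reweight_every == 0 and all(g is not None for g in grad_cache):
            # reweight_fn maps cached per-domain gradients -> new mixture weights
            # (e.g., the alignment-based estimate sketched earlier in the thread).
            weights = torch.as_tensor(reweight_fn(grad_cache), dtype=torch.float)
```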
🔍How many groups are optimal? There is a "sweet spot" in data mixing! Model performance shows a U-shaped relationship with the number of clusters—too few or too many hurt performance. Their geometry matters too: well-separated, compact clusters are better!
May 8, 2025 at 5:01 PM
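One generic way to probe that sweet spot (not necessarily the exact criterion we use) is to sweep the cluster count over example embeddings and track a separation/compactness score such as scikit-learn's silhouette score:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def sweep_cluster_counts(embeddings, candidate_ks=(4, 8, 16, 32, 64, 128), seed=0):
    """For each candidate number of clusters, fit k-means on the example embeddings
    and record a separation/compactness score. Higher silhouette means clusters
    are better separated and more compact."""
    results = {}
    for k in candidate_ks:
        labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(embeddings)
        results[k] = silhouette_score(embeddings, labels)
    return results

# Example with synthetic embeddings; in practice these come from a sentence encoder.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_embeddings = rng.normal(size=(2000, 64))
    print(sweep_cluster_counts(fake_embeddings, candidate_ks=(4, 8, 16)))
```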
Paper: arxiv.org/abs/2505.00358

Take the Dolly-15k instruction set. Instead of using its human-defined categories, we repartition the data into semantic categories. Training on these newly discovered domains results in better evaluation performance.
May 8, 2025 at 5:01 PM
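A rough sketch of what such semantic regrouping can look like, assuming Hugging Face `datasets`, `sentence-transformers`, and k-means; the encoder and cluster count are illustrative choices, not our exact recipe.

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def regroup_dolly(num_domains=8, seed=0):
    """Illustrative semantic regrouping: ignore Dolly-15k's human-written
    'category' field and partition examples by clustering instruction embeddings."""
    ds = load_dataset("databricks/databricks-dolly-15k", split="train")
    texts = ds["instruction"]

    # Any sentence encoder works; MiniLM is just a small, convenient choice.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = encoder.encode(texts, batch_size=256, show_progress_bar=True)

    # K-means over the embeddings defines the new, semantically discovered domains.
    labels = KMeans(n_clusters=num_domains, random_state=seed, n_init=10).fit_predict(embeddings)
    return ds.add_column("semantic_domain", labels.tolist())
```

The returned dataset carries a `semantic_domain` column that can replace the original categories when grouping data for mixture optimization.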