albertge.bsky.social
@albertge.bsky.social
Thanks to my intrepid collaborators @zihengh1.bsky.social @jfrcooper2 @chu_ziyi18870 @srinath_namburi @jackcai1206 @kendallpark @nick11roberts.bsky.social @fredsala.bsky.social. Special thanks to @MayeeChen and members of @SprocketLab for feedback and discussion! @uwcdis @WisconsinCS
May 8, 2025 at 5:01 PM
Zooming out, it’s been very encouraging to see the recent interest in clustering-based approaches to training data. Highlighting some recent works (@shizediao, CLIMB), (@wettig, OrganizeTheWeb), (@Olivia61368522, DoGE/DGA) in this space!
May 8, 2025 at 5:01 PM
Check out our paper for more technical details - we've got more theoretical and empirical nuggets on how our method works: arxiv.org/abs/2505.00358. Code + datasets will be released soon! If you found this interesting, feel free to spread the word!
R&B: Domain Regrouping and Data Mixture Balancing for Efficient Foundation Model Training
Data mixing strategies have successfully reduced the costs involved in training language models. While promising, such methods suffer from two flaws. First, they rely on predetermined data domains (e....
arxiv.org
May 8, 2025 at 5:01 PM
Our setup isn't limited to language domains - it applies more broadly. We extend it to multimodal tasks (e.g., CLIP) as well as long reasoning traces!
May 8, 2025 at 5:01 PM
When we combine regrouping and reweighting, we get the best of both worlds: we match or exceed the performance of existing data mixing strategies while spending orders of magnitude less compute overhead on optimizing domain weights - even with as many as 100 domains!
May 8, 2025 at 5:01 PM
Intuitively, we should upweight domains that best support our downstream tasks. Our approach: cluster the training and evaluation data identically, then use training gradients to estimate how well each domain aligns with the evaluation mix, and optimize the domain weights accordingly.
May 8, 2025 at 5:01 PM
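To make that concrete, here's a minimal sketch of the gradient-alignment idea, assuming flattened per-domain gradients are already available. Names and the softmax normalization are illustrative choices, not the paper's exact update rule.

```python
import numpy as np

def estimate_domain_weights(train_grads, eval_grads, eval_proportions, temperature=1.0):
    """Score each training domain by how well its gradient aligns with the
    evaluation mix, then normalize the scores into sampling weights.
    NOTE: illustrative only - not the exact R&B update rule.

    train_grads:      (k_train, d) array, one flattened gradient per training domain
    eval_grads:       (k_eval, d) array, one flattened gradient per evaluation domain
    eval_proportions: (k_eval,) array, composition of the evaluation data
    """
    # Aggregate the eval-domain gradients according to the eval data composition.
    target = eval_proportions @ eval_grads                      # shape (d,)

    # Cosine similarity between each training-domain gradient and the target.
    sims = train_grads @ target
    sims /= np.linalg.norm(train_grads, axis=1) * np.linalg.norm(target) + 1e-8

    # Softmax (with temperature) to turn scores into a distribution over domains.
    logits = sims / temperature
    logits -= logits.max()
    weights = np.exp(logits)
    return weights / weights.sum()
```

Sampling the next training batches proportionally to these weights focuses compute on the domains that look most useful for the evaluation mix.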
Still, optimizing data mixtures typically requires expensive evaluation passes. Our efficiency hack is to use domain gradients collected during training for two purposes: training the model AND estimating optimal proportions!
May 8, 2025 at 5:01 PM
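Here's a rough sketch of how that reuse could look in a PyTorch-style loop; `sample_batch`, `loss_fn`, and `reweight_fn` are hypothetical hooks the caller supplies, not our released implementation.

```python
import torch

def train_with_gradient_reuse(model, optimizer, loss_fn, sample_batch, reweight_fn,
                              num_domains, num_steps=10_000, reweight_every=100):
    """One training loop in which the per-domain gradients that drive the optimizer
    are also cached and reused to re-estimate the data mixture, so no separate
    evaluation passes are needed. All callables are user-supplied stand-ins."""
    weights = torch.full((num_domains,), 1.0 / num_domains)
    grad_cache = [None] * num_domains  # running per-domain gradient estimates

    for step in range(1, num_steps + 1):
        # Sample a domain according to the current mixture weights.
        domain = torch.multinomial(weights, 1).item()
        inputs, targets = sample_batch(domain)   # hypothetical data-loading hook

        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()

        # Reuse the gradient we just computed: flatten and cache it per domain.
        flat = torch.cat([p.grad.detach().flatten()
                          for p in model.parameters() if p.grad is not None])
        grad_cache[domain] = flat if grad_cache[domain] is None else \
            0.9 * grad_cache[domain] + 0.1 * flat

        optimizer.step()  # the same gradient also updates the model, as usual

        if step % reweight_every == 0 and all(g is not None for g in grad_cache):
            # reweight_fn maps cached per-domain gradients -> new mixture weights
            # (e.g., the alignment-based estimate sketched earlier in the thread).
            weights = torch.as_tensor(reweight_fn(grad_cache), dtype=torch.float)
```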
🔍How many groups are optimal? There is a "sweet spot" in data mixing! Model performance shows a U-shaped relationship with the number of clusters—too few or too many hurt performance. Their geometry matters too: well-separated, compact clusters are better!
May 8, 2025 at 5:01 PM
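One generic way to probe that sweet spot (not necessarily the exact criterion we use) is to sweep the cluster count over example embeddings and track a separation/compactness score such as scikit-learn's silhouette score:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def sweep_cluster_counts(embeddings, candidate_ks=(4, 8, 16, 32, 64, 128), seed=0):
    """For each candidate number of clusters, fit k-means on the example embeddings
    and record a separation/compactness score. Higher silhouette means clusters
    are better separated and more compact."""
    results = {}
    for k in candidate_ks:
        labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(embeddings)
        results[k] = silhouette_score(embeddings, labels)
    return results

# Example with synthetic embeddings; in practice these come from a sentence encoder.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_embeddings = rng.normal(size=(2000, 64))
    print(sweep_cluster_counts(fake_embeddings, candidate_ks=(4, 8, 16)))
```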
Paper: arxiv.org/abs/2505.00358

Take the Dolly-15k instruction set. Instead of using its human-defined categories, we repartition the data into semantic categories. Training on these newly discovered domains results in better evaluation performance.
May 8, 2025 at 5:01 PM
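A rough sketch of what such semantic regrouping can look like, assuming Hugging Face `datasets`, `sentence-transformers`, and k-means; the encoder and cluster count are illustrative choices, not our exact recipe.

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def regroup_dolly(num_domains=8, seed=0):
    """Illustrative semantic regrouping: ignore Dolly-15k's human-written
    'category' field and partition examples by clustering instruction embeddings."""
    ds = load_dataset("databricks/databricks-dolly-15k", split="train")
    texts = ds["instruction"]

    # Any sentence encoder works; MiniLM is just a small, convenient choice.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = encoder.encode(texts, batch_size=256, show_progress_bar=True)

    # K-means over the embeddings defines the new, semantically discovered domains.
    labels = KMeans(n_clusters=num_domains, random_state=seed, n_init=10).fit_predict(embeddings)
    return ds.add_column("semantic_domain", labels.tolist())
```

The returned dataset carries a `semantic_domain` column that can replace the original categories when grouping data for mixture optimization.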