albertge.bsky.social
@albertge.bsky.social
Our setup is not just for language domains but is universally applicable - we extend to multimodal tasks (e.g., CLIP), as well as long reasoning traces!
May 8, 2025 at 5:01 PM
When combining regrouping and reweighting strategies, we get the best of both worlds: we match or exceed the performance of existing data-mixing methods while requiring orders of magnitude less compute overhead to optimize domain weights - even with as many as 100 domains!
May 8, 2025 at 5:01 PM
Intuitively, we should upweight domains that best support our downstream tasks. Our approach: cluster train/eval data identically. Then, use training gradients to estimate domain alignments while accounting for eval data composition, and optimize weights accordingly.
May 8, 2025 at 5:01 PM
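To make the reweighting step concrete, here is a minimal sketch, not the paper's exact formulation: it assumes an average gradient per training domain is already available, and that train and eval data share the same clustering so the eval composition is a distribution over those same domains. The function name, the cosine normalization, and the softmax temperature are illustrative choices.

```python
import numpy as np

def reweight_domains(domain_grads, eval_props, temperature=1.0):
    """Illustrative reweighting step (a sketch, not the paper's exact method).

    domain_grads: (n_domains, d) average training gradient per domain
    eval_props:   (n_domains,)   fraction of eval data falling in each domain
    """
    # Cosine-normalize so alignment reflects gradient direction, not magnitude.
    g = domain_grads / (np.linalg.norm(domain_grads, axis=1, keepdims=True) + 1e-8)

    # Pairwise alignment between training domains, averaged with the eval
    # composition: domains whose gradients help eval-heavy domains score higher.
    alignment = g @ g.T              # (n_domains, n_domains)
    scores = alignment @ eval_props  # (n_domains,)

    # Softmax turns the scores into a sampling distribution over domains.
    exp = np.exp(scores / temperature)
    return exp / exp.sum()
```

In this sketch, higher weight goes to domains whose training gradients point in directions that also benefit the domains the eval set emphasizes.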
🔍How many groups are optimal? There is a "sweet spot" in data mixing! Model performance shows a U-shaped relationship with the number of clusters: too few or too many clusters hurt performance. Their geometry matters too: well-separated, compact clusters are better!
May 8, 2025 at 5:01 PM
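One common way to probe cluster geometry is a sweep over the number of clusters scored with the silhouette coefficient, which rewards compact, well-separated clusters. This is a sketch under that assumption: the silhouette score is a standard proxy for "good geometry," not necessarily the metric used in the paper, and the k values shown are arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def sweep_cluster_counts(embeddings, k_values=(4, 8, 16, 32, 64, 128), seed=0):
    """Cluster the data at several granularities and report a geometry score.
    Higher silhouette means more compact, better-separated clusters."""
    results = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(embeddings)
        results[k] = silhouette_score(embeddings, labels)
    return results

# Example: `embeddings` could be sentence-level features of the training set.
# scores = sweep_cluster_counts(np.random.randn(2000, 64))
```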
Paper: arxiv.org/abs/2505.00358

Take the Dolly-15k instruction set. Instead of human-defined categories, we repartition the data into semantic categories. Training on these newly-discovered domains results in better evaluation performance.
May 8, 2025 at 5:01 PM
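As an illustration of the regrouping step on Dolly-15k, here is a minimal sketch: embed each instruction and k-means cluster the embeddings into new semantic domains. The encoder choice, the value of k, and the use of scikit-learn k-means are assumptions for the example; the paper's clustering pipeline may differ.

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Load Dolly-15k and embed each instruction (encoder choice is illustrative).
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(dolly["instruction"], batch_size=256, show_progress_bar=True)

# Repartition into k semantic domains, ignoring the human-assigned "category" field.
k = 16  # illustrative; see the cluster-count discussion above
domains = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(embeddings)

# domains[i] is the new semantic domain of example i, usable as the
# grouping for downstream mixing and reweighting.
```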
Online data mixing reduces training costs for foundation models, but faces challenges:
⚠️ Human-defined domains miss semantic nuances
⚠️ Evaluation data is often inaccessible during training
⚠️ Domain-weight optimization scales poorly as the number of domains grows

Introducing 🎵R&B: first regroup data, then dynamically reweight domains during training!
May 8, 2025 at 5:01 PM
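For a concrete picture of the "reweight during training" half, here is a small sketch of sampling a training batch according to per-domain mixing weights. The function and argument names are illustrative, not from the paper's code.

```python
import numpy as np

def sample_batch_indices(domain_ids, domain_weights, batch_size, rng=None):
    """Draw a batch whose domain composition follows the current mixing weights.

    domain_ids:     (n_examples,) cluster id of each training example
    domain_weights: (n_domains,)  sampling probability per domain (sums to 1)
    """
    rng = rng or np.random.default_rng(0)
    # Probability of each example = weight of its domain / size of that domain,
    # so expected batch composition matches the domain weights.
    counts = np.bincount(domain_ids, minlength=len(domain_weights))
    per_example_p = domain_weights[domain_ids] / counts[domain_ids]
    per_example_p = per_example_p / per_example_p.sum()
    return rng.choice(len(domain_ids), size=batch_size, p=per_example_p)

# Usage: indices = sample_batch_indices(domains, weights, batch_size=64),
# where `weights` is refreshed periodically by the reweighting step above.
```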