albertge.bsky.social
@albertge.bsky.social
Our setup is not just for language domains but is universally applicable - we extend to multimodal tasks (e.g., CLIP), as well as long reasoning traces!
May 8, 2025 at 5:01 PM
When combining regrouping and reweighting strategies, we get the best of both worlds: we match or exceed the performance of existing data-mixing methods while requiring orders of magnitude less compute overhead to optimize domain weights - even with as many as 100 domains!
May 8, 2025 at 5:01 PM
Intuitively, we should upweight domains that best support our downstream tasks. Our approach: cluster train/eval data identically. Then, use training gradients to estimate domain alignments while accounting for eval data composition, and optimize weights accordingly.
May 8, 2025 at 5:01 PM
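To make the reweighting step concrete, here is a minimal sketch, not the paper's exact formulation: it assumes an average gradient per training domain is already available, and that train and eval data share the same clustering so the eval composition is a distribution over those same domains. The function name, the cosine normalization, and the softmax temperature are illustrative choices.

```python
import numpy as np

def reweight_domains(domain_grads, eval_props, temperature=1.0):
    """Illustrative reweighting step (a sketch, not the paper's exact method).

    domain_grads: (n_domains, d) average training gradient per domain
    eval_props:   (n_domains,)   fraction of eval data falling in each domain
    """
    # Cosine-normalize so alignment reflects gradient direction, not magnitude.
    g = domain_grads / (np.linalg.norm(domain_grads, axis=1, keepdims=True) + 1e-8)

    # Pairwise alignment between training domains, averaged with the eval
    # composition: domains whose gradients help eval-heavy domains score higher.
    alignment = g @ g.T              # (n_domains, n_domains)
    scores = alignment @ eval_props  # (n_domains,)

    # Softmax turns the scores into a sampling distribution over domains.
    exp = np.exp(scores / temperature)
    return exp / exp.sum()
```

In this sketch, higher weight goes to domains whose training gradients point in directions that also benefit the domains the eval set emphasizes.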
🔍How many groups are optimal? There is a "sweet spot" in data mixing! Model performance shows a U-shaped relationship with the number of clusters: too few or too many clusters hurt performance. Their geometry matters too: well-separated, compact clusters are better!
May 8, 2025 at 5:01 PM
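One common way to probe cluster geometry is a sweep over the number of clusters scored with the silhouette coefficient, which rewards compact, well-separated clusters. This is a sketch under that assumption: the silhouette score is a standard proxy for "good geometry," not necessarily the metric used in the paper, and the k values shown are arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def sweep_cluster_counts(embeddings, k_values=(4, 8, 16, 32, 64, 128), seed=0):
    """Cluster the data at several granularities and report a geometry score.
    Higher silhouette means more compact, better-separated clusters."""
    results = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(embeddings)
        results[k] = silhouette_score(embeddings, labels)
    return results

# Example: `embeddings` could be sentence-level features of the training set.
# scores = sweep_cluster_counts(np.random.randn(2000, 64))
```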
Paper: arxiv.org/abs/2505.00358

Take the Dolly-15k instruction set. Instead of human-defined categories, we repartition the data into semantic categories. Training on these newly-discovered domains results in better evaluation performance.
May 8, 2025 at 5:01 PM
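As an illustration of the regrouping step on Dolly-15k, here is a minimal sketch: embed each instruction and k-means cluster the embeddings into new semantic domains. The encoder choice, the value of k, and the use of scikit-learn k-means are assumptions for the example; the paper's clustering pipeline may differ.

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Load Dolly-15k and embed each instruction (encoder choice is illustrative).
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(dolly["instruction"], batch_size=256, show_progress_bar=True)

# Repartition into k semantic domains, ignoring the human-assigned "category" field.
k = 16  # illustrative; see the cluster-count discussion above
domains = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(embeddings)

# domains[i] is the new semantic domain of example i, usable as the
# grouping for downstream mixing and reweighting.
```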
Online data mixing reduces training costs for foundation models, but faces challenges:
⚠️ Human-defined domains miss semantic nuances
⚠️ Evaluation data is often inaccessible during training
⚠️ Domain-weight optimization scales poorly as the number of domains grows

Introducing 🎵R&B: first regroup data, then dynamically reweight domains during training!
May 8, 2025 at 5:01 PM
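For a concrete picture of the "reweight during training" half, here is a small sketch of sampling a training batch according to per-domain mixing weights. The function and argument names are illustrative, not from the paper's code.

```python
import numpy as np

def sample_batch_indices(domain_ids, domain_weights, batch_size, rng=None):
    """Draw a batch whose domain composition follows the current mixing weights.

    domain_ids:     (n_examples,) cluster id of each training example
    domain_weights: (n_domains,)  sampling probability per domain (sums to 1)
    """
    rng = rng or np.random.default_rng(0)
    # Probability of each example = weight of its domain / size of that domain,
    # so expected batch composition matches the domain weights.
    counts = np.bincount(domain_ids, minlength=len(domain_weights))
    per_example_p = domain_weights[domain_ids] / counts[domain_ids]
    per_example_p = per_example_p / per_example_p.sum()
    return rng.choice(len(domain_ids), size=batch_size, p=per_example_p)

# Usage: indices = sample_batch_indices(domains, weights, batch_size=64),
# where `weights` is refreshed periodically by the reweighting step above.
```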