See you in Singapore at #ICLR2025!
Big thanks to my advisor
@fredsala.bsky.social for his guidance and to John for his contributions!
Paper: arxiv.org/abs/2412.03881
GitHub: github.com/SprocketLab/...
- Instead of just improving algorithms, focus on selecting the right data!
- Prioritizing high-overlap data sources gives us better generalization.
We frame data selection as a bandit problem, using UCB to balance exploration and exploitation across datasets. This identifies and prioritizes sources with high overlap density, which improves generalization.
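To make the bandit framing concrete, here is a minimal UCB1 sketch in Python. It is an illustration under stated assumptions, not the paper's algorithm: each data source is treated as an arm, and the reward for a pull is assumed to be 1 when a sampled point is judged to be an overlap point and 0 otherwise. The names `ucb_select_sources` and `make_source`, the Bernoulli reward model, and the exploration constant `c` are all hypothetical.

```python
import math
import random


def make_source(overlap_density):
    """Hypothetical data source: a pull returns 1.0 with probability
    `overlap_density` (an assumed stand-in for 'this sample is an overlap
    point') and 0.0 otherwise."""
    def pull(rng):
        return 1.0 if rng.random() < overlap_density else 0.0
    return pull


def ucb_select_sources(sources, rounds, c=2.0, seed=0):
    """Run UCB1 over `sources` for `rounds` pulls.

    Returns (counts, means): how often each source was sampled and the
    running mean reward observed for it.
    """
    rng = random.Random(seed)
    n = len(sources)
    counts = [0] * n
    means = [0.0] * n

    for t in range(1, rounds + 1):
        if t <= n:
            # Initialization: sample every source once.
            arm = t - 1
        else:
            # Pick the source with the highest upper confidence bound:
            # empirical mean plus an exploration bonus that shrinks as
            # the source is sampled more often.
            arm = max(
                range(n),
                key=lambda i: means[i] + math.sqrt(c * math.log(t) / counts[i]),
            )
        reward = sources[arm](rng)
        counts[arm] += 1
        # Incremental update of the running mean.
        means[arm] += (reward - means[arm]) / counts[arm]

    return counts, means


if __name__ == "__main__":
    # Three hypothetical sources with increasing overlap density.
    sources = [make_source(0.2), make_source(0.5), make_source(0.8)]
    counts, means = ucb_select_sources(sources, rounds=2000)
    print("pulls per source:", counts)
    print("estimated overlap density:", [round(m, 2) for m in means])
```

In this toy setup most of the sampling budget ends up on the source with the highest assumed overlap density, while the other sources are still sampled occasionally so their estimates stay current.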
- Weak models can produce accurate pseudolabels from easy patterns.
- Strong models leverage these labels to generalize on hard patterns.
- More overlap → better generalization