🌐 Website (feat. Domain Explorer): weborganizer.allen.ai
🤖 Models and Data: huggingface.co/WebOrganizer
💾 Code: github.com/CodeCreator...
w/amazing co-authors @kylelo.bsky.social @sewonm.bsky.social @hanna-nlp.bsky.social @danqi-chen.bsky.social @soldaini.net
💡 FineWeb-Edu, DCLM-fasttext, and our RegMix predictions share similarities (e.g. all upsample Science topics) but also diverge (e.g. DCLM is more balanced across topics)
✅ Domain mixing complements quality filtering by calibrating the training distribution!
And we can combine the topic and format predictions to curate data with even better performance! 📈
We leverage RegMix to study how the domains should be reweighted to benefit two downstream tasks commonly used as proxies for "data quality"
Prediction: Heavily upsample domains such as Science or Tutorials!
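To make "upsample" concrete, here is a minimal sketch of how a target domain mixture translates into per-domain sampling factors. The function name `upsampling_factors` and all the shares below are made up for illustration; they are not the RegMix predictions from the paper.

```python
import math


def upsampling_factors(corpus_share, target_share):
    """Return target/natural share per domain (>1 means upsample, <1 means downsample).

    Both arguments map domain name -> share of tokens and must each sum to 1.
    """
    for name, shares in (("corpus", corpus_share), ("target", target_share)):
        if not math.isclose(sum(shares.values()), 1.0, abs_tol=1e-9):
            raise ValueError(f"{name} shares must sum to 1")
    return {d: target_share[d] / corpus_share[d] for d in target_share}


# Hypothetical numbers, for illustration only.
corpus = {"Science": 0.05, "Tutorials": 0.05, "Other": 0.90}
target = {"Science": 0.20, "Tutorials": 0.15, "Other": 0.65}

factors = upsampling_factors(corpus, target)
for domain, factor in factors.items():
    print(f"{domain}: x{factor:.2f}")
```

A factor above 1 means pages from that domain are sampled more often than they occur in the raw corpus; a factor below 1 means they are sampled less often.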
Interesting finding: our topics and formats co-occur almost independently!
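One way to quantify "almost independently" is normalized mutual information (NMI) over a topic × format count table: 0 means fully independent, 1 means one label determines the other. The sketch below is an assumption about how one could check this, not necessarily the measure used in the paper, and the count tables are toy data.

```python
import math


def normalized_mutual_information(counts):
    """counts[i][j] = number of pages with topic i and format j.

    Returns MI(topic, format) / min(H(topic), H(format)), in [0, 1].
    """
    total = sum(sum(row) for row in counts)
    p_topic = [sum(row) / total for row in counts]
    p_format = [sum(col) / total for col in zip(*counts)]

    mi = 0.0
    for i, row in enumerate(counts):
        for j, c in enumerate(row):
            if c > 0:
                p = c / total
                mi += p * math.log(p / (p_topic[i] * p_format[j]))

    h_topic = -sum(p * math.log(p) for p in p_topic if p > 0)
    h_format = -sum(p * math.log(p) for p in p_format if p > 0)
    denom = min(h_topic, h_format)
    return mi / denom if denom > 0 else 0.0


# Toy tables: rows are topics, columns are formats.
independent = [[10, 20], [30, 60]]   # every row has the same format proportions
dependent = [[50, 0], [0, 50]]       # topic fully determines format

nmi_independent = normalized_mutual_information(independent)
nmi_dependent = normalized_mutual_information(dependent)
```

On real annotations, a value close to 0 would match the finding that topic and format labels carry largely separate information.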
We craft careful descriptions of topic and format categories and prompt an LLM to structure this loose collection of web pages
🔍 Explore our domains and see examples at weborganizer.allen.ai