Tomorrow: WebOrganizer w/ @soldaini.net & @kylelo.bsky.social
Thursday: MeCo by @gaotianyu1350.bsky.social
- Both at 11am - 1:30pm
- Both about how to improve pre-training with domains
- Both at stall # E-2600 in East Exhibition Hall A-B (!)
💡 FineWeb-Edu, DCLM-fasttext, and our RegMix predictions share similarities (e.g. all upsample Science topics) but also diverge (e.g. DCLM is more balanced across topics)
✅ Domain mixing complements quality filtering by being able to calibrate the training distribution!
And we can combine the topic and format predictions to curate data with even better performance! 📈
We leverage RegMix to study how the domains should be reweighted to benefit two downstream tasks commonly used as proxies for "data quality"
Prediction: Heavily upsample domains such as Science or Tutorials!
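For readers new to this kind of mixture search, the general recipe is: train small proxy models on random domain mixtures, fit a regression from mixture weights to downstream loss, then pick the mixture with the best predicted loss. Here is a minimal sketch of that recipe. Everything in it is an assumption for illustration: the three-domain list, the synthetic losses (not results from the paper), and the plain least-squares fit standing in for whatever regressor is actually used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical domain list; "Other" stands in for everything else.
domains = ["Science", "Tutorials", "Other"]

# Step 1: pretend 64 small proxy models were trained on random mixtures.
# All numbers here are synthetic and only illustrate the mechanics.
n_runs = 64
mixtures = rng.dirichlet(np.ones(len(domains)), size=n_runs)
made_up_loss_per_domain = np.array([2.2, 2.5, 3.2])
losses = mixtures @ made_up_loss_per_domain + rng.normal(0.0, 0.01, size=n_runs)

# Step 2: fit a linear regression from mixture weights to downstream loss.
# Mixture weights sum to 1, so no separate intercept column is needed.
coef, *_ = np.linalg.lstsq(mixtures, losses, rcond=None)

# Step 3: score many candidate mixtures and keep the best predicted one.
candidates = rng.dirichlet(np.ones(len(domains)), size=10_000)
best = candidates[np.argmin(candidates @ coef)]

for name, weight in zip(domains, best):
    print(f"{name}: {weight:.2f}")
```

With these made-up losses the search favors the domain with the lowest fitted coefficient, which is the sense in which a regression can "predict" that a domain should be upsampled.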
Interesting finding: our topics and formats co-occur almost independently!
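One common way to quantify "almost independently" is normalized mutual information (NMI) between the two label sets: 0 means knowing the topic tells you nothing about the format, 1 means it determines it. A minimal sketch, assuming each page carries one topic label and one format label; the label names and toy data are hypothetical, and this is not necessarily the measure used in the paper.

```python
import math
from collections import Counter

def normalized_mutual_information(pairs):
    """NMI between topic and format labels, given (topic, format) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    topics = Counter(t for t, _ in pairs)
    formats = Counter(f for _, f in pairs)

    # Mutual information: sum of p(t,f) * log(p(t,f) / (p(t) * p(f)))
    mi = sum(
        (c / n) * math.log((c / n) / ((topics[t] / n) * (formats[f] / n)))
        for (t, f), c in joint.items()
    )
    # Entropies of the two marginals
    h_t = -sum((c / n) * math.log(c / n) for c in topics.values())
    h_f = -sum((c / n) * math.log(c / n) for c in formats.values())

    denom = math.sqrt(h_t * h_f)
    return mi / denom if denom > 0 else 0.0

# Toy data with hypothetical labels.
independent = [("Science", "Tutorial"), ("Science", "News"),
               ("Health", "Tutorial"), ("Health", "News")]
dependent = [("Science", "Tutorial"), ("Health", "News")] * 2

nmi_independent = normalized_mutual_information(independent)
nmi_dependent = normalized_mutual_information(dependent)
```

In the first toy set every topic appears with every format equally often, so the score is 0; in the second the topic fully determines the format, so the score is 1.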
We craft careful descriptions of topic and format categories and prompt an LLM to structure this loose collection of web pages
🔍 Explore our domains and see examples at weborganizer.allen.ai
In our new paper, we propose WebOrganizer which *constructs domains* based on the topic and format of CommonCrawl web pages 🌐
Key takeaway: domains help us curate better pre-training data! 🧵/N