Alex Wettig
@awettig.bsky.social
PhD@Princeton trying to make sense of language models and their training data
Our domains also shine a light on which types of content are implicitly upsampled when using quality filters!

💡 FineWeb-Edu, DCLM-fasttext, and our RegMix predictions share similarities (e.g. all upsample Science topics) but also diverge (e.g. DCLM is more balanced across topics)
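The idea of measuring what a quality filter implicitly upsamples can be sketched as follows. This is a minimal illustration with made-up documents, topic labels, and quality scores, not the thread's actual pipeline: compare the topic distribution of the corpus before and after filtering.

```python
from collections import Counter

def topic_distribution(docs):
    """Normalized share of each topic label among the given documents."""
    counts = Counter(d["topic"] for d in docs)
    total = sum(counts.values())
    return {topic: n / total for topic, n in counts.items()}

# Hypothetical corpus: each doc carries a topic annotation and a quality score.
corpus = [
    {"topic": "Science",  "quality": 0.9},
    {"topic": "Science",  "quality": 0.8},
    {"topic": "Sports",   "quality": 0.3},
    {"topic": "Shopping", "quality": 0.2},
]

before = topic_distribution(corpus)
# Keep only docs passing a filter threshold, then compare distributions:
after = topic_distribution([d for d in corpus if d["quality"] > 0.5])
```

In this toy example, Science goes from half of the corpus to all of the filtered data, i.e. the filter implicitly upsamples Science content.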
February 18, 2025 at 12:31 PM
Instead of sampling uniformly from the domains, we can also pick the best documents within each domain according to quality filters. This improves the overall performance of two strong quality filters.

✅ Domain mixing complements quality filtering by being able to calibrate the training distribution!
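One way to combine the two signals, sketched below with hypothetical documents and weights (not the actual curation code): fix the target domain mixture, then spend each domain's share of the budget on its highest-quality documents.

```python
from collections import defaultdict

def select_with_mixture(docs, weights, budget):
    """Take the highest-quality documents from each domain, with each
    domain's share of the document budget set by target mixture weights."""
    by_domain = defaultdict(list)
    for d in docs:
        by_domain[d["domain"]].append(d)
    selected = []
    for domain, members in by_domain.items():
        k = round(budget * weights.get(domain, 0.0))
        members.sort(key=lambda d: d["quality"], reverse=True)
        # a domain may run out of documents before filling its share
        selected.extend(members[:k])
    return selected

docs = [
    {"domain": "Science", "quality": 0.9}, {"domain": "Science", "quality": 0.4},
    {"domain": "Sports",  "quality": 0.8}, {"domain": "Sports",  "quality": 0.7},
]
# Target mixture: 75% Science, 25% Sports, over a budget of 4 documents.
picked = select_with_mixture(docs, {"Science": 0.75, "Sports": 0.25}, budget=4)
```

Quality scoring ranks documents within each domain, while the mixture weights calibrate the distribution across domains.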
We test these domain mixtures by training 1B models and find that they improve performance across a range of tasks.

And we can combine the topic and format predictions to curate data with even better performance! 📈
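Since the two axes co-occur roughly independently, one simple way to combine topic and format predictions is to weight each (topic, format) cell by the product of separately tuned marginal weights. A toy sketch with made-up weights:

```python
# Hypothetical per-axis weights; in practice these would come from the
# mixture-optimization step.
topic_weights  = {"Science": 0.6, "Sports": 0.4}
format_weights = {"Tutorial": 0.7, "News": 0.3}

# Joint sampling weight for each (topic, format) cell as the product of
# the marginals, renormalized to sum to 1.
joint = {
    (t, f): wt * wf
    for t, wt in topic_weights.items()
    for f, wf in format_weights.items()
}
total = sum(joint.values())
joint = {cell: w / total for cell, w in joint.items()}
```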
How useful are these domains for data curation in practice?

We leverage RegMix to study how the domains should be reweighted to benefit two downstream tasks commonly used as proxies for "data quality".

Prediction: Heavily upsample domains such as Science or Tutorials!
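The RegMix-style recipe can be sketched as: train small proxy models on a handful of random domain mixtures, fit a regression from mixture weights to downstream score, and pick the candidate mixture the surrogate predicts to be best. All numbers below are made up for illustration.

```python
import numpy as np

# Each row: domain mixture weights used to train one small proxy model
# (columns: Science, Tutorials, Other); y: its downstream proxy score.
X = np.array([
    [0.2, 0.2, 0.6],
    [0.5, 0.1, 0.4],
    [0.1, 0.5, 0.4],
    [0.4, 0.4, 0.2],
])
y = np.array([0.50, 0.57, 0.53, 0.60])

# Fit a linear surrogate score(w) ≈ w @ coef ...
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# ... and use it to rank candidate mixtures without training full models.
candidates = np.array([
    [1/3, 1/3, 1/3],
    [0.5, 0.4, 0.1],   # heavily upsample Science and Tutorials
])
best = candidates[np.argmax(candidates @ coef)]
```

With these toy numbers the surrogate favors the mixture that upsamples Science and Tutorials, mirroring the prediction above.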
We distill the LLM outputs into small domain classifiers to annotate data at scale!

Interesting finding: our topics and formats co-occur almost independently!
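The distillation step can be sketched as follows. Here a tiny naive-Bayes-style bag-of-words model stands in for the actual small classifiers, and the LLM-annotated seed examples are invented: the LLM labels a modest sample, and the cheap distilled classifier then annotates data at scale.

```python
from collections import Counter, defaultdict

# Hypothetical LLM-annotated seed set: (text, topic) pairs.
seed = [
    ("the experiment measured particle decay rates", "Science"),
    ("new study on protein folding published", "Science"),
    ("the team won the championship game last night", "Sports"),
    ("coach announces starting lineup for the final", "Sports"),
]

# "Distilled" model: per-topic word frequencies learned from the seed labels.
word_counts = defaultdict(Counter)
for text, topic in seed:
    word_counts[topic].update(text.split())

def classify(text):
    def score(topic):
        counts, total = word_counts[topic], sum(word_counts[topic].values())
        # add-one smoothing so unseen words don't zero out a topic
        return sum((counts[w] + 1) / (total + 1) for w in text.split())
    return max(word_counts, key=score)

label = classify("a study of decay in particle physics")
```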
Modern pre-training relies on crawling the web to collect trillions of tokens.

We craft careful descriptions of topic and format categories and prompt an LLM to structure this loose collection of web pages.

🔍 Explore our domains and see examples at weborganizer.allen.ai
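The prompting setup can be sketched as below. The category names, descriptions, and template here are invented placeholders; the actual taxonomy and prompts are at weborganizer.allen.ai.

```python
# Hypothetical category descriptions for the classification prompt.
TOPICS = {
    "Science": "research, experiments, and scientific explanations",
    "Sports": "athletic events, teams, and competitions",
}

def build_prompt(page_text, categories):
    """Assemble a single-label classification prompt from the descriptions."""
    lines = ["Assign the web page below to exactly one category.", ""]
    for name, desc in categories.items():
        lines.append(f"- {name}: {desc}")
    lines += ["", "Web page:", page_text, "", "Category:"]
    return "\n".join(lines)

prompt = build_prompt("The team won the final...", TOPICS)
```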