Alex Wettig
@awettig.bsky.social
PhD@Princeton trying to make sense of language models and their training data
Our domains also shine a light on which types of content are implicitly upsampled when using quality filters!

💡 FineWeb-Edu, DCLM-fasttext, and our RegMix predictions share similarities (e.g. all upsample Science topics) but also diverge (e.g. DCLM is more balanced across topics)
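The idea of measuring what a quality filter implicitly upsamples can be sketched as follows. This is a minimal illustration with made-up documents, topic labels, and quality scores, not the thread's actual pipeline: compare the topic distribution of the corpus before and after filtering.

```python
from collections import Counter

def topic_distribution(docs):
    """Normalized share of each topic label among the given documents."""
    counts = Counter(d["topic"] for d in docs)
    total = sum(counts.values())
    return {topic: n / total for topic, n in counts.items()}

# Hypothetical corpus: each doc carries a topic annotation and a quality score.
corpus = [
    {"topic": "Science",  "quality": 0.9},
    {"topic": "Science",  "quality": 0.8},
    {"topic": "Sports",   "quality": 0.3},
    {"topic": "Shopping", "quality": 0.2},
]

before = topic_distribution(corpus)
# Keep only docs passing a filter threshold, then compare distributions:
after = topic_distribution([d for d in corpus if d["quality"] > 0.5])
```

In this toy example, Science goes from half of the corpus to all of the filtered data, i.e. the filter implicitly upsamples Science content.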
February 18, 2025 at 12:31 PM
Instead of sampling uniformly from the domains, we can also pick the best documents within each domain according to quality filters. This improves the overall performance of two strong quality filters.

✅ Domain mixing complements quality filtering by being able to calibrate the training distribution!
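One way to combine the two signals, sketched below with hypothetical documents and weights (not the actual curation code): fix the target domain mixture, then spend each domain's share of the budget on its highest-quality documents.

```python
from collections import defaultdict

def select_with_mixture(docs, weights, budget):
    """Take the highest-quality documents from each domain, with each
    domain's share of the document budget set by target mixture weights."""
    by_domain = defaultdict(list)
    for d in docs:
        by_domain[d["domain"]].append(d)
    selected = []
    for domain, members in by_domain.items():
        k = round(budget * weights.get(domain, 0.0))
        members.sort(key=lambda d: d["quality"], reverse=True)
        # a domain may run out of documents before filling its share
        selected.extend(members[:k])
    return selected

docs = [
    {"domain": "Science", "quality": 0.9}, {"domain": "Science", "quality": 0.4},
    {"domain": "Sports",  "quality": 0.8}, {"domain": "Sports",  "quality": 0.7},
]
# Target mixture: 75% Science, 25% Sports, over a budget of 4 documents.
picked = select_with_mixture(docs, {"Science": 0.75, "Sports": 0.25}, budget=4)
```

Quality scoring ranks documents within each domain, while the mixture weights calibrate the distribution across domains.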
We test these domain mixtures by training 1B models and find that they improve performance across a range of tasks.

And we can combine the topic and format predictions to curate data with even better performance! 📈
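Since the two axes co-occur roughly independently, one simple way to combine topic and format predictions is to weight each (topic, format) cell by the product of separately tuned marginal weights. A toy sketch with made-up weights:

```python
# Hypothetical per-axis weights; in practice these would come from the
# mixture-optimization step.
topic_weights  = {"Science": 0.6, "Sports": 0.4}
format_weights = {"Tutorial": 0.7, "News": 0.3}

# Joint sampling weight for each (topic, format) cell as the product of
# the marginals, renormalized to sum to 1.
joint = {
    (t, f): wt * wf
    for t, wt in topic_weights.items()
    for f, wf in format_weights.items()
}
total = sum(joint.values())
joint = {cell: w / total for cell, w in joint.items()}
```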
How useful are these domains for data curation in practice?

We leverage RegMix to study how the domains should be reweighted to benefit two downstream tasks commonly used as proxies for "data quality".

Prediction: Heavily upsample domains such as Science or Tutorials!
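The RegMix-style recipe can be sketched as: train small proxy models on a handful of random domain mixtures, fit a regression from mixture weights to downstream score, and pick the candidate mixture the surrogate predicts to be best. All numbers below are made up for illustration.

```python
import numpy as np

# Each row: domain mixture weights used to train one small proxy model
# (columns: Science, Tutorials, Other); y: its downstream proxy score.
X = np.array([
    [0.2, 0.2, 0.6],
    [0.5, 0.1, 0.4],
    [0.1, 0.5, 0.4],
    [0.4, 0.4, 0.2],
])
y = np.array([0.50, 0.57, 0.53, 0.60])

# Fit a linear surrogate score(w) ≈ w @ coef ...
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# ... and use it to rank candidate mixtures without training full models.
candidates = np.array([
    [1/3, 1/3, 1/3],
    [0.5, 0.4, 0.1],   # heavily upsample Science and Tutorials
])
best = candidates[np.argmax(candidates @ coef)]
```

With these toy numbers the surrogate favors the mixture that upsamples Science and Tutorials, mirroring the prediction above.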
We distill the LLM outputs into small domain classifiers to annotate data at scale!

Interesting finding: our topics and formats co-occur almost independently!
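The distillation step can be sketched as follows. Here a tiny naive-Bayes-style bag-of-words model stands in for the actual small classifiers, and the LLM-annotated seed examples are invented: the LLM labels a modest sample, and the cheap distilled classifier then annotates data at scale.

```python
from collections import Counter, defaultdict

# Hypothetical LLM-annotated seed set: (text, topic) pairs.
seed = [
    ("the experiment measured particle decay rates", "Science"),
    ("new study on protein folding published", "Science"),
    ("the team won the championship game last night", "Sports"),
    ("coach announces starting lineup for the final", "Sports"),
]

# "Distilled" model: per-topic word frequencies learned from the seed labels.
word_counts = defaultdict(Counter)
for text, topic in seed:
    word_counts[topic].update(text.split())

def classify(text):
    def score(topic):
        counts, total = word_counts[topic], sum(word_counts[topic].values())
        # add-one smoothing so unseen words don't zero out a topic
        return sum((counts[w] + 1) / (total + 1) for w in text.split())
    return max(word_counts, key=score)

label = classify("a study of decay in particle physics")
```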
Modern pre-training relies on crawling the web to collect trillions of tokens.

We craft careful descriptions of topic and format categories and prompt an LLM to structure this loose collection of web pages.

🔍 Explore our domains and see examples at weborganizer.allen.ai
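The prompting setup can be sketched as below. The category names, descriptions, and template here are invented placeholders; the actual taxonomy and prompts are at weborganizer.allen.ai.

```python
# Hypothetical category descriptions for the classification prompt.
TOPICS = {
    "Science": "research, experiments, and scientific explanations",
    "Sports": "athletic events, teams, and competitions",
}

def build_prompt(page_text, categories):
    """Assemble a single-label classification prompt from the descriptions."""
    lines = ["Assign the web page below to exactly one category.", ""]
    for name, desc in categories.items():
        lines.append(f"- {name}: {desc}")
    lines += ["", "Web page:", page_text, "", "Category:"]
    return "\n".join(lines)

prompt = build_prompt("The team won the final...", TOPICS)
```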