Ian Magnusson
@ianmagnusson.bsky.social
Science of language models @uwnlp.bsky.social and @ai2.bsky.social with @PangWeiKoh and @nlpnoah.bsky.social. https://ianmagnusson.github.io
Come chat with me at #NeurIPS2024 and learn about how to use Paloma to evaluate perplexity over hundreds of domains! ✨We have stickers too✨
December 10, 2024 at 3:54 AM
Further decomposing perplexity, we find that some vocabulary strings get worse as models scale (see examples and the sketch below) ✍️
Again, this is not always bad, but Paloma reports the average loss of each vocabulary string, surfacing strings that behave differently in some domains.
December 20, 2023 at 8:33 PM
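A minimal sketch of how such a per-string decomposition can be computed, assuming you already have per-position losses and token ids in hand; the function is illustrative, not Paloma's actual implementation:

```python
import numpy as np

def per_string_avg_loss(token_ids: np.ndarray, losses: np.ndarray,
                        vocab_size: int) -> np.ndarray:
    """Mean loss over every occurrence of each vocabulary string.

    token_ids: 1D array of token ids in a domain's eval data.
    losses:    1D array of the model's loss at each position.
    Returns shape (vocab_size,); NaN for strings that never occur.
    """
    totals = np.zeros(vocab_size)
    counts = np.zeros(vocab_size)
    np.add.at(totals, token_ids, losses)  # sum losses per token id
    np.add.at(counts, token_ids, 1)       # count occurrences per token id
    with np.errstate(invalid="ignore"):   # 0/0 -> NaN for unseen strings
        return totals / counts
```

Comparing this output between model sizes surfaces strings whose loss rises with scale, like the examples above.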
We also show that performance improves in almost all domains as models are scaled, but domains improve unequally 📈📉
Differences in improvement, like these examples, can indicate divergence, stagnation, or saturation. Not all bad, but worth investigating (see the sketch below)!
December 20, 2023 at 8:33 PM
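A hypothetical helper for quantifying unequal improvement, assuming per-domain perplexities for a smaller and a larger model are already computed (nothing here is from the paper):

```python
def domain_improvement(ppl_small: dict[str, float],
                       ppl_large: dict[str, float]) -> dict[str, float]:
    """Relative perplexity improvement per domain when scaling up.

    Values near 0 flag stagnation; negative values flag divergence
    (the larger model is worse on that domain).
    """
    return {d: (ppl_small[d] - ppl_large[d]) / ppl_small[d]
            for d in ppl_small}
```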
We pretrain six 1B baselines on popular corpora 🤖

With these we find that Common-Crawl-only pretraining fits many domains inconsistently:
1. The C4 and mC4 baselines show erratically worse fit than the median model
2. The C4, mC4, and Falcon baselines sometimes show non-monotonic perplexity over training in Fig 1 (see the sketch below)
December 20, 2023 at 8:32 PM
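A quick way to flag case 2 programmatically; this monotonicity check is a sketch, not code from the benchmark:

```python
def monotonically_improving(ppl_by_checkpoint: list[float]) -> bool:
    """True if perplexity never rises between consecutive checkpoints."""
    return all(later <= earlier for earlier, later
               in zip(ppl_by_checkpoint, ppl_by_checkpoint[1:]))

# A baseline whose perplexity rises mid-training on some domain is flagged:
assert not monotonically_improving([42.0, 35.5, 37.1, 33.0])
```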
Along with the datasets we curate, we build eval corpora from held-out Dolma data that sample:
💬 the top 100 subreddits
🧑‍💻 the top 100 programming languages
Different research may require other domains, but Paloma enables research on 100s of domains drawn from existing metadata (see the sketch below).
December 20, 2023 at 8:30 PM
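A rough sketch of that kind of metadata-driven domain selection; the `docs` structure and field names like "subreddit" are assumptions for illustration:

```python
from collections import Counter

def top_k_domains(docs: list[dict], key: str, k: int = 100) -> list[str]:
    """The k most frequent values of a metadata field, each of which
    can then define one eval domain."""
    return [value for value, _ in
            Counter(doc[key] for doc in docs).most_common(k)]

# e.g. top_k_domains(reddit_docs, key="subreddit") or
#      top_k_domains(code_docs, key="language")
```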
We introduce guidelines and implement controls for LM experiments 📋:
1. Remove pretraining data contaminated with eval data (sketch below)
2. Fix the training data order
3. Subsample eval data based on metric variance
4. Fix the vocabulary, unless you are studying changes to it
5. Standardize the eval format
December 20, 2023 at 8:30 PM
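For control 1, one common approach is n-gram overlap filtering, sketched below with whitespace tokenization and the conventional n = 13; this is illustrative, not necessarily Paloma's exact procedure:

```python
def ngrams(tokens: list[str], n: int = 13) -> set[tuple[str, ...]]:
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def decontaminate(train_docs: list[str], eval_docs: list[str],
                  n: int = 13) -> list[str]:
    """Drop training documents that share any n-gram with the eval data."""
    eval_grams = set()
    for doc in eval_docs:
        eval_grams |= ngrams(doc.split(), n)
    return [doc for doc in train_docs
            if not ngrams(doc.split(), n) & eval_grams]
```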
Paloma benchmark results are organized by comparability of:
🧪 controls like benchmark decontamination
💸 measures of cost (parameter and training token count)

Find out more:
📃 arXiv (arxiv.org/pdf/2312.105...)
🤖 data and models (huggingface.co/collections/...)
December 20, 2023 at 8:29 PM
LMs are used to process text from many topics, styles, dialects, etc., but how well do they do?

📈 Evaluating perplexity on just one corpus like C4 doesn't tell the whole story 📉

✨📃✨
We introduce Paloma, a benchmark of 585 text domains, ranging from the NY Times to r/depression on Reddit.
December 20, 2023 at 8:28 PM
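For reference, perplexity is just the exponentiated mean negative log-likelihood per token; a minimal sketch of reporting it per domain (the domain names and numbers are made up for illustration):

```python
import math

def perplexity(nll_sum: float, token_count: int) -> float:
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(nll_sum / token_count)

# One number over a single corpus hides variation; Paloma's idea is to
# report it per domain instead:
stats = {"nytimes": (2.1e6, 1.0e6), "r/depression": (2.9e6, 1.0e6)}
ppl_by_domain = {domain: perplexity(nll, n_tokens)
                 for domain, (nll, n_tokens) in stats.items()}
```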