Matthew Leavitt
@leavittron.bsky.social
Chief Science Officer, Co-Founder @datologyai
Former: Head of Data Research @MosaicML; FAIR.
views are from nowhere
We can also use our data curation to train better, smaller models that save on inference: a 1.3B model trained on 180B tokens of our data has better 5-shot performance than every 2.7B model we trained on public datasets, on a token-matched (NOT FLOPs-matched) basis. FLOPs-matched is even better
November 25, 2024 at 5:49 PM
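For the curious, here's how token-matched vs FLOPs-matched shakes out under the common ~6 · params · tokens training-FLOPs approximation (a rough sketch, not the exact accounting behind these results):

```python
# Rough training-FLOPs comparison using the common ~6 * params * tokens
# approximation; illustrative only, not the exact accounting behind the results.

def train_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

small = train_flops(1.3e9, 180e9)  # 1.3B model, 180B tokens (token-matched)
large = train_flops(2.7e9, 180e9)  # 2.7B model, same token budget

print(f"1.3B uses {small / large:.0%} of the 2.7B model's training FLOPs")  # ~48%

# FLOPs-matched would instead give the 1.3B model ~2.1x the tokens:
flops_matched_tokens = large / (6 * 1.3e9)
print(f"FLOPs-matched token budget for 1.3B: ~{flops_matched_tokens / 1e9:.0f}B tokens")  # ~374B
```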
Our curated data also allows us to train faster! We save 86.9% on compute (7.7x speedup) training a 2.7B model on our data to reach the same avg 5-shot accuracy as training on RPJv1 for 180B tokens, and save 70.1% on compute (3.4x speedup) to reach the same accuracy as DCLM
November 25, 2024 at 5:49 PM
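Quick arithmetic on how "% compute saved" maps to "x speedup" (just the conversion; the reported numbers come from matching actual 5-shot accuracy):

```python
# Converting between "% compute saved" and "x speedup" at matched accuracy:
# speedup = 1 / (1 - savings)

def speedup_from_savings(savings: float) -> float:
    return 1.0 / (1.0 - savings)

print(f"{speedup_from_savings(0.869):.1f}x")  # ~7.6x, in line with the reported 7.7x vs RPJv1
print(f"{speedup_from_savings(0.701):.1f}x")  # ~3.3x, in line with the reported 3.4x vs DCLM
```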
Interestingly, we also find that starting with a larger pool of data to curate from yields a much better final dataset.
November 25, 2024 at 5:49 PM
With our curated data we were able to train better models: 8.4 percentage-point (pp) mean 5-shot improvement over RPJv1, +6.1pp vs FineWeb-Edu (FW-Edu), and +4.4pp vs DCLM. This is no small feat: FineWeb, FineWeb-Edu, and DCLM are VERY high-quality, meticulously-curated datasets
November 25, 2024 at 5:49 PM
Our data curation pipeline is a scalable, productionized system that integrates a suite of bleeding-edge algorithms to curate data in the quantity necessary for foundation model pretraining. And with it, we developed a single recipe that we used to curate RPJv1
November 25, 2024 at 5:49 PM
tl;dr: We transformed RedPajama-v1 (RPJv1) into a dataset that outperforms FineWeb-Edu and DCLM, two of the strongest publicly-available text pretraining datasets. Let me walk you through how we did it
November 25, 2024 at 5:49 PM
Tired: Bringing up politics at Thanksgiving

Wired: Bringing up @datologyai.com’s new text curation results at Thanksgiving

That’s right, we applied our data curation pipeline to text pretraining data and the results are hot enough to roast a 🦃
🧵
November 25, 2024 at 5:49 PM
HUGE shoutout to Haoli Yin, Amro Abbas, and (Evil) Josh Wills for leading this work. You did an amazing job! Oh, and stay tuned for more announcements from us. Our curation pipeline works for text, too 😉
November 14, 2024 at 5:16 PM
One component of our pipeline is synthetic image recaptioning, so we compare to strong methods like MetaCLIPv2 & LaCLIP. And our retrieval-optimized data outperforms both of them on retrieval tasks, despite those models training on 2.5x as many samples with 4x the batch size
November 14, 2024 at 5:16 PM
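For context, synthetic recaptioning means swapping noisy web alt-text for captions generated by a captioning model. A minimal sketch using BLIP as a stand-in captioner (not our actual recaptioning model, prompting, or filtering):

```python
# Minimal sketch: generate a synthetic caption for an image with BLIP.
# Stand-in for illustration only.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def recaption(image_path: str) -> str:
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output_ids = captioner.generate(**inputs, max_new_tokens=30)
    return processor.decode(output_ids[0], skip_special_tokens=True)

# The synthetic caption (or a mix of synthetic + original alt-text) is then
# paired with the image as the text side of an image-text training example.
```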
And our classification-optimized dataset gets better performance (absolute and normalized—see the explanation in the table) than any other DataComp Large submission.
November 14, 2024 at 5:16 PM
But how does our curation pipeline stack up against published research? We also compared to a menagerie of other models. Compared to external ViT-B/32 models, we achieve superior retrieval performance, even to models trained for over 6x longer and on datasets over ~4x larger
November 14, 2024 at 5:16 PM
And our curation did rather well 🙂 So what _is_ our curation? It’s a scalable, productionized pipeline that integrates a suite of bleeding-edge algorithms to curate data in the quantity necessary for foundation model pretraining
November 14, 2024 at 5:16 PM
We were able to save up to ~98% on compute (43x training speedup), improve CLIP ViT-B/32 model quality by up to 13 percentage points, and train a ViT-S/32 (a model w/ less than half the parameters of a ViT-B/32) to MUCH higher quality, all by training on well-curated data
November 14, 2024 at 5:16 PM
🧵We’ve spent the last few months at @datologyai.bsky.social building a state-of-the-art data curation pipeline and I’m SO excited to share our first results: we curated image-text pretraining data and massively improved CLIP model quality, training speed, and inference efficiency 🔥🔥🔥
November 14, 2024 at 5:16 PM