Matthew Leavitt
@leavittron.bsky.social
Chief Science Officer, Co-Founder @datologyai
Former: Head of Data Research @MosaicML; FAIR.
views are from nowhere
We can also use our data curation to train better, smaller models that save on inference: a 1.3B model trained on 180B tokens of our data has better 5-shot performance than every 2.7B model we trained on public datasets, on a token-matched (NOT FLOPs-matched) basis. FLOPs-matched is even better
November 25, 2024 at 5:49 PM
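For the curious, here's how token-matched vs FLOPs-matched shakes out under the common ~6 · params · tokens training-FLOPs approximation (a rough sketch, not the exact accounting behind these results):

```python
# Rough training-FLOPs comparison using the common ~6 * params * tokens
# approximation; illustrative only, not the exact accounting behind the results.

def train_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

small = train_flops(1.3e9, 180e9)  # 1.3B model, 180B tokens (token-matched)
large = train_flops(2.7e9, 180e9)  # 2.7B model, same token budget

print(f"1.3B uses {small / large:.0%} of the 2.7B model's training FLOPs")  # ~48%

# FLOPs-matched would instead give the 1.3B model ~2.1x the tokens:
flops_matched_tokens = large / (6 * 1.3e9)
print(f"FLOPs-matched token budget for 1.3B: ~{flops_matched_tokens / 1e9:.0f}B tokens")  # ~374B
```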
Our curated data also allows us to train faster! We save 86.9% on compute (7.7x speedup) training a 2.7B model on our data to reach the same avg 5-shot accuracy as training on RPJv1 for 180B tokens, and save 70.1% on compute (3.4x speedup) to reach the same accuracy as DCLM
November 25, 2024 at 5:49 PM
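Quick arithmetic on how "% compute saved" maps to "x speedup" (just the conversion; the reported numbers come from matching actual 5-shot accuracy):

```python
# Converting between "% compute saved" and "x speedup" at matched accuracy:
# speedup = 1 / (1 - savings)

def speedup_from_savings(savings: float) -> float:
    return 1.0 / (1.0 - savings)

print(f"{speedup_from_savings(0.869):.1f}x")  # ~7.6x, in line with the reported 7.7x vs RPJv1
print(f"{speedup_from_savings(0.701):.1f}x")  # ~3.3x, in line with the reported 3.4x vs DCLM
```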
Interestingly, we also find that starting with a larger pool of data to curate from yields a much better final dataset.
November 25, 2024 at 5:49 PM
With our curated data we were able to train better models: 8.4 percentage-point (pp) mean 5-shot improvement over RPJv1, +6.1pp vs FineWeb-Edu (FW-Edu), and +4.4pp vs DCLM. This is no small feat: FineWeb, FineWeb-Edu, and DCLM are VERY high-quality, meticulously-curated datasets
November 25, 2024 at 5:49 PM
Our data curation pipeline is a scalable, productionized system that integrates a suite of bleeding-edge algorithms to curate data in the quantity necessary for foundation model pretraining. And with it, we developed a single recipe that we used to curate RPJv1
November 25, 2024 at 5:49 PM
tl;dr: We transformed RedPajama-v1 (RPJv1) into a dataset that outperforms FineWeb-Edu and DCLM, two of the strongest publicly-available text pretraining datasets. Let me walk you through how we did it
November 25, 2024 at 5:49 PM
Tired: Bringing up politics at Thanksgiving

Wired: Bringing up @datologyai.com’s new text curation results at Thanksgiving

That’s right, we applied our data curation pipeline to text pretraining data and the results are hot enough to roast a 🦃
🧵
November 25, 2024 at 5:49 PM
HUGE shoutout to Haoli Yin, Amro Abbas, and (Evil) Josh Wills for leading this work. You did an amazing job! Oh, and stay tuned for more announcements from us. Our curation pipeline works for text, too 😉
November 14, 2024 at 5:16 PM
One component of our pipeline is synthetic image recaptioning, so we compare to strong methods like MetaCLIPv2 & LaCLIP. And our retrieval-optimized data outperforms both of them on retrieval tasks, despite those models training on 2.5x as many samples with 4x the batch size
November 14, 2024 at 5:16 PM
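For context, synthetic recaptioning means swapping noisy web alt-text for captions generated by a captioning model. A minimal sketch using BLIP as a stand-in captioner (not our actual recaptioning model, prompting, or filtering):

```python
# Minimal sketch: generate a synthetic caption for an image with BLIP.
# Stand-in for illustration only.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def recaption(image_path: str) -> str:
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output_ids = captioner.generate(**inputs, max_new_tokens=30)
    return processor.decode(output_ids[0], skip_special_tokens=True)

# The synthetic caption (or a mix of synthetic + original alt-text) is then
# paired with the image as the text side of an image-text training example.
```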
And our classification-optimized dataset gets better performance (absolute and normalized—see the explanation in the table) than any other DataComp Large submission.
November 14, 2024 at 5:16 PM
But how does our curation pipeline stack up against published research? We also compared to a menagerie of other models. Compared to external ViT-B/32 models, we achieve superior retrieval performance, even to models trained for over 6x longer and on datasets over ~4x larger
November 14, 2024 at 5:16 PM
And our curation did rather well 🙂 So what _is_ our curation? It’s a scalable, productionized pipeline that integrates a suite of bleeding-edge algorithms to curate data in the quantity necessary for foundation model pretraining
November 14, 2024 at 5:16 PM
We were able to save up to ~98% on compute (43x training speedup), improve CLIP ViT-B/32 model quality by up to 13 percentage points, and train a ViT-S/32 (a model w/ less than half the parameters of a ViT-B/32) to MUCH higher quality, all by training on well-curated data
November 14, 2024 at 5:16 PM
🧵We’ve spent the last few months at @datologyai.bsky.social building a state-of-the-art data curation pipeline and I’m SO excited to share our first results: we curated image-text pretraining data and massively improved CLIP model quality, training speed, and inference efficiency 🔥🔥🔥
November 14, 2024 at 5:16 PM