blue = draft model tokens
red = target model tokens
yellow = bonus target model tokens
#dataviz am I doing this right?
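A minimal sketch of one way to render that legend, assuming ANSI terminal colors; the token strings and roles below are made up for illustration, not real model output.

```python
# Minimal sketch: print speculative-decoding output using the legend above
# (blue = draft model tokens, red = target model tokens, yellow = bonus
# target model token). Tokens and roles here are made-up placeholders.

ANSI = {"blue": "\033[94m", "red": "\033[91m", "yellow": "\033[93m", "end": "\033[0m"}

def render(tokens):
    """tokens: list of (text, role) pairs, role in {'draft', 'target', 'bonus'}."""
    color = {"draft": "blue", "target": "red", "bonus": "yellow"}
    return "".join(f"{ANSI[color[role]]}{text}{ANSI['end']}" for text, role in tokens)

example = [
    ("The ", "draft"), ("quick ", "draft"), ("brown ", "draft"),
    ("fox ", "target"),   # draft token rejected, target model's token used instead
    ("jumps", "bonus"),   # extra token from the target model's verification pass
]
print(render(example))
```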
And it’s not the only thing we’ve been working on. The rest of the team has been cooking on text curation since the beginning, so stay tuned for text curation results for LLM pretraining, coming soon!
13/n
While not the primary target, our data curation work produced CLIP models that are competitive with or superior to external models. We’ve benchmarked extensively: >10x data efficiency with better performance. See the blog post for more!
12/n
Even with a ~2.4x FLOPs reduction, we attain an absolute 13% improvement for retrieval and 9.2% for classification
11/n
Specialized pretraining is the future, powered by curation at scale
The cost of training the smaller model also amortizes quickly (think millions of API calls/day); back-of-envelope sketch below
10/n
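The back-of-envelope math, with entirely hypothetical numbers (training cost, per-call saving, and call volume are placeholders, not figures from our deployment):

```python
# Hypothetical amortization: a smaller specialized model costs something
# up front to train, but saves compute on every call; at high request
# volume the break-even point arrives quickly.

train_cost_usd = 50_000        # hypothetical one-time training cost for the smaller model
saving_per_call_usd = 0.0002   # hypothetical per-call inference saving vs. the larger model
calls_per_day = 5_000_000      # "millions of API calls/day"

break_even_days = train_cost_usd / (saving_per_call_usd * calls_per_day)
print(f"Break-even in ~{break_even_days:.0f} days")  # ~50 days with these numbers
```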
Productionizing models requires inference optimizations that trade quality for speed, and those don’t work well for overtrained generalist models.
9/n
Improve model quality by up to ~13% absolute (22% relative) for the same training cost
Curation means training on in-domain data for the end tasks you care about! One way to select such data is sketched below.
8/n
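One common way to select in-domain data, shown as a minimal sketch; this is illustrative and not necessarily the method in our pipeline. It assumes you already have embeddings for the candidate pool and a handful of end-task examples.

```python
# Illustrative in-domain selection: embed a few end-task examples, then keep
# the pool samples closest to the task centroid in embedding space.
import numpy as np

def select_in_domain(pool_emb: np.ndarray, task_emb: np.ndarray, keep_frac: float = 0.2):
    """pool_emb: (N, d) candidate embeddings; task_emb: (M, d) end-task embeddings."""
    # L2-normalize so dot products are cosine similarities
    pool = pool_emb / np.linalg.norm(pool_emb, axis=1, keepdims=True)
    task = task_emb / np.linalg.norm(task_emb, axis=1, keepdims=True)
    centroid = task.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    scores = pool @ centroid                  # similarity to the end-task centroid
    k = max(1, int(keep_frac * len(pool)))
    return np.argsort(-scores)[:k]            # indices of the most in-domain samples

# Toy usage with random embeddings
rng = np.random.default_rng(0)
idx = select_in_domain(rng.normal(size=(1000, 512)), rng.normal(size=(32, 512)))
```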
Retrieval: 28.8x-43x training speedup vs baseline
Classification: 2.88x-13x vs baseline
We filter out redundant & harmful data to reach equal performance much faster; an illustrative filter is sketched below
7/n
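An illustrative filter in the same spirit, not the production pipeline: drop near-duplicates by embedding similarity and drop samples a safety scorer flags. The `unsafe_score` input is a hypothetical per-sample score in [0, 1].

```python
# Illustrative redundancy + harmful-content filter (toy, O(N^2) scan; real
# pipelines use ANN indexes or clustering to scale to ~1B samples).
import numpy as np

def filter_pool(emb: np.ndarray, unsafe_score: np.ndarray,
                dup_thresh: float = 0.95, unsafe_thresh: float = 0.5):
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    keep, kept_emb = [], []
    for i in range(len(emb)):
        if unsafe_score[i] >= unsafe_thresh:
            continue                                            # drop harmful samples
        if kept_emb and np.max(np.stack(kept_emb) @ emb[i]) >= dup_thresh:
            continue                                            # drop near-duplicates of kept samples
        keep.append(i)
        kept_emb.append(emb[i])
    return keep

# Toy usage with random embeddings and random safety scores
rng = np.random.default_rng(0)
kept = filter_pool(rng.normal(size=(500, 128)), rng.uniform(size=500))
```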
Data: image-text data from DataComp’s CommonPool up to 1B samples
Curate: separate strategies for retrieval and classification tasks
Train: CLIP ViT-B/16 & ViT-B/32, 8k batch size (minimal training-step sketch below)
Evals: DataComp + SugarCrepe
6/n
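For context, a minimal single-device sketch of one CLIP contrastive training step with the open_clip library, using the model name from the setup above; the actual runs (8k batch, distributed training, curated CommonPool shards, DataComp + SugarCrepe evals) are far more involved, and the batch below is random placeholder data.

```python
# Minimal CLIP contrastive training step with open_clip (placeholder data).
import torch
import torch.nn.functional as F
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-16")
tokenizer = open_clip.get_tokenizer("ViT-B-16")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.2)
model.train()

images = torch.randn(8, 3, 224, 224)              # placeholder image batch
texts = tokenizer(["a photo of a dog"] * 8)       # placeholder captions

img_f = F.normalize(model.encode_image(images), dim=-1)
txt_f = F.normalize(model.encode_text(texts), dim=-1)
logits = model.logit_scale.exp() * img_f @ txt_f.T   # (batch, batch) similarity matrix
labels = torch.arange(len(images))
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

loss.backward()
optimizer.step()
```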
It’s a data pipeline composed of cutting-edge research put into production and battle-tested at scale. Shown below are a few themes that the individual algorithms fall under, and more details can be found in the blog post!
5/n
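A sketch of the composition idea: the pipeline as a chain of stages, each taking and returning a list of samples. The stage names here are illustrative placeholders, not the actual algorithms in the product.

```python
# Sketch: a curation pipeline as composed stages (placeholders, not the real algorithms).
from typing import Callable, Iterable, List

Sample = dict          # e.g. {"url": ..., "caption": ..., "embedding": ...}
Stage = Callable[[List[Sample]], List[Sample]]

def run_pipeline(samples: List[Sample], stages: Iterable[Stage]) -> List[Sample]:
    for stage in stages:
        samples = stage(samples)          # each stage filters / reweights / rewrites
    return samples

# Illustrative stages
drop_malformed = lambda s: [x for x in s if x.get("caption")]
dedup = lambda s: list({x["url"]: x for x in s}.values())   # exact-duplicate removal by URL

curated = run_pipeline(
    [{"url": "a", "caption": "dog"}, {"url": "a", "caption": "dog"}, {"url": "b", "caption": ""}],
    [drop_malformed, dedup],
)
print(curated)   # -> [{'url': 'a', 'caption': 'dog'}]
```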
It’s a pivotal time in AI to unlock tremendous societal value & train domain-specific foundation models outside of large labs
With scalable data curation, we can:
1) train models **faster**
2) train models **better**
3) train models **smaller**
4/n
Pretraining data curation is the real secret sauce of frontier labs.
The data research that does get published either:
1) doesn’t work
2) doesn’t work at scale
3) isn’t efficient, reliable, or robust enough to productionize
3/n