Haoli Yin
@haoliyin.bsky.social
multimodal data curation @datologyai.com. https://haoliyin.me
looks like my reach on Twitter is low 😅
December 4, 2024 at 10:44 PM
Was working on some model inference optimization research (speculative decoding) but in the multimodal setting with vision-language models (i.e. conditioned on images)

blue = draft model tokens
red = target model tokens
yellow = bonus target model tokens

#dataviz am I doing this right?
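To make the color coding concrete, here is a toy sketch of one speculative-decoding accept/reject step. It is not the research code: the hypothetical speculative_step helper uses random NumPy probabilities in place of a real draft model and a real image-conditioned target model, and the draft proposes greedily for simplicity.

```python
# Toy sketch of one speculative-decoding step (hypothetical helper, random
# probabilities standing in for real models):
#   accepted draft tokens            -> "blue" in the viz
#   target correction on rejection   -> "red"
#   free extra target token when all drafts pass -> "yellow" bonus
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(p_draft, p_target, draft_tokens):
    """p_draft, p_target: (k+1, vocab) next-token probabilities per position;
    draft_tokens: (k,) tokens proposed by the draft model."""
    out = []
    k = len(draft_tokens)
    for i, tok in enumerate(draft_tokens):
        # Acceptance test: keep the draft token with prob min(1, p_target/p_draft).
        if rng.random() < min(1.0, p_target[i, tok] / p_draft[i, tok]):
            out.append(int(tok))  # blue: draft token accepted
        else:
            # Rejected: resample from the normalized positive residual of p_target - p_draft.
            residual = np.clip(p_target[i] - p_draft[i], 0.0, None)
            out.append(int(rng.choice(len(residual), p=residual / residual.sum())))  # red
            return out
    # All k drafts accepted: the target's extra position yields a bonus token.
    out.append(int(rng.choice(p_target.shape[1], p=p_target[k])))  # yellow
    return out

# Toy example: vocab of 8, k = 4 drafted tokens.
k, vocab = 4, 8
p_draft = rng.dirichlet(np.ones(vocab), size=k + 1)
p_target = rng.dirichlet(np.ones(vocab), size=k + 1)
draft_tokens = p_draft[:k].argmax(axis=1)  # greedy draft, for illustration
print(speculative_step(p_draft, p_target, draft_tokens))
```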
November 24, 2024 at 8:31 AM
Final Note: this is the worst we’ll ever be.

And it’s also not the only thing we’ve been working on. The rest of the team has been cooking on text curation since the beginning, so stay tuned for our text curation results coming soon for LLM pretraining!

13/n
November 14, 2024 at 5:30 PM
(Bonus!) Pretrain best-in-class models

While it wasn't the main target, working on data curation resulted in competitive/superior CLIP models. We’ve extensively benchmarked our models against external models, with >10x data efficiency and better performance. See the blog post for more!

12/n
November 14, 2024 at 5:30 PM
We train ViT-S/32 (63M param) on curated data and compare against baselines trained with ViT-B/32 (151M param)

Even with a ~2.4x FLOPs reduction, we attain an absolute 13% improvement for retrieval and 9.2% for classification
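Quick sanity check on the ~2.4x figure, a back-of-the-envelope estimate assuming forward FLOPs scale roughly linearly with parameter count at the same patch size and resolution (not the exact accounting behind the post):

```python
# Rough FLOPs comparison from the parameter counts quoted above.
vit_b32_params = 151e6  # CLIP ViT-B/32 baseline
vit_s32_params = 63e6   # CLIP ViT-S/32 trained on curated data

print(f"approx. FLOPs reduction: {vit_b32_params / vit_s32_params:.1f}x")  # ~2.4x
```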

11/n
November 14, 2024 at 5:30 PM
What if you could have a smaller, domain-specific model in the first place?

Specialized pretraining is the future, powered by curation at scale
The cost of training the smaller model also amortizes quickly over time (think millions of API calls/day)

10/n
November 14, 2024 at 5:30 PM
Claim #3: train models **smaller**

Productionizing models requires inference optimizations that trade quality for speed, and that approach doesn’t work well for overtrained generalist models.

9/n
November 14, 2024 at 5:30 PM
Claim #2: train models **better**

Improve model quality by up to ~13% absolute (22% relative) for the same training cost
Curation means training on in-domain data for the end tasks you care about!

8/n
November 14, 2024 at 5:30 PM
Claim #1: train models **faster**

Retrieval: 28.8x-43x training speedup vs baseline
Classification: 2.88x-13x vs baseline

We filter out redundant & harmful data to achieve equal performance much faster

7/n
November 14, 2024 at 5:30 PM
What we did:

Data: image-text data from DataComp’s CommonPool up to 1B samples

Curate: separate strategies for retrieval and classification tasks

Train: CLIP ViT-B/16 & 32, 8k batch

Evals: DataComp+SugarCrepe
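For the shape of the training objective behind these runs, here is a minimal sketch of the symmetric CLIP contrastive loss. It is illustrative PyTorch only, not the actual training stack or curation pipeline; the real runs pair ViT-B/16 & /32 image encoders with a text encoder and an 8k global batch.

```python
# Minimal sketch of the symmetric CLIP contrastive objective (illustration only).
import torch
import torch.nn.functional as F

def clip_loss(image_feats, text_feats, logit_scale):
    """image_feats, text_feats: (batch, dim) L2-normalized embeddings of paired data."""
    logits = logit_scale * image_feats @ text_feats.t()          # (batch, batch) similarities
    labels = torch.arange(logits.size(0), device=logits.device)  # matched pairs on the diagonal
    # Cross-entropy in both directions: image -> text and text -> image.
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

# Toy batch of random embeddings.
batch, dim = 8, 512
img = F.normalize(torch.randn(batch, dim), dim=-1)
txt = F.normalize(torch.randn(batch, dim), dim=-1)
print(clip_loss(img, txt, logit_scale=100.0))
```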

6/n
November 14, 2024 at 5:30 PM
What’s the product?

It’s a data pipeline composed of cutting-edge research put into production and battle-tested at scale. Shown below are a few themes the individual algorithms fall under; more details can be found in the blog post!

5/n
November 14, 2024 at 5:30 PM
Solution: @datologyai.bsky.social

It’s a pivotal time in AI to unlock tremendous societal value & train domain-specific foundation models outside of large labs

With scalable data curation, we can:
1) train models **faster**
2) train models **better**
3) train models **smaller**
4/n
November 14, 2024 at 5:30 PM
Why is Data Curation hard?

Pretraining data curation is the real secret sauce of frontier labs.

The data research that does get published either:
1) doesn’t work
2) doesn’t work at scale
3) isn’t efficient, reliable, or robust enough to productionize

3/n
November 14, 2024 at 5:30 PM