Haoli Yin
@haoliyin.bsky.social
multimodal data curation @datologyai.com. https://haoliyin.me
Ensembling logits (e.g., averaging modality embeddings) from contrastively trained models can actually achieve this in certain settings. Previous work on a specific downstream task: arxiv.org/abs/2310.18812

To truly do this in early-fusion models you'd have to capture synergy (arxiv.org/abs/2306.04539)
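A minimal sketch of the logit-ensembling idea in a CLIP-style setting, assuming two off-the-shelf open_clip checkpoints as stand-ins (the checkpoints, captions, and uniform weighting are illustrative choices, not the paper's setup): score image-text pairs with each contrastively trained model separately, then average the per-model similarity logits.

```python
# Illustrative sketch: ensemble similarity logits from two independently
# (contrastively) trained CLIP models by averaging their image-text scores.
import torch
import torch.nn.functional as F
import open_clip

def load(arch, tag):
    model, _, preprocess = open_clip.create_model_and_transforms(arch, pretrained=tag)
    return model.eval(), preprocess, open_clip.get_tokenizer(arch)

# Two stand-in checkpoints (any pair of contrastively trained models works).
members = [load("ViT-B-32", "laion2b_s34b_b79k"),
           load("ViT-B-16", "laion2b_s34b_b88k")]

@torch.no_grad()
def ensembled_logits(pil_image, captions):
    """Uniformly average cosine-similarity logits over ensemble members."""
    logits = []
    for model, preprocess, tokenize in members:
        img = F.normalize(model.encode_image(preprocess(pil_image).unsqueeze(0)), dim=-1)
        txt = F.normalize(model.encode_text(tokenize(captions)), dim=-1)
        logits.append(img @ txt.T)          # (1, num_captions) similarity scores
    return torch.stack(logits).mean(dim=0)  # averaged logits, late-fusion style
```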
UniCat: Crafting a Stronger Fusion Baseline for Multimodal Re-Identification
J Crawford, H Yin, L McDermott, D Cummings, NeurIPS 2023 UniReps Workshop, 2023 - Cited by 3
scholar.google.com
December 5, 2024 at 1:34 AM
looks like my reach on Twitter is low 😅
December 4, 2024 at 10:44 PM
Ah, so some details I left out:

- I set the first n tokens to be generated by the target model, where n=3 here.
- I'm using the Qwen2-VL family.
- The prompt is "Describe this image", so the first three tokens are always the same.

This was just to establish a baseline; next is experimenting with various tasks and configs (rough sketch of the setup below).
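For context, a minimal sketch of the "first n tokens from the target model" part, assuming the standard HF transformers + qwen_vl_utils recipe for Qwen2-VL; the model size, image path, and whatever drafting/verification happens after the prefix are placeholders, not the actual experiment code.

```python
# Sketch: force the first n tokens to come from the target model (n = 3 here).
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # official Qwen2-VL helper package

target_id = "Qwen/Qwen2-VL-7B-Instruct"  # assumed target; sizes weren't specified
processor = AutoProcessor.from_pretrained(target_id)
target = Qwen2VLForConditionalGeneration.from_pretrained(
    target_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "example.jpg"},  # placeholder image
        {"type": "text", "text": "Describe this image"},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(target.device)

# Because the prompt is fixed, these first three target-model tokens are
# identical across runs; the resulting prefix is then handed to whatever
# draft/verify configuration is being benchmarked.
prefix_ids = target.generate(**inputs, max_new_tokens=3, do_sample=False)
```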
November 24, 2024 at 8:57 PM
hopefully this side project will get to a point where there's something novel to write up 😅
November 24, 2024 at 8:31 AM
🙋🏻‍♂️
November 22, 2024 at 3:35 AM
If you've made it this far, you clearly recognize the immense potential of data curation and our team.

For researchers/engineers/anons: Excited about multimodal data? Have innovative ideas? Join us!

(also recruiting cracked interns)
jobs.ashbyhq.com/DatologyAI
14/n
November 14, 2024 at 5:30 PM
Final Note: this is the worst we’ll ever be.

And it’s also not the only thing we’ve been working on. The rest of the team has been cooking on text curation since the beginning, so stay tuned for our text curation results coming soon for LLM pretraining!

13/n
November 14, 2024 at 5:30 PM
(Bonus!) Pretrain best-in-class models

While not the primary goal, working on data curation resulted in competitive-to-superior CLIP models. We've extensively benchmarked them against external models, showing >10x data efficiency and better performance. See the blog post for more!

12/n
November 14, 2024 at 5:30 PM
We train ViT-S/32 (63M params) on curated data and compare against baselines trained with ViT-B/32 (151M params)

Even with a ~2.4x FLOPs reduction, we attain a 13% absolute improvement for retrieval and 9.2% for classification

11/n
November 14, 2024 at 5:30 PM
What if you could have a smaller, domain-specific model in the first place?

Specialized pretraining is the future, powered by curation at scale
The cost of training the smaller model also quickly amortizes over time (think millions of API calls/day)

10/n
November 14, 2024 at 5:30 PM
Claim #3: train models **smaller**

Productionizing models requires inference optimizations that trade quality for speed, and that doesn't work well for overtrained generalist models.

9/n
November 14, 2024 at 5:30 PM
Claim #2: train models **better**

Improve model quality by up to ~13% absolute (22% relative) for the same training cost
Curation means training on in-domain data for the end tasks you care about!

8/n
November 14, 2024 at 5:30 PM
Claim #1: train models **faster**

Retrieval: 28.8x-43x training speedup vs baseline
Classification: 2.88x-13x vs baseline

We filter out redundant & harmful data to achieve equal performance much faster

7/n
November 14, 2024 at 5:30 PM
What we did:

Data: image-text data from DataComp’s CommonPool up to 1B samples

Curate: separate strategies for retrieval and classification tasks

Train: CLIP ViT-B/16 & 32, 8k batch

Evals: DataComp+SugarCrepe

6/n
November 14, 2024 at 5:30 PM
What’s the product?

It’s a data pipeline composed of cutting-edge research put into production and battle-tested at scale. Shown below are a few themes that the individual algorithms fall under; more details can be found in the blog post!

5/n
November 14, 2024 at 5:30 PM
Solution: @datologyai.bsky.social

It’s a pivotal time in AI to unlock tremendous societal value & train domain-specific foundation models outside of large labs

With scalable data curation, we can:
1) train models **faster**
2) train models **better**
3) train models **smaller**

4/n
November 14, 2024 at 5:30 PM