Haoli Yin
@haoliyin.bsky.social
multimodal data curation @datologyai.com. https://haoliyin.me
Ensembling logits (e.g., averaging modality embeddings) from contrastively trained models can actually achieve this in certain settings. Previous work on a specific downstream task: arxiv.org/abs/2310.18812

To truly do this in early-fusion models you'd have to capture synergy (arxiv.org/abs/2306.04539)
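A minimal sketch of the logit-ensembling idea in a CLIP-style setting, assuming two off-the-shelf open_clip checkpoints as stand-ins (the checkpoints, captions, and uniform weighting are illustrative choices, not the paper's setup): score image-text pairs with each contrastively trained model separately, then average the per-model similarity logits.

```python
# Illustrative sketch: ensemble similarity logits from two independently
# (contrastively) trained CLIP models by averaging their image-text scores.
import torch
import torch.nn.functional as F
import open_clip

def load(arch, tag):
    model, _, preprocess = open_clip.create_model_and_transforms(arch, pretrained=tag)
    return model.eval(), preprocess, open_clip.get_tokenizer(arch)

# Two stand-in checkpoints (any pair of contrastively trained models works).
members = [load("ViT-B-32", "laion2b_s34b_b79k"),
           load("ViT-B-16", "laion2b_s34b_b88k")]

@torch.no_grad()
def ensembled_logits(pil_image, captions):
    """Uniformly average cosine-similarity logits over ensemble members."""
    logits = []
    for model, preprocess, tokenize in members:
        img = F.normalize(model.encode_image(preprocess(pil_image).unsqueeze(0)), dim=-1)
        txt = F.normalize(model.encode_text(tokenize(captions)), dim=-1)
        logits.append(img @ txt.T)          # (1, num_captions) similarity scores
    return torch.stack(logits).mean(dim=0)  # averaged logits, late-fusion style
```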
UniCat: Crafting a Stronger Fusion Baseline for Multimodal Re-Identification
J Crawford, H Yin, L McDermott, D Cummings, NeurIPS 2023 UniReps Workshop, 2023 - Cited by 3
scholar.google.com
December 5, 2024 at 1:34 AM
looks like my reach on Twitter is low 😅
December 4, 2024 at 10:44 PM
Ah, so some details I left out:

- I set the first n tokens to be generated by the target model, where n=3 here.
- I'm using the Qwen2-VL family.
- The prompt is "Describe this image", so the first three tokens are always the same.

This was just to establish a baseline; next is experimenting with various tasks and configs (rough sketch of the setup below).
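For context, a minimal sketch of the "first n tokens from the target model" part, assuming the standard HF transformers + qwen_vl_utils recipe for Qwen2-VL; the model size, image path, and whatever drafting/verification happens after the prefix are placeholders, not the actual experiment code.

```python
# Sketch: force the first n tokens to come from the target model (n = 3 here).
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # official Qwen2-VL helper package

target_id = "Qwen/Qwen2-VL-7B-Instruct"  # assumed target; sizes weren't specified
processor = AutoProcessor.from_pretrained(target_id)
target = Qwen2VLForConditionalGeneration.from_pretrained(
    target_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "example.jpg"},  # placeholder image
        {"type": "text", "text": "Describe this image"},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(target.device)

# Because the prompt is fixed, these first three target-model tokens are
# identical across runs; the resulting prefix is then handed to whatever
# draft/verify configuration is being benchmarked.
prefix_ids = target.generate(**inputs, max_new_tokens=3, do_sample=False)
```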
November 24, 2024 at 8:57 PM
hopefully this side project will get to a point where there's something novel to write up 😅
November 24, 2024 at 8:31 AM
🙋🏻‍♂️
November 22, 2024 at 3:35 AM
If you've made it this far, you clearly recognize the immense potential of data curation and our team.

For researchers/engineers/anons: Excited about multimodal data? Have innovative ideas? Join us!

(also recruiting cracked interns)
jobs.ashbyhq.com/DatologyAI
14/n
November 14, 2024 at 5:30 PM
Final Note: this is the worst we’ll ever be.

And it’s also not the only thing we’ve been working on. The rest of the team has been cooking on text curation since the beginning, so stay tuned for our text curation results coming soon for LLM pretraining!

13/n
November 14, 2024 at 5:30 PM
(Bonus!) Pretrain best-in-class models

While not the primary goal, working on data curation resulted in competitive-to-superior CLIP models. We've extensively benchmarked them against external models, showing >10x data efficiency and better performance. See the blog post for more!

12/n
November 14, 2024 at 5:30 PM
We train ViT-S/32 (63M params) on curated data and compare against baselines trained with ViT-B/32 (151M params)

Even with a ~2.4x FLOPs reduction, we attain a 13% absolute improvement for retrieval and 9.2% for classification

11/n
November 14, 2024 at 5:30 PM
What if you could have a smaller, domain-specific model in the first place?

Specialized pretraining is the future, powered by curation at scale
The cost of training the smaller model also quickly amortizes over time (think millions of API calls/day)

10/n
November 14, 2024 at 5:30 PM
Claim #3: train models **smaller**

Productionizing models requires inference optimizations that trade quality for speed, and that doesn't work well for overtrained generalist models.

9/n
November 14, 2024 at 5:30 PM
Claim #2: train models **better**

Improve model quality by up to ~13% absolute (22% relative) for the same training cost
Curation means training on in-domain data for the end tasks you care about!

8/n
November 14, 2024 at 5:30 PM
Claim #1: train models **faster**

Retrieval: 28.8x-43x training speedup vs baseline
Classification: 2.88x-13x vs baseline

We filter out redundant & harmful data to achieve equal performance much faster

7/n
November 14, 2024 at 5:30 PM
What we did:

Data: image-text data from DataComp’s CommonPool up to 1B samples

Curate: separate strategies for retrieval and classification tasks

Train: CLIP ViT-B/16 & 32, 8k batch

Evals: DataComp+SugarCrepe

6/n
November 14, 2024 at 5:30 PM
What’s the product?

It’s a data pipeline composed of cutting-edge research put into production and battle-tested at scale. Shown below are a few themes that the individual algorithms fall under; more details can be found in the blog post!

5/n
November 14, 2024 at 5:30 PM
Solution: @datologyai.bsky.social

It’s a pivotal time in AI to unlock tremendous societal value & train domain-specific foundation models outside of large labs

With scalable data curation, we can:
1) train models **faster**
2) train models **better**
3) train models **smaller**

4/n
November 14, 2024 at 5:30 PM