Haoli Yin
haoliyin.bsky.social
multimodal data curation @datologyai.com. https://haoliyin.me
For more details about where I'll be, come visit me at @datologyai.com's booth 303

Times I’ll be there (in local time):
- Tuesday Dec 10th, 12pm-4pm
- Wednesday Dec 11th, 1pm-5pm
- Thursday Dec 12th, 9am-12:30pm

#neurips
December 8, 2024 at 3:54 AM
I'll be at NeurIPS next week starting Tuesday! Please reach out if you want to talk about anything multimodal, data curation, synthetic data, or inference optimization.

I'd love to learn more about your research area as well :))
December 4, 2024 at 6:27 AM
Reposted by Haoli Yin
I'm re-sharing some recent blog posts on using VLMs for synthetic data generation since there are no link penalties here!

How to generate a dataset of queries for training and fine-tuning domain-specific ColPali models using a VLM.

🔗 danielvanstrien.xyz/posts/post-w...
Generating a dataset of queries for training and fine-tuning ColPali models on a UFO dataset – Daniel van Strien
Learn how to generate a custom ColPali dataset using an open VLM for multimodal retrieval model training and fine-tuning.
danielvanstrien.xyz
November 25, 2024 at 12:31 PM
Working on making data curation dirt cheap btw

If you're a cracked engineer we'd love to have you :))
DM me if you have any questions!

jobs.ashbyhq.com/DatologyAI

(also looking for enthusiastic research interns)
DatologyAI Jobs
jobs.ashbyhq.com
November 25, 2024 at 8:37 PM
The text team cooked so much 🧑‍🍳 it might be better than your Thanksgiving meal

Check out this super thorough thread on how we built the best curated text dataset using public data
Tired: Bringing up politics at Thanksgiving

Wired: Bringing up @datologyai.com’s new text curation results at Thanksgiving

That’s right, we applied our data curation pipeline to text pretraining data and the results are hot enough to roast a 🦃
🧵
November 25, 2024 at 8:29 PM
Was working on some inference optimization research (speculative decoding), but in the multimodal setting with vision-language models (i.e., conditioned on images)

blue = draft model tokens
red = target model tokens
yellow = bonus target model tokens

#dataviz am I doing this right?
November 24, 2024 at 8:31 AM
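For anyone unfamiliar with the color legend in the speculative decoding post above, here is a minimal sketch of one greedy speculative decoding step. It assumes the usual reading of the three labels: "draft" for proposed tokens the target model agrees with, "target" for the token the target model substitutes at the first mismatch, and "bonus" for the extra token the target model contributes when every proposal is accepted. The function names and toy models are hypothetical, and image conditioning is not modeled; in a vision-language setting it would live inside the two model callables.

```python
from typing import Callable, List, Tuple

Token = int
# A greedy "model": maps the current context to its single most likely next token.
NextToken = Callable[[List[Token]], Token]


def speculative_step(
    prefix: List[Token],
    draft_next: NextToken,
    target_next: NextToken,
    k: int,
) -> List[Tuple[Token, str]]:
    """Run one greedy speculative decoding step and label each emitted token."""
    # 1. The small draft model proposes k tokens autoregressively.
    proposed: List[Token] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target model checks each proposal in order. A real system scores
    #    all k positions in a single forward pass; this loop is sequential
    #    only to keep the sketch readable.
    emitted: List[Tuple[Token, str]] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target_next(ctx)
        if tok != expected:
            # First mismatch: keep the target model's token and stop.
            emitted.append((expected, "target"))
            return emitted
        emitted.append((tok, "draft"))
        ctx.append(tok)

    # 3. Every proposal was accepted, so the target model adds one bonus token.
    emitted.append((target_next(ctx), "bonus"))
    return emitted


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Hypothetical draft model: agrees with the target except right after a 3."""
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_step([0], toy_draft, toy_target, k=3))
    print(speculative_step([2], toy_draft, toy_target, k=3))
```

With these toy models, the first call accepts all three draft tokens and earns a bonus token, while the second accepts one draft token and then falls back to the target model's token at the mismatch.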
now using uv for any new project and trying to migrate existing projects to uv

Starting a new project:
uv init
uv venv --python 3.xx
source .venv/bin/activate
uv add (dependencies) or uv pip install -r requirements.txt

❤️
- installing torch in like 10 seconds
- uv sync for fast startup
Looks like uv is the #1 trending Rust repo over the last month 🚀🚀🚀
November 21, 2024 at 8:22 PM
Reposted by Haoli Yin
Massive, impressive post on data curation strategies for producing better models with less data and compute. The best part of data curation is that it's a (relatively small) one time cost that gets amortized over all future models.

Link to the technical write-up: www.datologyai.com/post/product...
November 14, 2024 at 7:16 PM
Web-Scale Data Curation is a frontier challenge - I'm excited to show the progress we've made in just 6 months @datologyai

tl;dr: we've pretrained the most data-efficient and best-in-class CLIP models!

Read on to see how our product powers multimodal data curation
1/n 🧵
🧵 We’ve spent the last few months at @datologyai.bsky.social building a state-of-the-art data curation pipeline and I’m SO excited to share our first results: we curated image-text pretraining data and massively improved CLIP model quality, training speed, and inference efficiency 🔥🔥🔥
November 14, 2024 at 5:30 PM
Pouring one out for my bluesky x data gang 🍻 Get ready to see the culmination of the web-scale multimodal data curation work we've been cooking up at DatologyAI!

I'm beyond excited to share soon :))

if there's enough interest, might drop something earlier 👀
November 11, 2024 at 5:44 AM
hello world! I was convinced by @codestar.bsky.social, so let's see how much signal is here
October 29, 2024 at 5:41 AM