We've been building a state-of-the-art data curation pipeline, and I’m SO excited to share our first results: we curated image-text pretraining data and massively improved CLIP model quality, training speed, and inference efficiency 🔥🔥🔥
tl;dr: we've pretrained the most data-efficient and best-in-class CLIP models!
Read on to see how our product powers multimodal data curation
1/n 🧵
@datologyai.com's booth 303
Times I’ll be there (in local time):
- Tuesday Dec 10th, 12pm-4pm
- Wednesday Dec 11th, 1pm-5pm
- Thursday Dec 12th, 9am-12:30pm
#neurips
I'd love to learn more about your research area as well :))
How to generate a dataset of queries for training and fine-tuning domain-specific ColPali models using a VLM.
🔗 danielvanstrien.xyz/posts/post-w...
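The linked post walks through the full pipeline; the core idea is to show a VLM each document page and ask it to write the search queries a user might type to find that page, then pair those queries with the page as ColPali training data. A minimal sketch of that step, assuming an OpenAI-compatible VLM endpoint (the model name and the queries_for_page helper are placeholders, not from the post):

```python
# Sketch: ask a VLM for retrieval queries that a given document page would answer.
import base64
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible VLM endpoint is configured

def queries_for_page(image_path: str, n_queries: int = 3) -> list[str]:
    """Return n_queries short search queries the VLM thinks this page answers."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; swap in whichever VLM you use
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f"Write {n_queries} short search queries, one per line, "
                                "that this document page would answer.",
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                    },
                ],
            }
        ],
    )
    return [q.strip() for q in response.choices[0].message.content.splitlines() if q.strip()]
```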
If you're a cracked engineer, we'd love to have you :))
DM me if you have any questions!
jobs.ashbyhq.com/DatologyAI
(also looking for enthusiastic research interns)
Check out this super thorough thread on what we achieved and how we built the best curated text dataset using public data
Wired: Bringing up @datologyai.com’s new text curation results at Thanksgiving
That’s right, we applied our data curation pipeline to text pretraining data and the results are hot enough to roast a 🦃
🧵
blue = draft model tokens
red = target model tokens
yellow = bonus target model tokens
#dataviz am I doing this right?
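For context, the colors track where each token comes from in speculative decoding: blue tokens were proposed by the draft model and accepted, red tokens are where the target model overrode the draft, and yellow is the bonus token the target model emits for free at the end of each verification step. A toy sketch of how such a coloring could be printed with ANSI codes (the example tokens below are made up for illustration):

```python
# Toy illustration: print each generated token in a color reflecting its origin
# during speculative decoding.
ANSI = {
    "draft": "\033[34m",   # blue   = draft-model token accepted by the target model
    "target": "\033[31m",  # red    = token where the target model rejected the draft
    "bonus": "\033[33m",   # yellow = bonus token the target model adds after verification
}
RESET = "\033[0m"

def render(tokens: list[tuple[str, str]]) -> str:
    """tokens is a list of (text, origin) pairs with origin in {'draft', 'target', 'bonus'}."""
    return "".join(f"{ANSI[origin]}{text}{RESET}" for text, origin in tokens)

# Hypothetical sequence, just to show the three cases:
print(render([("The ", "draft"), ("cat ", "draft"), ("sat", "target"), (" down", "bonus")]))
```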
Starting a new project:
uv init
uv venv --python 3.xx
source .venv/bin/activate
uv add <dependencies> or uv pip install -r requirements.txt
❤️
- installing torch in like 10 seconds
- uv sync for fast startup
Link to the technical write-up: www.datologyai.com/post/product...
I'm beyond excited to share soon :))
if there's enough interest, might drop something earlier 👀