www.datologyai.com
Intervening only on training data, our pipeline can train models faster (7.7x less compute), better (+8.5% performance), and smaller (models half the size outperform by >5%)!
www.datologyai.com/post/technic...
Check out this super thorough thread on how we built the best curated text dataset from public data
Wired: Bringing up @datologyai.com’s new text curation results at Thanksgiving
That’s right, we applied our data curation pipeline to text pretraining data and the results are hot enough to roast a 🦃
🧵
If you're a cracked engineer, we'd love to have you :))
DM me if you have any questions!
jobs.ashbyhq.com/DatologyAI
(also looking for enthusiastic research interns)
Techvember Ep 2: How we made the #1 LLM Pre-training Data Recipe.
Blog: 👉 tinyurl.com/best-llm-data 🧵
📈 Train Better - Improve performance by 8.5% over exact-deduplicated RPJv1, 6.1% over FineWeb-Edu, and 4.4% over DCLM
🔍 Train Smaller - Train a model that's 2.1x smaller while simultaneously improving performance by >5%
Link to the technical write-up: www.datologyai.com/post/product...
So happy to be able to share this work with the world! And now it’s time for a little vacation. 😅
We've been building a state-of-the-art data curation pipeline and I'm SO excited to share our first results: we curated image-text pretraining data and massively improved CLIP model quality, training speed, and inference efficiency 🔥🔥🔥
Today, we @datologyai.bsky.social are so excited to release our first results, demonstrating *massive* gains in training efficiency, performance, and inference efficiency with better data.
www.datologyai.com/post/datolog...