Ari Morcos
@arimorcos.bsky.social
CEO and Co-founder @ DatologyAI working to make it easy for anyone to make the most of their data. Former: RS FAIR, RS DeepMind, Harvard Neuroscience PhD.

www.datologyai.com
ICYMI, check out our latest results @datologyai.com on curating data for LLMs.

Intervening only on training data, our pipeline can train models faster (7.7x less compute), better (+8.5% performance), and smaller (models half the size outperform by >5%)!

www.datologyai.com/post/technic...
Technical Deep-Dive: Curating Our Way to a State-of-the-Art Text Dataset
Our data curation pipeline to obtain substantial improvements in LLM quality, training speed, and inference efficiency.
November 29, 2024 at 4:36 PM
Reposted by Ari Morcos
The text team cooked so much 🧑‍🍳 it might be better than your Thanksgiving meal

Check out this super thorough thread on what we achieved, and how we built the best curated text dataset from public data
Tired: Bringing up politics at Thanksgiving

Wired: Bringing up @datologyai.com’s new text curation results at Thanksgiving

That’s right, we applied our data curation pipeline to text pretraining data and the results are hot enough to roast a 🦃
🧵
November 25, 2024 at 8:29 PM
Reposted by Ari Morcos
Working on making data curation dirt cheap btw

If you're a cracked engineer we'd love to have you :))
DM me if you have any questions!

jobs.ashbyhq.com/DatologyAI

(also looking for enthusiastic research interns)
DatologyAI Jobs
jobs.ashbyhq.com
November 25, 2024 at 8:37 PM
Reposted by Ari Morcos
1/5 Earlier this year, I joined @datologyai.com to give wings to the data research I had been doing in academia. Today, I am absolutely thrilled to share what we’ve been working on!

Techvember Ep 2: How we made the #1 LLM Pre-training Data Recipe.

Blog: 👉 tinyurl.com/best-llm-data 🧵
November 25, 2024 at 6:43 PM
🚀 Train Faster - Reach the same performance with 7.7x less compute

📈 Train Better - Improve performance by 8.5% over exact-deduplicated RPJv1, 6.1% over FineWeb-Edu, and 4.4% over DCLM

🔍 Train Smaller - Train a model that's 2.1x smaller while simultaneously improving performance by >5%
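(For readers unfamiliar with the "exact-deduplicated" baseline above: exact deduplication drops documents whose text is byte-for-byte identical after light normalization, typically by hashing. A minimal illustrative sketch, not DatologyAI's actual pipeline:)

```python
import hashlib

def exact_dedup(docs):
    """Keep only the first occurrence of each exact-duplicate document.

    Documents are compared by the SHA-256 hash of their normalized text
    (whitespace-stripped, lowercased). Normalization choices are an
    assumption for illustration; real pipelines vary.
    """
    seen = set()
    kept = []
    for doc in docs:
        key = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

corpus = ["A sample doc.", "a sample doc. ", "Another doc."]
print(exact_dedup(corpus))  # the first two normalize identically, so one is dropped
```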
November 25, 2024 at 5:56 PM
Reposted by Ari Morcos
Massive, impressive post on data curation strategies for producing better models with less data and compute. The best part of data curation is that it's a (relatively small) one-time cost that gets amortized over all future models.

Link to the technical write-up: www.datologyai.com/post/product...
November 14, 2024 at 7:16 PM
Reposted by Ari Morcos
This is the most interesting and most impactful data pipeline problem I have ever worked on (and if you know me, you know that’s saying something.)

So happy to be able to share this work with the world! And now it’s time for a little vacation. 😅
🧵We’ve spent the last few months at @datologyai.bsky.social
building a state-of-the-art data curation pipeline and I’m SO excited to share our first results: we curated image-text pretraining data and massively improved CLIP model quality, training speed, and inference efficiency 🔥🔥🔥
November 14, 2024 at 7:21 PM
Hello bluesky! First post, first results drop!

Today, we @datologyai.bsky.social are so excited to release our first results, demonstrating *massive* gains in training efficiency, performance, and inference efficiency with better data.

www.datologyai.com/post/datolog...
November 14, 2024 at 7:37 PM