Aldo Gael Carranza
agcrnz.bsky.social
I am excited about the release of our results on web-scale text data curation @datologyai.com. Our curation pipeline transforms the RedPajama V1 dataset into the DAIT dataset, which outperforms the best publicly available pretraining datasets for training LLMs better, faster, and smaller.
Tired: Bringing up politics at Thanksgiving

Wired: Bringing up @datologyai.com’s new text curation results at Thanksgiving

That’s right, we applied our data curation pipeline to text pretraining data and the results are hot enough to roast a 🦃
🧵
November 25, 2024 at 7:46 PM
Hello, bluesky! Testing
November 25, 2024 at 7:43 PM