Bogdan Gaza
b0gdang.bsky.social
co-founder & CTO @datologyai.com
Reposted by Bogdan Gaza
Quit freaking out. Remember that in 10,000 B.C., when America had ZERO international trade, a family could afford a house like this on a single income.
April 4, 2025 at 9:25 PM
Reposted by Bogdan Gaza
Me showing Claude what I've been working on
March 20, 2025 at 7:02 PM
Reposted by Bogdan Gaza
i am sick of “more monkeys jumping on the bed” discourse. it’s as though these people have no memory of 2017 when one fell off and bumped his head. doctor spoke out against it, mama endorsed doctor’s findings. i’m limiting replies to followers because i do not have the energy for YMMJOTBers today
March 8, 2025 at 8:56 PM
Reposted by Bogdan Gaza
Buckle up because we're banging into the new year with my annual retrospective of the last year in databases! Highlights include license change blowback, Databricks vs. Snowflake gangwar, @duckdb.org's shotgun weddings, and buying a quarterback to impress your lover: www.cs.cmu.edu/~pavlo/blog/...
Databases in 2024: A Year in Review
Andy rises from the ashes of his dead startup and discusses what happened in 2024 in the database game.
January 1, 2025 at 2:02 PM
The new family Christmas Eve tradition: watching Verandah Santa and The Sign episodes from Bluey!
December 25, 2024 at 3:47 AM
See you at AWS re:Invent next week! If you're in Vegas, happy to catch up on anything data-curation related!
December 1, 2024 at 5:08 PM
Reposted by Bogdan Gaza
Words have no meaning anymore.
quite the claim from Microsoft here 🤔
November 26, 2024 at 3:44 PM
Reposted by Bogdan Gaza
I am excited about the release of our results on web-scale text data curation @datologyai.com. Our curation pipeline transforms the RedPajama V1 dataset into the DAIT dataset which outperforms the best publicly-available pretraining datasets for training LLMs better, faster, smaller.
Tired: Bringing up politics at Thanksgiving

Wired: Bringing up @datologyai.com’s new text curation results at Thanksgiving

That’s right, we applied our data curation pipeline to text pretraining data and the results are hot enough to roast a 🦃
🧵
November 25, 2024 at 7:46 PM
Reposted by Bogdan Gaza
Tired: Bringing up politics at Thanksgiving

Wired: Bringing up @datologyai.com’s new text curation results at Thanksgiving

That’s right, we applied our data curation pipeline to text pretraining data and the results are hot enough to roast a 🦃
🧵
November 25, 2024 at 5:49 PM
Reposted by Bogdan Gaza
If you're interested in Data-Centric AI, follow The DatologyAI Starter Pack for damn-good data memes and occasional data curation insights: go.bsky.app/NJ9sTot
November 22, 2024 at 8:04 PM
Reposted by Bogdan Gaza
Amazon S3 just grew "append"! It's only available for the more expensive, lower latency S3 Express One Zone bucket class but you can now append data to an object up to 10,000 times - previously you could only atomically replace a whole object with an updated version simonwillison.net/2024/Nov/22/...
Amazon S3 Express One Zone now supports the ability to append data to an object
This is a first for Amazon S3: it is now possible to append data to an existing object in a bucket, where previously the only supported operation was to atomically …
November 22, 2024 at 4:47 AM
Reposted by Bogdan Gaza
This is the most interesting and most impactful data pipeline problem I have ever worked on (and if you know me, you know that’s saying something.)

So happy to be able to share this work with the world! And now it’s time for a little vacation. 😅
🧵We’ve spent the last few months at @datologyai.bsky.social building a state-of-the-art data curation pipeline and I’m SO excited to share our first results: we curated image-text pretraining data and massively improved CLIP model quality, training speed, and inference efficiency 🔥🔥🔥
November 14, 2024 at 7:21 PM
Reposted by Bogdan Gaza
🧵We’ve spent the last few months at @datologyai.bsky.social building a state-of-the-art data curation pipeline and I’m SO excited to share our first results: we curated image-text pretraining data and massively improved CLIP model quality, training speed, and inference efficiency 🔥🔥🔥
November 14, 2024 at 5:16 PM
Reposted by Bogdan Gaza
Web-Scale Data Curation is a frontier challenge - I'm excited to show the progress we've made in just 6 months @datologyai

tl;dr: we've pretrained the most data-efficient and best-in-class CLIP models!

Read on to see how our product powers multimodal data curation
1/n 🧵
🧵We’ve spent the last few months at @datologyai.bsky.social building a state-of-the-art data curation pipeline and I’m SO excited to share our first results: we curated image-text pretraining data and massively improved CLIP model quality, training speed, and inference efficiency 🔥🔥🔥
November 14, 2024 at 5:30 PM