Mohammed Hamdy
mmhamdy.bsky.social
A curious explorer of human and machine learning 🧐 🤝🤖
📅 Event on Mozilla AI discord: discord.gg/QTCRfefF?eve...

📄 ProGen paper: www.biorxiv.org/content/10.1...
January 27, 2025 at 12:29 PM
And lastly, big thanks to you for making it this far 🤗 Don’t forget to read the paper!

www.dataprovenance.org/Multimodal_D...

11/n
December 19, 2024 at 4:34 PM
Big thanks to Melissa Heikkilä for featuring our work in MIT Tech Review.

www.technologyreview.com/2024/12/18/1...
This is where the data to build AI comes from
New findings show how the sources of data are concentrating power in the hands of the most powerful tech companies.
www.technologyreview.com
December 19, 2024 at 4:34 PM
Xuhui Zhou, Caiming Xiong, Luis Villa,
@stellaathena.bsky.social, Alex Pentland,
@sarahooker.bsky.social, Jad Kabbara

9/n
December 19, 2024 at 4:34 PM
An Dinh, Shrestha Mohanty, Deividas Mataciunas,
Tobin South, Jianguo Zhang,
@arielnlee.bsky.social, Campbell S. Lund, Christopher Klamm, Damien Sileo, Diganta Misra, Enrico Shippole, Kevin Klyman, Lester JV Miranda, Niklas Muennighoff, Seonghyeon Ye, Seungone Kim, Vipul Gupta, Vivek Sharma

8/n
December 19, 2024 at 4:34 PM
🎉 Big thanks to all the contributors to this huge and magnificent effort. I'm truly honored to have had the chance to work alongside all of you: Manan Dey, Nayan Saxena,
Ahmad Mustafa Anis, Emad A. Alghamdi, Vu Minh Chien, Naana Obeng-Marnu, Da Yin, Kun Qian, Yizhi Li, Minnie Liang

7/n
December 19, 2024 at 4:34 PM
This work was supported by the Mozilla Foundation Data Futures Lab, and was led by: @shaynelongpre.bsky.social, Nikhil Singh, Manuel Cherep, Kushagra Tiwary, Joanna Materzynska,
William Brannon, and Robert Mahari

6/n
December 19, 2024 at 4:34 PM
4️⃣ Linguistic representation has not improved by most measures: Gini Coefficients for text and speech datasets show significant concentration, indicating limited progress in diversifying data sources.

5/n
December 19, 2024 at 4:34 PM
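For readers unfamiliar with the Gini coefficient mentioned in 5/n: it runs from 0 (every language has an equal share of the data) to nearly 1 (almost all data sits in one language). Below is a minimal sketch of the standard formula, not the paper's actual code. The function name `gini` and the per-language hours are hypothetical, chosen only for illustration.

```python
def gini(values):
    """Gini coefficient of non-negative amounts (e.g. hours of speech per language).

    Illustrative sketch of the standard sorted-rank formula; the paper's own
    computation may differ in detail.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0  # no data: treat as no measurable concentration
    # G = 2 * sum(i * x_i) / (n * total) - (n + 1) / n, with ranks i = 1..n
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical hours of speech data for five languages (made-up numbers).
hours_per_language = [900, 50, 30, 15, 5]
print(f"Gini: {gini(hours_per_language):.2f}")
```

A value close to 1 here would mean one language dominates the dataset, which is the kind of concentration the finding describes.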
3️⃣ Geographical representation has not improved for a decade: Datasets from African and South American organizations account for less than 0.2% of content across all modalities, while North American or European organizations account for 93% of text tokens and over 60% of speech and video hours.

4/n
December 19, 2024 at 4:34 PM
2️⃣ Inconsistent dataset licenses: While ~30% of datasets have permissive licenses, 78%+ of their sources carry hidden anti-crawling or licensing restrictions, making compliance a minefield.

3/n
December 19, 2024 at 4:34 PM
📌 Key Findings

1️⃣ The web is still the primary source: The internet, social media platforms, and synthetically generated data are increasingly the predominant sources of multimodal data, rather than curated sources.

2/n
December 19, 2024 at 4:34 PM
EPIC! 🤗
December 11, 2024 at 12:02 PM