Lightnews — Scholar-powered news

An Dinh, Shrestha Mohanty, Deividas Mataciunas,
Tobin South, Jianguo Zhang,
@arielnlee.bsky.social, Campbell S. Lund, Christopher Klamm, Damien Sileo, Diganta Misra, Enrico Shippole, Kevin Klyman, Lester JV Miranda, Niklas Muennighoff, Seonghyeon Ye, Seungone Kim, Vipul Gupta, Vivek Sharma

8/n

December 19, 2024 at 4:34 PM

Mohammed Hamdy

@mmhamdy.bsky.social

🎉 big thanks to all the contributors to this huge and magnificent effort. I'm truly honored for the chance to work alongside all of you: Manan Dey, Nayan Saxena,
Ahmad Mustafa Anis, Emad A. Alghamdi, Vu Minh Chien, Naana Obeng-Marnu, Da Yin, Kun Qian, Yizhi Li, Minnie Liang

7/n

December 19, 2024 at 4:34 PM

Mohammed Hamdy

@mmhamdy.bsky.social

This work was supported by the Mozilla Foundation Data Futures Lab, and was led by: @shaynelongpre.bsky.social, Nikhil Singh, Manuel Cherep, Kushagra Tiwary, Joanna Materzynska,
William Brannon, and Robert Mahari

6/n

December 19, 2024 at 4:34 PM

Mohammed Hamdy

@mmhamdy.bsky.social

4️⃣ Linguistic representation has not improved by most measures: Gini Coefficients for text and speech datasets show significant concentration, indicating limited progress in diversifying data sources.

5/n

December 19, 2024 at 4:34 PM

Mohammed Hamdy

@mmhamdy.bsky.social

3️⃣ Geographical representation has not improved for a decade: Datasets from African and South American organizations account for < 0.2% of all modality content, while North American or European organizations span 93% of text tokens and 60%+ hours of speech and video.

4/n

December 19, 2024 at 4:34 PM

Mohammed Hamdy

@mmhamdy.bsky.social

2️⃣ Inconsistent dataset licenses: While ~30% of datasets have permissive licenses, 78%+ of their sources carry hidden anti-crawling or licensing restrictions, making compliance a minefield.

3/n

December 19, 2024 at 4:34 PM

Mohammed Hamdy

@mmhamdy.bsky.social

📌 Key Findings

1️⃣ The web is still the primary source: The internet, social media platforms, and synthetically generated data are increasingly becoming the predominant sources for multimodal data, compared to curated sources.

2/n

December 19, 2024 at 4:34 PM

Mohammed Hamdy

@mmhamdy.bsky.social

EPIC! 🤗

December 11, 2024 at 12:02 PM
