sebnagel.bsky.social
@sebnagel.bsky.social
Reposted
Common Crawl’s Web Languages initiative has had many contributions since its introduction. We’re calling for native speakers of certain languages to review language contributions, to ensure that links we’re adding to our seed crawl are of good quality.

commoncrawl.org/blog/web-lan...
Common Crawl - Blog - Web Languages Needing Review by Native Speakers
commoncrawl.org
October 2, 2025 at 10:06 PM
Reposted
"MOIC will also partner with Common Crawl, one of the largest free and open repositories of web crawled data. MOIC will fund work at Common Crawl, leveraging native speakers to annotate and seed European language data in the publicly available Common Crawl data set."
Unlocking data to advance European commerce and culture - Microsoft On the Issues
Microsoft launches 2 initiatives to open Europe’s languages and culture, building on AI, cloud, and digital sovereignty commitments.
blogs.microsoft.com
July 21, 2025 at 5:45 PM
Reposted
The Common Crawl Foundation, MLCommons, EleutherAI, and Johns Hopkins' Center for Language and Speech Processing have the pleasure of inviting you to register for the 1st shared task on Language Identification for web data.

commoncrawl.org/blog/wmdqs-s...
Common Crawl - Blog - WMDQS Shared Task on Language Identification
commoncrawl.org
July 21, 2025 at 10:34 PM
Reposted
Apache Nutch 1.21 is now available for download: buff.ly/juTZlwE

Nutch is a mature, production-ready #web crawler. #opensource
July 31, 2025 at 8:45 AM
Reposted
🚨🚀 Looking for a comparative dataset on social media platforms? We’re excited to launch COMPARE! This is a collaborative effort by @nilsweidmann.bsky.social, @friederikeq.bsky.social, @sebnagel.bsky.social, @yannistheocharis.bsky.social & Molly Roberts. 🧵⤵️ (1/5)
May 28, 2025 at 8:39 AM
Reposted
Call for papers!
We are organising the 1st Workshop on Multilingual Data Quality Signals with @mlcommons.org and @eleutherai.bsky.social, held in tandem with @colmweb.org. Submit your research on multilingual data quality!

Submission deadline is 23 June, more info: wmdqs.org
1st Workshop on Multilingual Data Quality Signals
wmdqs.org
May 29, 2025 at 5:18 PM
Reposted
The #UniKonstanz Cluster of Excellence "The Politics of Inequality" @excinequality.bsky.social will continue to receive funding in the context of the German #ExcellenceStrategy @dfg.de @wissenschaftsrat.de. Details: t1p.de/ouhyj
May 22, 2025 at 3:09 PM