Lightnews — Scholar-powered news

Reposted by Common Crawl Foundation

eleutherai.bsky.social

@eleutherai.bsky.social

Announcing our latest paper: CommonLID

In collaboration with @commoncrawl.bsky.social @mlcommons.org @jhu.edu we built a LID benchmark on actual Common Crawl text covering 109 languages. Existing evaluations overestimate how well LangID works on web data.

arxiv.org/abs/2601.18026

CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data

Language identification (LID) is a fundamental step in curating multilingual corpora. However, LID models still perform poorly for many languages, especially on the noisy and heterogeneous web data of...

February 13, 2026 at 7:27 PM

Common Crawl Foundation

@commoncrawl.bsky.social

Language identification still proves to be a challenging task, especially for web data. In collaboration with @mlcommons.org @eleutherai.bsky.social @jhu.edu and 97 community members, we created CommonLID, a new benchmark for LangID for 100+ languages!

Examples of mislabeled web text by existing LangID systems. A full text version is available on the blog post below.

February 10, 2026 at 8:45 PM

Common Crawl Foundation

@commoncrawl.bsky.social

The latest Web Graphs from the November and December 2025 and January 2026 crawls are now available, comprising 279.4 million host-level nodes with 13.4 billion edges, and 122.3 million domain-level nodes with 6.1 billion edges.

www.commoncrawl.org/blog/host--a...

Common Crawl - Blog - Host- and Domain-Level Web Graphs November/December 2025 and January 2026

February 2, 2026 at 6:01 PM

Common Crawl Foundation

@commoncrawl.bsky.social

We are pleased to announce the release of the January 2026 crawl archive, containing 2.3 billion web pages, or 398 TiB of uncompressed content.

www.commoncrawl.org/blog/january...

Common Crawl - Blog - January 2026 Crawl Archive Now Available

February 2, 2026 at 6:01 PM

Common Crawl Foundation

@commoncrawl.bsky.social

Recently, a two-day Bristol datathon used Common Crawl web archives to analyse UK industries and policy, strengthening social science research through hands-on, team-based work.

www.commoncrawl.org/blog/web-arc...

Common Crawl - Blog - Web Archives for Social Sciences Datathon, Bristol

February 2, 2026 at 6:00 PM

Common Crawl Foundation

@commoncrawl.bsky.social

As SEOs grapple with the shift from traditional Search Engine Optimization to AI visibility, they're discovering a resource that's been powering AI training for years: Common Crawl's Web Graph.

commoncrawl.org/blog/how-seo...

Common Crawl - Blog - How SEOs Are Using Common Crawl's Web Graph Data for AI Ranking Signals

January 21, 2026 at 1:18 AM

Common Crawl Foundation

@commoncrawl.bsky.social

GneissWeb Annotations Examples

A new Common Crawl index annotation has been added to Hugging Face and our S3 bucket.

commoncrawl.org/blog/gneissw...

Common Crawl - Blog - GneissWeb Annotations Examples

January 16, 2026 at 1:27 PM

Common Crawl Foundation

@commoncrawl.bsky.social

From the 6th to the 10th of November 2025, Pedro Ortiz Suarez attended Mozfest in Barcelona, as well as some satellite events.

www.commoncrawl.org/blog/common-...

Common Crawl - Blog - Common Crawl at the Mozilla Festival 2025

January 8, 2026 at 1:49 PM

Common Crawl Foundation

@commoncrawl.bsky.social

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of October, November, and December 2025.

commoncrawl.org/blog/host--a...

Common Crawl - Blog - Host- and Domain-Level Web Graphs October, November, December 2025

January 2, 2026 at 12:17 AM

Common Crawl Foundation

@commoncrawl.bsky.social

The crawl archive for December 2025 is now available, consisting of 2.16 billion web pages (or 364 TiB of uncompressed content).

commoncrawl.org/blog/decembe...

Common Crawl - Blog - December 2025 Crawl Archive Now Available

January 2, 2026 at 12:17 AM

Common Crawl Foundation

@commoncrawl.bsky.social

As another year here at Common Crawl comes to a close, we present a dozen papers from 2025 that demonstrate the range of topics and areas of study for which Common Crawl’s datasets are used and referenced.

commoncrawl.org/blog/a-sampl...

Common Crawl - Blog - A Sampling of 2025 Research Referencing Common Crawl

December 18, 2025 at 5:26 PM

Reposted by Common Crawl Foundation

Jean Golding Institute

@jgibristol.bsky.social

A huge thank you to @very-laurie.bsky.social for delivering a fantastic UoB Turing seminar. Her talk was entitled “Common Crawl: open web data for everybody.”

In this talk, she introduced @commoncrawl.bsky.social and the data products it offers.

Laurie Burchell at a lectern presenting her Turing Seminar talk

Laurie Burchell at a lectern, with a blackboard behind her, presenting her Turing Seminar talk

November 27, 2025 at 1:06 PM

Common Crawl Foundation

@commoncrawl.bsky.social

We are pleased to announce the release of the web graphs based on the crawls of September, October, and November of 2025, consisting of 235.7 million nodes and 9.5 billion edges at the host level, and 100.7 million nodes and 6.6 billion edges at the domain level.

commoncrawl.org/blog/host--a...

Common Crawl - Blog - Host- and Domain-Level Web Graphs September, October, and November 2025

November 24, 2025 at 5:46 PM

Common Crawl Foundation

@commoncrawl.bsky.social

We are pleased to announce that the crawl archive for November 2025 is now available, containing 2.29 billion web pages or 378 TiB of uncompressed content.

commoncrawl.org/blog/novembe...

Common Crawl - Blog - November 2025 Crawl Archive Now Available

November 24, 2025 at 3:49 PM

Common Crawl Foundation

@commoncrawl.bsky.social

Common Crawl celebrates World Digital Preservation Day on November 6, which invites the community to unite in answering a powerful question: Why Preserve?

commoncrawl.org/blog/common-...

Banner for the World Digital Preservation Day, 6th of November 2025

November 6, 2025 at 2:56 PM

Common Crawl Foundation

@commoncrawl.bsky.social

Setting the Record Straight

A recent article in The Atlantic makes several false and misleading claims about the Common Crawl Foundation, including the accusation that our organization has “lied to publishers” about our activities.

commoncrawl.org/blog/setting...

Common Crawl - Blog - Setting the Record Straight: Common Crawl’s Commitment to Transparency, Fair Use, and the Public Good

November 4, 2025 at 10:38 PM

Common Crawl Foundation

@commoncrawl.bsky.social

Check out our newsletter for October/November 2025, with updates on what we've been up to.

commoncrawl.org/blog/october...

Common Crawl - Blog - October/November 2025 Newsletter

November 4, 2025 at 10:37 PM

Common Crawl Foundation

@commoncrawl.bsky.social

The Common Crawl team presented a seminar at Stanford HAI entitled “Preserving Humanity's Knowledge and Making it Accessible: Addressing Challenges of Public Web Data”.

commoncrawl.org/blog/common-...

Common Crawl - Blog - Common Crawl Foundation at Stanford HAI

October 29, 2025 at 6:46 PM

Common Crawl Foundation

@commoncrawl.bsky.social

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of August, September, and October 2025, consisting of 468.4 million nodes and 8.0 billion edges at the host level, and 97.7 million nodes and 6.0 billion edges at the domain level.

Common Crawl - Blog - Host- and Domain-Level Web Graphs August, September, and October 2025

October 29, 2025 at 6:45 PM

Common Crawl Foundation

@commoncrawl.bsky.social

We are pleased to announce the release of the October 2025 crawl, containing 2.61 billion web pages or 468 TiB of uncompressed content.

commoncrawl.org/blog/october...

Common Crawl - Blog - October 2025 Crawl Archive Now Available

October 29, 2025 at 6:45 PM

Common Crawl Foundation

@commoncrawl.bsky.social

The Common Crawl team attended the 2nd Conference on Language Modeling in Montréal, organizing a workshop, giving invited talks, and strengthening links with the research community.

commoncrawl.org/blog/common-...

Common Crawl - Blog - Common Crawl Foundation at COLM 2025

October 21, 2025 at 3:51 PM