Common Crawl Foundation
@commoncrawl.bsky.social
Common Crawl is a non-profit foundation dedicated to the Open Web.
Reposted by Common Crawl Foundation
Announcing our latest paper: CommonLID

In collaboration with @commoncrawl.bsky.social @mlcommons.org @jhu.edu we built a LID benchmark on actual Common Crawl text covering 109 languages. Existing evaluations overestimate how well LangID works on web data.

arxiv.org/abs/2601.18026
CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data
Language identification (LID) is a fundamental step in curating multilingual corpora. However, LID models still perform poorly for many languages, especially on the noisy and heterogeneous web data of...
February 13, 2026 at 7:27 PM
Language identification still proves to be a challenging task, especially for web data. In collaboration with @mlcommons.org @eleutherai.bsky.social @jhu.edu and 97 community members, we created CommonLID, a new benchmark for LangID for 100+ languages!
February 10, 2026 at 8:45 PM
The latest Web Graphs from the November and December 2025 and January 2026 crawls are now available, comprising 279.4 million host-level nodes with 13.4 billion edges, and 122.3 million domain-level nodes with 6.1 billion edges.

www.commoncrawl.org/blog/host--a...
Common Crawl - Blog - Host- and Domain-Level Web Graphs November/December 2025 and January 2026
February 2, 2026 at 6:01 PM
We are pleased to announce the release of the January 2026 crawl archive, containing 2.3 billion web pages, or 398 TiB of uncompressed content.

www.commoncrawl.org/blog/january...
Common Crawl - Blog - January 2026 Crawl Archive Now Available
February 2, 2026 at 6:01 PM
Recently, a two-day Bristol datathon used Common Crawl web archives to analyse UK industries and policy, strengthening social science research through hands-on, team-based work.

www.commoncrawl.org/blog/web-arc...
Common Crawl - Blog - Web Archives for Social Sciences Datathon, Bristol
February 2, 2026 at 6:00 PM
As SEOs grapple with the shift from traditional Search Engine Optimization to AI visibility, they're discovering a resource that's been powering AI training for years: Common Crawl's Web Graph.

commoncrawl.org/blog/how-seo...
Common Crawl - Blog - How SEOs Are Using Common Crawl's Web Graph Data for AI Ranking Signals
January 21, 2026 at 1:18 AM
GneissWeb Annotations Examples

A new Common Crawl index annotation has been added to Hugging Face and our S3 bucket.

commoncrawl.org/blog/gneissw...
Common Crawl - Blog - GneissWeb Annotations Examples
January 16, 2026 at 1:27 PM
From the 6th to the 10th of November 2025, Pedro Ortiz Suarez attended Mozfest in Barcelona, as well as some satellite events.

www.commoncrawl.org/blog/common-...
Common Crawl - Blog - Common Crawl at the Mozilla Festival 2025
January 8, 2026 at 1:49 PM
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of October, November, and December 2025.

commoncrawl.org/blog/host--a...
Common Crawl - Blog - Host- and Domain-Level Web Graphs October, November, December 2025
January 2, 2026 at 12:17 AM
The crawl archive for December 2025 is now available, consisting of 2.16 billion web pages (or 364 TiB of uncompressed content).

commoncrawl.org/blog/decembe...
Common Crawl - Blog - December 2025 Crawl Archive Now Available
January 2, 2026 at 12:17 AM
As another year here at Common Crawl comes to a close, we present a dozen papers from 2025 that demonstrate the range of topics and areas of study for which Common Crawl’s datasets are used and referenced.

commoncrawl.org/blog/a-sampl...
Common Crawl - Blog - A Sampling of 2025 Research Referencing Common Crawl
December 18, 2025 at 5:26 PM
Reposted by Common Crawl Foundation
A huge thank you to @very-laurie.bsky.social for delivering a fantastic UoB Turing seminar. Her talk was entitled “Common Crawl: open web data for everybody.”

In this talk, she introduced @commoncrawl.bsky.social and the data products it offers.
November 27, 2025 at 1:06 PM
We are pleased to announce the release of the web graphs based on the crawls of September, October, and November of 2025, consisting of 235.7 million nodes and 9.5 billion edges at the host level, and 100.7 million nodes and 6.6 billion edges at the domain level.

commoncrawl.org/blog/host--a...
Common Crawl - Blog - Host- and Domain-Level Web Graphs September, October, and November 2025
November 24, 2025 at 5:46 PM
We are pleased to announce that the crawl archive for November 2025 is now available, containing 2.29 billion web pages or 378 TiB of uncompressed content.

commoncrawl.org/blog/novembe...
Common Crawl - Blog - November 2025 Crawl Archive Now Available
November 24, 2025 at 3:49 PM
Common Crawl celebrates World Digital Preservation Day on Nov. 6, which invites the community to unite in answering a powerful question: Why Preserve?

commoncrawl.org/blog/common-...
November 6, 2025 at 2:56 PM
Setting the Record Straight

A recent article in The Atlantic makes several false and misleading claims about the Common Crawl Foundation, including the accusation that our organization has “lied to publishers” about our activities.

commoncrawl.org/blog/setting...
Common Crawl - Blog - Setting the Record Straight: Common Crawl’s Commitment to Transparency, Fair Use, and the Public Good
November 4, 2025 at 10:38 PM
Check out our newsletter for October/November 2025, with updates on what we've been up to.

commoncrawl.org/blog/october...
Common Crawl - Blog - October/November 2025 Newsletter
November 4, 2025 at 10:37 PM
The Common Crawl team presented a seminar at Stanford HAI entitled “Preserving Humanity's Knowledge and Making it Accessible: Addressing Challenges of Public Web Data”.

commoncrawl.org/blog/common-...
Common Crawl - Blog - Common Crawl Foundation at Stanford HAI
October 29, 2025 at 6:46 PM
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of August, September, and October 2025, consisting of 468.4 million nodes and 8.0 billion edges at the host level, and 97.7 million nodes and 6.0 billion edges at the domain level.
Common Crawl - Blog - Host- and Domain-Level Web Graphs August, September, and October 2025
October 29, 2025 at 6:45 PM
We are pleased to announce the release of the October 2025 crawl, containing 2.61 billion web pages or 468 TiB of uncompressed content.

commoncrawl.org/blog/october...
Common Crawl - Blog - October 2025 Crawl Archive Now Available
October 29, 2025 at 6:45 PM
The Common Crawl team attended the 2nd Conference on Language Modeling in Montréal, organizing a workshop, giving invited talks, and strengthening links with the research community.

commoncrawl.org/blog/common-...
Common Crawl - Blog - Common Crawl Foundation at COLM 2025
October 21, 2025 at 3:51 PM
Reposted by Common Crawl Foundation
If you were able to join us, let us know about your experience: docs.google.com/forms/d/e/1F...
October 10, 2025 at 8:52 PM
Reposted by Common Crawl Foundation
Thank you everyone for coming to WMDQS (pronounced "whim ducks")!
October 10, 2025 at 8:51 PM
Reposted by Common Crawl Foundation
After lunch, @sebnagel.bsky.social gave a keynote about the data collected by @commoncrawl.bsky.social!
October 10, 2025 at 8:46 PM
Reposted by Common Crawl Foundation
WMDQS is underway! Come join us in Room 520A at @colmweb.org! #COLM2025
October 10, 2025 at 4:18 PM