Common Crawl Foundation
@commoncrawl.bsky.social
Common Crawl is a non-profit foundation dedicated to the Open Web.
Common Crawl celebrates World Digital Preservation Day on Nov. 6, which invites the community to unite in answering a powerful question: Why Preserve?
commoncrawl.org/blog/common-...
November 6, 2025 at 2:56 PM
Setting the Record Straight
A recent article in The Atlantic makes several false and misleading claims about the Common Crawl Foundation, including the accusation that our organization has “lied to publishers” about our activities.
commoncrawl.org/blog/setting...
Common Crawl - Blog - Setting the Record Straight: Common Crawl’s Commitment to Transparency, Fair Use, and the Public Good
November 4, 2025 at 10:38 PM
Check out our newsletter for October/November 2025, with updates on what we've been up to
commoncrawl.org/blog/october...
Common Crawl - Blog - October/November 2025 Newsletter
November 4, 2025 at 10:37 PM
The Common Crawl team presented a seminar at Stanford HAI entitled “Preserving Humanity's Knowledge and Making it Accessible: Addressing Challenges of Public Web Data”.
commoncrawl.org/blog/common-...
Common Crawl - Blog - Common Crawl Foundation at Stanford HAI
October 29, 2025 at 6:46 PM
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of August, September, and October 2025, consisting of 468.4 million nodes and 8.0 billion edges at the host level, and 97.7 million nodes and 6.0 billion edges at the domain level.
Common Crawl - Blog - Host- and Domain-Level Web Graphs August, September, and October 2025
October 29, 2025 at 6:45 PM
We are pleased to announce the release of the October 2025 crawl, containing 2.61 billion web pages, or 468 TiB of uncompressed content.
commoncrawl.org/blog/october...
Common Crawl - Blog - October 2025 Crawl Archive Now Available
October 29, 2025 at 6:45 PM
The Common Crawl team attended the 2nd Conference on Language Modeling in Montréal, organizing a workshop, giving invited talks, and strengthening links with the research community.
commoncrawl.org/blog/common-...
Common Crawl - Blog - Common Crawl Foundation at COLM 2025
October 21, 2025 at 3:51 PM
Reposted by Common Crawl Foundation
If you were able to join us, let us know about your experience: docs.google.com/forms/d/e/1F...
October 10, 2025 at 8:52 PM
Reposted by Common Crawl Foundation
Thank you everyone for coming to WMDQS (pronounced "whim ducks")!
October 10, 2025 at 8:51 PM
Reposted by Common Crawl Foundation
After lunch, @sebnagel.bsky.social gave a keynote about the data collected by @commoncrawl.bsky.social!
October 10, 2025 at 8:46 PM
Reposted by Common Crawl Foundation
WMDQS is underway! Come join us in Room 520A at @colmweb.org! #COLM2025
October 10, 2025 at 4:18 PM
Reposted by Common Crawl Foundation
Looking forward to tomorrow's #COLM2025 workshop on multilingual data quality! 🤩
In collaboration with @commoncrawl.bsky.social, MLCommons, and @eleutherai.bsky.social, the first edition of WMDQS at @colmweb.org starts tomorrow in Room 520A! We have an updated schedule on our website, including a list of all accepted papers.
October 9, 2025 at 11:16 PM
Reposted by Common Crawl Foundation
In collaboration with @commoncrawl.bsky.social, MLCommons, and @eleutherai.bsky.social, the first edition of WMDQS at @colmweb.org starts tomorrow in Room 520A! We have an updated schedule on our website, including a list of all accepted papers.
October 9, 2025 at 8:17 PM
Common Crawl has added IBM’s GneissWeb quality and category annotations to its web dataset, enabling users to filter high-quality content and explore topics like medicine, education, and technology.
commoncrawl.org/blog/announc...
Common Crawl - Blog - Announcing GneissWeb Annotations
October 6, 2025 at 11:26 AM
Common Crawl’s Web Languages initiative has had many contributions since its introduction. We’re calling for native speakers of certain languages to review language contributions, to ensure that links we’re adding to our seed crawl are of good quality.
commoncrawl.org/blog/web-lan...
Common Crawl - Blog - Web Languages Needing Review by Native Speakers
October 2, 2025 at 10:06 PM
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of July, August, and September 2025. The host-level graph consists of 628.7 million nodes and 6.9 billion edges, and the domain-level graph consists of 184.6 million nodes and 5.4 billion edges.
Common Crawl - Blog - Host- and Domain-Level Web Graphs July, August, and September 2025
October 2, 2025 at 10:04 PM
Traditional search engine optimization is rapidly evolving into "AIO" (AI optimization), where businesses must ensure their content exists in AI training datasets to remain discoverable as users increasingly turn to AI assistants for answers.
commoncrawl.org/blog/from-se...
Common Crawl - Blog - From SEO to AIO: Why Your Content Needs to Exist in AI Training Data
October 2, 2025 at 10:03 PM
We are pleased to announce the release of our September 2025 crawl, containing 2.39 billion web pages, or 421 TiB of uncompressed content.
www.commoncrawl.org/blog/septemb...
Common Crawl - Blog - September 2025 Crawl Archive Now Available
September 23, 2025 at 3:14 PM
Publishers have been sending Common Crawl legal opt-out requests. In the interest of transparency and to better serve our ecosystem, we are publishing the full opt-out list for every legal request we have received.
commoncrawl.org/blog/common-...
Common Crawl - Blog - Common Crawl Foundation Opt-Out Registry
September 18, 2025 at 6:36 AM
On the 28th and 29th of August 2025, Thom Vaughan, Pedro Ortiz Suarez, and Thijs Dalhuijsen attended the Linux Foundation’s AI_dev event in Amsterdam.
commoncrawl.org/blog/trip-re...
Common Crawl - Blog - Trip Report: AI_dev (Linux Foundation) August 2025
September 18, 2025 at 6:33 AM
On October 22, the Common Crawl team will lead a seminar at Stanford HAI. Our topic of discussion is “Preserving Humanity's Knowledge and Making it Accessible: Addressing Challenges of Public Web Data”.
Please register at: hai.stanford.edu/events/commo...
Common Crawl Foundation | Preserving Humanity's Knowledge and Making it Accessible: Addressing Challenges of Public Web Data | Stanford HAI
Learn about Common Crawl's insights from a recent data product and informed solutions for the future of public web data.
September 18, 2025 at 6:31 AM
We’re Walling Off The Open Internet To Stop AI—And It May End Up Breaking Everything Else
www.techdirt.com/2025/09/08/w...
A longtime open internet activist recently asked me whether I’d reversed my position on internet openness and copyright because of AI. The question caught me off guard—until I realized what h…
September 9, 2025 at 4:06 PM
Stanford HAI and Common Crawl are joining forces to explore how open data can shape the future of AI. On 22 October 2025, their seminar will address privacy, safety, and security while showcasing new ways to preserve and share humanity’s knowledge.
www.commoncrawl.org/blog/common-...
Common Crawl - Blog - Common Crawl Foundation at Stanford HAI: A Shared Legacy of Data and Innovation
September 9, 2025 at 4:05 PM
We are pleased to release our newsletter for July and August 2025, with updates on our team's activities.
commoncrawl.org/blog/july-au...
Common Crawl - Blog - July/August 2025 Newsletter
August 26, 2025 at 9:34 PM
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of June, July, and August 2025.
commoncrawl.org/blog/host--a...
Common Crawl - Blog - Host- and Domain-Level Web Graphs June, July, and August 2025
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of June, July, and August 2025. The host-level graph consists of 691.1 million nodes and 5.0 bill...
August 22, 2025 at 12:41 AM