In collaboration with @commoncrawl.bsky.social, @mlcommons.org, and @jhu.edu, we built a language identification (LID) benchmark on actual Common Crawl text covering 109 languages. Existing evaluations overestimate how well LID works on web data.
arxiv.org/abs/2601.18026
www.commoncrawl.org/blog/host--a...
www.commoncrawl.org/blog/january...
www.commoncrawl.org/blog/web-arc...
commoncrawl.org/blog/how-seo...
A new Common Crawl index annotation has been added to Hugging Face and our S3 bucket.
commoncrawl.org/blog/gneissw...
www.commoncrawl.org/blog/common-...
commoncrawl.org/blog/host--a...
commoncrawl.org/blog/decembe...
commoncrawl.org/blog/a-sampl...
In this talk, she introduced @commoncrawl.bsky.social and the data products they offer.
commoncrawl.org/blog/host--a...
commoncrawl.org/blog/novembe...
commoncrawl.org/blog/common-...
A recent article in The Atlantic makes several false and misleading claims about the Common Crawl Foundation, including the accusation that our organization has “lied to publishers” about our activities.
commoncrawl.org/blog/setting...
commoncrawl.org/blog/october...
commoncrawl.org/blog/common-...
commoncrawl.org/blog/october...
commoncrawl.org/blog/common-...