#commoncorpus
Ah, and I was about to download #PleIAs myself to test it. I don't mind the AGPL share-alike restriction; the problem is that the non-commercial-licensed data would taint the license of the output. Any plans to filter the #CommonCorpus even further to prevent these issues? @dorialexander.bsky.social
i really want to support the pleias common corpus project as an example of what to do, but the reality is that it's full of the same AGPL, non-commercial, or share-alike data, etc. from a legal perspective, this is no different than The Pile, and it's certainly not unrestricted.
February 14, 2025 at 8:25 PM
Pleias is a large language model trained exclusively on open data. It was developed using the Common Corpus, a dataset that addresses the need for high-quality compliant training data in AI development. huggingface.co/blog/Pclangl...

#opensourcellm #opendata #commoncorpus #llm #ai #ml
They Said It Couldn’t Be Done
A Blog post by Pierre-Carl Langlais on Hugging Face
February 14, 2025 at 7:39 PM
About the only project I'm aware of that cares about data provenance is CommonCorpus
March 18, 2025 at 6:56 PM
The very first order of work would be to rely on free cultural works, consensually released, as the source of training - projects such as #CommonCorpus being a step in the right direction. Anything else is a copyright nightmare in the making, not even considering the ethical implications on […]
Original post on hub.azkware.net
May 29, 2025 at 1:37 AM
happy to see that the #CommonCorpus shared today as a "public domain" dataset for training #AI is built from only PD materials, & not also openly licensed works (e.g., shared with @creativecommons.bsky.social licenses, which are open, but still copyrighted and not PD) huggingface.co/collections/...
Common Corpus - a PleIAs Collection
The largest public domain dataset for training LLMs.
March 21, 2024 at 1:46 AM