Lightnews — Scholar-powered news

Carlos Solís

@csolisr.azkware.net

Ah, and I was about to download #PleIAs myself to test it. The AGPL share-alike restriction I don't mind, the problem is the non-commercial-licensed data would taint the license of the output. Any plans to filter the #CommonCorpus even further to prevent these issues? @dorialexander.bsky.social

michael bommarito @mjbommar.bsky.social · Dec 3

i really want to support the pleias common corpus project as an example of what to do, but the reality is that it's full of the same AGPL, non-commercial or share-alike, etc... from a legal perspective, this is no different than The Pile, and it's certainly not unrestricted.

February 14, 2025 at 8:25 PM

ethicalabs.ai

@ethicalabs.bsky.social

Pleias is a large language model trained exclusively on open data. It was developed using the Common Corpus, a dataset that addresses the need for high-quality compliant training data in AI development. huggingface.co/blog/Pclangl...

#opensourcellm #opendata #commoncorpus #llm #ai #ml

They Said It Couldn’t Be Done

A Blog post by Pierre-Carl Langlais on Hugging Face

huggingface.co

February 14, 2025 at 7:39 PM

Carlos Solís

@csolisr.azkware.net

About the only project I'm aware of that cares about data provenance is CommonCorpus

March 18, 2025 at 6:56 PM

Carlos Solís

@csolisr.hub.azkware.net.ap.brid.gy

The very first order of work would be to rely on free cultural works, consensually released, as the source of training - projects such as #CommonCorpus being a step in the right direction. Anything else is a copyright nightmare in the making, not even considering the ethical implications on […]

Original post on hub.azkware.net


May 29, 2025 at 1:37 AM

Nate Angell

@xolotl.org

happy to see that the #CommonCorpus shared today as a "public domain" dataset for training #AI is built from only PD materials, & not also openly licensed works (eg, shared with @creativecommons.bsky.social licenses — which are open, but still copyrighted and not PD) huggingface.co/collections/...

Common Corpus - a PleIAs Collection

The largest public domain dataset for training LLMs.

huggingface.co

March 21, 2024 at 1:46 AM
