OSCAR Project
@oscarproject.bsky.social
The Open Super-large Crawled Aggregated coRpus
👀 We're working on many new features for you. Currently we're focusing on improving language identification, so if you want to help or contribute, please join our community 💬 on Discord: https://t.co/toLKAPje4E
August 10, 2023 at 3:50 PM
✨ Colossal OSCAR 1.0 has also been made possible thanks to the continuous support of Inria, the ALMAnaCH team and Common Crawl. Special thanks for the contributions of @ujj.bsky.social, Rua Ismail, @sobamchan.bsky.social, Sebastian Nagel and Benoît Sagot.
August 10, 2023 at 3:49 PM
As Colossal OSCAR 1.0 is based on Common Crawl, our annotations are distributed under the CC0 (Creative Commons Zero) license. For the textual content, however, users agree to the Common Crawl Terms of Use 📄
👉 https://commoncrawl.org/terms-of-use/
August 10, 2023 at 3:46 PM
Colossal OSCAR 1.0 is just a partial annotation of the WET files of 10 Common Crawl snapshots. The original data is included only for convenience, especially for researchers looking for data in lower-resource languages. 🗣️
August 10, 2023 at 3:45 PM
Colossal OSCAR 1.0 is by far our largest release to date, almost 10 times as big as previous releases. We're still working on statistics and documentation, so please bear with us while we finish these for you in the coming days and weeks. 🤓🧑‍🔬📊
August 10, 2023 at 3:44 PM
📣 The OSCAR Project and DFKI are happy to announce the release of Colossal OSCAR 1.0 📚, which is now available on the Hugging Face Hub 🤗 at https://huggingface.co/datasets/oscar-corpus/colossal-oscar-1.0
Colossal OSCAR 1.0 was put together by @pjox.bsky.social as part of the OpenGPT-X collaboration.
August 10, 2023 at 3:44 PM