OSCAR Project
oscarproject.bsky.social
The Open Super-large Crawled Aggregated coRpus
👀 We're working on many new features for you. Currently we're focusing on improving language identification, so if you want to help or contribute, please join our community 💬 on Discord: https://t.co/toLKAPje4E
August 10, 2023 at 3:50 PM
✨ Colossal OSCAR 1.0 has also been made possible thanks to the continuous support of Inria, the ALMAnaCH team and Common Crawl. Special thanks for the contributions of @ujj.bsky.social, Rua Ismail, @sobamchan.bsky.social, Sebastian Nagel and Benoît Sagot.
August 10, 2023 at 3:49 PM
As Colossal OSCAR 1.0 is based on Common Crawl, our annotations are distributed under the CC0 (Creative Commons Zero) license. For the textual content, however, users agree to the Common Crawl Terms of Use 📄
👉 https://commoncrawl.org/terms-of-use/
August 10, 2023 at 3:46 PM
Colossal OSCAR 1.0 is just a partial annotation of the WET files of 10 Common Crawl snapshots. The original data is included only for convenience, especially for researchers looking for data in lower-resource languages. 🗣️
August 10, 2023 at 3:45 PM
Colossal OSCAR 1.0 is by far our largest release to date, almost 10 times as big as previous releases. We're still working on statistics and documentation, so please bear with us while we finish these for you in the coming days and weeks. 🤓🧑‍🔬📊
August 10, 2023 at 3:44 PM