preslavnakov.bsky.social
@preslavnakov.bsky.social
Reposted
🌍🌎The global deduplication process was hairy 🙈 - and we want to share every detail.

The TxT360 dedup pipeline can be recreated and used for other datasets. We include our tips and tricks in a tell-all write-up in the release blog:
llm360-txt360.hf.space
huggingface.co/spaces/LLM36...
TxT360: Trillion Extracted Text - a Hugging Face Space by LLM360
November 19, 2024 at 10:51 PM
Reposted
Building on FineWeb’s global deduplication findings, we introduce a strategic upsampling recipe that, applied to TxT360, outperforms FineWeb. Full details are in the Upsampling Experiment section of the release blog.
November 19, 2024 at 10:51 PM
Reposted
🪟🛠️LLM360 is committed to making open source AI accessible, transparent, and reproducible.

High-quality data is the first step toward better open source models...and we are excited to join the party contributing the first globally deduplicated dataset containing 5.7T tokens!
November 19, 2024 at 10:51 PM
Reposted
📢📢 Check out:

TxT360: a globally deduplicated dataset for LLM pretraining

🌐 99 Common Crawl snapshots
📘 14 Curated Sources
👨‍🍳 recipe to easily adjust data weighting and train the most performant models

Dataset:
huggingface.co/datasets/LLM...

Blog:
llm360-txt360.hf.space
November 19, 2024 at 10:51 PM
Reposted
We've made a starter pack for researchers/organizations working on open-source LLMs.

Please let us know if we missed you or if you'd like to be added!

go.bsky.app/FELkyDr
Open-source LLMs
November 20, 2024 at 1:33 AM