Lightnews — Scholar-powered news

Reposted

LLM360

@llm360.bsky.social

🌍🌎The global deduplication process was hairy 🙈 - and we want to share every detail.

The TxT360 dedup pipeline can be recreated and used for other datasets. We include our tips and tricks in a tell-all write-up in the release blog:
llm360-txt360.hf.space
huggingface.co/spaces/LLM36...

TxT360: Trillion Extracted Text - a Hugging Face Space by LLM360

November 19, 2024 at 10:51 PM

Reposted

LLM360

@llm360.bsky.social

Building on FineWeb’s global deduplication findings, we introduce a strategic upsampling recipe for TxT360 that outperforms FineWeb. Full details are in the Upsampling Experiment section of the release blog.

November 19, 2024 at 10:51 PM

Reposted

LLM360

@llm360.bsky.social

🪟🛠️LLM360 is committed to making open source AI accessible, transparent, and reproducible.

High-quality data is the first step toward better open source models, and we are excited to join the party by contributing the first globally deduplicated dataset containing 5.7T tokens!

November 19, 2024 at 10:51 PM

Reposted

LLM360

@llm360.bsky.social

📢📢 Check out:

TxT360: a globally deduplicated dataset for LLM pretraining

🌐 99 Common Crawl snapshots
📘 14 Curated Sources
👨‍🍳 recipe to easily adjust data weighting and train the most performant models

Dataset:
huggingface.co/datasets/LLM...

Blog:
llm360-txt360.hf.space

Banner image showing the TxT360 project.

November 19, 2024 at 10:51 PM

Reposted

LLM360

@llm360.bsky.social

We've made a starter pack for researchers/organizations working on open-source LLMs.

Please let us know if we missed you or if you'd like to be added!

go.bsky.app/FELkyDr

Open-source LLMs

November 20, 2024 at 1:33 AM
