Guilherme Penedo
guilherme.hf.co
ML Research Engineer at 🤗. Lisboeta 🇵🇹
We will very soon announce a big community project, and we are working on a 📝 blog post walking you through the entire dataset creation process. Stay tuned!
December 8, 2024 at 9:19 AM
Announcing 🥂 FineWeb2: A sparkling update with 1000s of 🗣️ languages.

We applied the same data-driven approach that led to SOTA English performance in 🍷 FineWeb to thousands of languages.

🥂 FineWeb2 has 8TB of compressed text data and outperforms other datasets.
December 8, 2024 at 9:19 AM