Anton
@anton-l.bsky.social
Feeding LLMs @ Hugging Face
LLM Reasoning labs will be eating good today🍔

We commandeered the HF cluster for a few days and generated 1.2M reasoning-filled solutions to 500k NuminaMath problems with DeepSeek-R1 🐳
Have fun!
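
For a sense of what such a run looks like, here's a minimal, hypothetical sketch (it assumes a small R1 distill checkpoint and the public AI-MO/NuminaMath-CoT dataset so it fits on one GPU; the actual pipeline ran DeepSeek-R1 across the cluster):

```python
# Minimal sketch, NOT the actual cluster pipeline: sample R1-style reasoning
# traces for NuminaMath problems with vLLM, using a small distill checkpoint.
from datasets import load_dataset
from vllm import LLM, SamplingParams

problems = load_dataset("AI-MO/NuminaMath-CoT", split="train[:8]")

llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=4096, n=2)

prompts = [
    f"{row['problem']}\nPlease reason step by step, and put your final "
    "answer within \\boxed{}."
    for row in problems
]

for request in llm.generate(prompts, params):
    for sample in request.outputs:  # n=2 candidate solutions per problem
        print(sample.text[:300], "\n---")
```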
February 12, 2025 at 2:36 PM
We hope this dataset helps improve the performance of LLMs on math 🚀 We’re also releasing all the ablation models in this collection, as well as the evaluation code.

Collection: huggingface.co/collections/...

Evaluation: github.com/huggingface/...
December 19, 2024 at 3:55 PM
Below is the breakdown of the performance of each data source after decontamination: FineMath 4+ outperforms all the other datasets when continuing pre-training of Llama3.2-3B-Base on 60B tokens.
December 19, 2024 at 3:55 PM
For text extraction, we switched to Resiliparse combined with OWM’s pipeline.
We then trained a classifier on Llama3’s annotations to find pages with math reasoning and applied it in two stages. This helped us identify the key math domains and recall high-quality math data.

huggingface.co/HuggingFaceT...
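
For illustration, the extraction step looks roughly like this (a minimal sketch; OWM's LaTeX handling and filtering, which sit on top, aren't shown):

```python
# Minimal sketch of the extraction step: pull main-content text from a
# crawled page with Resiliparse, dropping navbars, footers and sidebars.
from resiliparse.extract.html2text import extract_plain_text
from resiliparse.parse.encoding import detect_encoding

def extract(raw_bytes: bytes) -> str:
    html = raw_bytes.decode(detect_encoding(raw_bytes), errors="replace")
    return extract_plain_text(html, main_content=True)

page = (b"<html><body><nav>Home | About</nav><main>"
        b"<p>Let \\(x^2 = 4\\), so x = 2 or x = -2.</p>"
        b"</main></body></html>")
print(extract(page))
```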
December 19, 2024 at 3:55 PM
Turns out math formatting is very important: our FineWebMath data was worse than OWM.

The classifier was mostly retrieving academic papers because math forums weren’t properly extracted with Trafilatura, and most equations needed better formatting.
December 19, 2024 at 3:55 PM
The Llama 3 team trained a DistilRoBERTa classifier to target pages with math reasoning and deduction. The process resembles FineWeb-Edu, where we train classifiers on synthetic web annotations.

The authors also highlight a specialized math-aware extractor for HTML pages that preserves the equations.
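
A rough sketch of that recipe, assuming a hypothetical annotations dataset with "text" and 0-5 "score" columns (the real training setup likely differs):

```python
# Rough sketch of the FineWeb-Edu-style recipe: fine-tune a small encoder as
# a regressor on LLM-written quality scores. Dataset repo here is hypothetical.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilroberta-base", num_labels=1, problem_type="regression"
)

# Hypothetical columns: "text" (page content), "score" (0-5 LLM annotation).
ds = load_dataset("my-org/llm-math-annotations", split="train")
ds = ds.map(lambda x: tok(x["text"], truncation=True, max_length=512),
            batched=True)
ds = ds.map(lambda x: {"labels": float(x["score"])})  # regression targets

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="math-classifier",
                           per_device_train_batch_size=32,
                           num_train_epochs=1),
    train_dataset=ds,
    tokenizer=tok,  # enables padding collation per batch
)
trainer.train()
```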
December 19, 2024 at 3:55 PM
First let’s break down how AI labs curate math pre-training datasets 🕵️

DeepSeekMath and QwenMath train a fastText classifier on data like OpenWebMath (OWM). They iteratively filter and recall math content from Common Crawl, focusing on the most relevant domains.
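
Here's a toy sketch of that seed-classifier idea with fastText (real runs train on millions of pages and tune the recall threshold per domain):

```python
# Toy sketch: positives from a math corpus (e.g. OWM), negatives from random
# web pages, then score new text to recall math content from Common Crawl.
import fasttext

math_pages = [
    "Let f(x) = x^2. Then f'(x) = 2x by the power rule.",
    "Prove that the sum of two even integers is even.",
]
web_pages = [
    "The best running shoes of 2024, reviewed and ranked.",
    "Sign up for our newsletter to get weekly recipes.",
]

# fastText format: one example per line, "__label__<class> <text>"
with open("train.txt", "w") as f:
    for text in math_pages:
        f.write("__label__math " + " ".join(text.split()) + "\n")
    for text in web_pages:
        f.write("__label__other " + " ".join(text.split()) + "\n")

model = fasttext.train_supervised("train.txt", wordNgrams=2, minCount=1)

labels, probs = model.predict("Solve 3x + 5 = 11 for x.")
print(labels[0], probs[0])  # keep pages above a probability threshold
```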
December 19, 2024 at 3:55 PM
Introducing 📐FineMath: the best open math pre-training dataset with 50B+ tokens!

Math remains challenging for LLMs and by training on FineMath we see considerable gains over other math datasets, especially on GSM8K and MATH.

🤗 huggingface.co/datasets/Hug...
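
A quick way to peek at the data (config names follow the dataset card; finemath-4plus is the strictest filter):

```python
# Stream the top-scored split to avoid downloading the full dataset.
# Config name "finemath-4plus" is taken from the dataset card.
from datasets import load_dataset

ds = load_dataset("HuggingFaceTB/finemath", "finemath-4plus",
                  split="train", streaming=True)
print(next(iter(ds))["text"][:500])
```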

Here’s a breakdown 🧵
December 19, 2024 at 3:55 PM
Check out how easy it is to do LLM evals with LightEval!

* any dataset on the 🤗 Hub can become an eval task in a few lines of code: customize the prompt, metrics, parsing, few-shots, everything! (see the sketch after this list)
* model- and data-parallel inference
* auto batching with the new vLLM backend
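
Here's a rough sketch of the community-task pattern (field names follow the community-task examples in the repo and can vary between lighteval versions; the dataset id and column names are hypothetical):

```python
# Rough sketch of a lighteval community task; exact fields vary by version.
from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc

def prompt_fn(line, task_name: str = None):
    # Map one dataset row to a Doc: prompt, choices and gold answer.
    return Doc(
        task_name=task_name,
        query=f"Question: {line['question']}\nAnswer:",
        choices=[line["answer"]],  # hypothetical column names
        gold_index=0,
    )

my_task = LightevalTaskConfig(
    name="my_math_eval",
    prompt_function=prompt_fn,
    suite=["community"],
    hf_repo="my-org/my-dataset",  # hypothetical Hub dataset
    hf_subset="default",
    hf_avail_splits=["test"],
    evaluation_splits=["test"],
    metric=[Metrics.quasi_exact_match],
    generation_size=256,
    stop_sequence=["\n"],
)

TASKS_TABLE = [my_task]  # discovered via --custom-tasks path/to/this_file.py
```

Registering the file with --custom-tasks makes the task runnable like any built-in one; the exact CLI invocation depends on your lighteval version.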
November 25, 2024 at 5:24 PM