Anton
anton-l.bsky.social
Feeding LLMs @ Hugging Face
Stay tuned for more Open R1 updates!

huggingface.co/blog/open-r1...
Open R1: Update #2
A Blog post by Open R1 on Hugging Face
huggingface.co
February 12, 2025 at 2:36 PM
We hope this dataset helps advance the performance of LLMs on Math 🚀 We’re also releasing all the ablation models in this collection, as well as the evaluation code.

Collection: huggingface.co/collections/...

Evaluation: github.com/huggingface/...
December 19, 2024 at 3:55 PM
Below is the performance breakdown for each data source after decontamination: FineMath 4+ outperforms all other datasets when used for continued pre-training of Llama3.2-3B-Base on 60B tokens.
December 19, 2024 at 3:55 PM
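The decontamination step above can be sketched as an n-gram overlap check; a minimal illustration, assuming a string-based setup and a 13-token window (a common choice, not necessarily the exact pipeline):

```python
from typing import Iterable, List, Set

# Minimal decontamination sketch: drop any training page that shares an
# n-gram with a benchmark question. Window size and the plain-string
# interface are assumptions for illustration.
def ngrams(text: str, n: int) -> Set[str]:
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def decontaminate(pages: List[str], benchmarks: Iterable[str], n: int = 13) -> List[str]:
    bench: Set[str] = set()
    for b in benchmarks:
        bench |= ngrams(b, n)
    return [p for p in pages if not (ngrams(p, n) & bench)]
```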
We got two high-quality datasets of 34B and 10B tokens, depending on the filtering threshold (3 vs. 4).

We also augment the datasets by filtering the English text subset of InfiMM-WebMath-40B with our math classifier and adding it to FineMath.
December 19, 2024 at 3:55 PM
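The two thresholds above amount to splitting pages by classifier score; a hypothetical sketch (the field name "score" and the list-based interface are assumptions):

```python
# Sketch: split pages into the two dataset variants by math-classifier score.
# Scores >= 3 go into the larger set, scores >= 4 into the stricter one.
def split_by_threshold(pages, lo=3, hi=4):
    finemath_3plus, finemath_4plus = [], []
    for page in pages:
        if page["score"] >= hi:
            finemath_4plus.append(page)
        if page["score"] >= lo:
            finemath_3plus.append(page)
    return finemath_3plus, finemath_4plus
```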
For the text extraction, we switched to Resiliparse with OWM’s pipeline.
We then trained a classifier on Llama3's annotations to find pages with math reasoning and applied it in two stages. This helped us identify key math domains and recall high-quality math data.

huggingface.co/HuggingFaceT...
December 19, 2024 at 3:55 PM
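The two-stage idea can be sketched as: score individual pages first, then keep everything from the domains that most often host high-scoring math content. All names and thresholds below are illustrative, not the actual pipeline:

```python
from collections import Counter
from urllib.parse import urlparse

# Stage 1: score pages and find the domains with the most math hits.
# Stage 2: recall *all* pages from those domains, even low-scoring ones.
def two_stage_recall(pages, score_fn, threshold=0.9, top_k=3):
    hits = [p for p in pages if score_fn(p["text"]) >= threshold]
    counts = Counter(urlparse(p["url"]).netloc for p in hits)
    math_domains = {d for d, _ in counts.most_common(top_k)}
    return [p for p in pages if urlparse(p["url"]).netloc in math_domains]
```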
💡It was time to re-extract the Common Crawl data directly.

We retrieved pages from FineWeb’s URLs to retain its high-quality data. Then, we added back the math pages that earlier FineWeb filters had removed, such as those containing curly braces ("{}"), a common LaTeX pattern.
December 19, 2024 at 3:55 PM
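A minimal sketch of that recall rule, checking for a `{...}` span as a LaTeX signal (e.g. `\frac{a}{b}`); the real filters are more involved than this:

```python
import re

# Hypothetical recall rule: flag pages an earlier filter dropped for
# containing curly braces, since {...} is a common LaTeX pattern.
LATEX_BRACES = re.compile(r"\{[^{}]*\}")

def looks_like_latex(text: str) -> bool:
    return bool(LATEX_BRACES.search(text))
```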
Turns out math formatting is very important: our FineWebMath data was worse than OWM.

The classifier was mostly retrieving academic papers because math forums weren’t properly extracted with Trafilatura, and most equations needed better formatting.
December 19, 2024 at 3:55 PM
For FineMath, we first tried starting directly from FineWeb. Although we didn’t tailor FineWeb’s text extraction for math, the data retained enough equations.

Then we trained a fastText classifier to retrieve OWM-like data.
December 19, 2024 at 3:55 PM
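fastText's supervised mode expects one example per line, prefixed with `__label__<class>`. A small sketch of preparing such a file (labels and path are illustrative; training itself would then be `fasttext.train_supervised(input=path)`):

```python
# Sketch: write training data in fastText's supervised format,
# where each line reads "__label__<class> <text>".
def write_fasttext_file(path, examples):
    with open(path, "w", encoding="utf-8") as f:
        for text, label in examples:
            clean = text.replace("\n", " ")  # fastText reads one example per line
            f.write(f"__label__{label} {clean}\n")
```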
For Llama3, the authors train a DistilRoBERTa classifier to target pages with math reasoning and deduction. The process resembles FineWeb-Edu, where we train classifiers on synthetic web annotations.

The authors also highlight a specialized math extractor for HTML pages that preserves the equations.
December 19, 2024 at 3:55 PM
First let’s break down how AI labs curate math pre-training datasets 🕵️

DeepSeekMath and QwenMath train a fastText classifier on data like OpenWebMath (OWM). They iteratively filter and recall math content from Common Crawl, focusing on the most relevant domains.
December 19, 2024 at 3:55 PM
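The iterative filter-and-recall loop can be sketched like this; every name below is hypothetical, in the spirit of the recipe rather than anyone's actual code:

```python
# Sketch: grow a positive set by repeatedly (1) fitting a classifier on the
# current positives and (2) recalling matching pages from the crawl.
def iterative_recall(seed_pages, crawl_pages, train_classifier, rounds=2):
    positives = list(seed_pages)
    for _ in range(rounds):
        is_math = train_classifier(positives)              # fit on current positives
        recalled = [p for p in crawl_pages if is_math(p)]  # recall from the crawl
        positives = list(seed_pages) + recalled            # grow the positive set
    return positives
```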
Repo: github.com/huggingface/...

Here's how we use it for SmolLM 🤏
github.com/huggingface/...
smollm/evaluation at main · huggingface/smollm
Everything about the SmolLM & SmolLM2 family of models - huggingface/smollm
github.com
November 25, 2024 at 5:24 PM