Anton
anton-l.bsky.social
Feeding LLMs @ Hugging Face
Stay tuned for more Open R1 updates!

huggingface.co/blog/open-r1...
Open R1: Update #2
A Blog post by Open R1 on Hugging Face
huggingface.co
February 12, 2025 at 2:36 PM
We hope this dataset helps advance the performance of LLMs on Math 🚀 We’re also releasing all the ablation models in this collection, as well as the evaluation code.

Collection: huggingface.co/collections/...

Evaluation: github.com/huggingface/...
December 19, 2024 at 3:55 PM
Below is the performance breakdown for each data source after decontamination: FineMath 4+ outperforms all other datasets when used for continued pre-training of Llama3.2-3B-Base on 60B tokens.
December 19, 2024 at 3:55 PM
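The decontamination step above can be sketched as an n-gram overlap check; a minimal illustration, assuming a string-based setup and a 13-token window (a common choice, not necessarily the exact pipeline):

```python
from typing import Iterable, List, Set

# Minimal decontamination sketch: drop any training page that shares an
# n-gram with a benchmark question. Window size and the plain-string
# interface are assumptions for illustration.
def ngrams(text: str, n: int) -> Set[str]:
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def decontaminate(pages: List[str], benchmarks: Iterable[str], n: int = 13) -> List[str]:
    bench: Set[str] = set()
    for b in benchmarks:
        bench |= ngrams(b, n)
    return [p for p in pages if not (ngrams(p, n) & bench)]
```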
We got two high-quality datasets of 34B and 10B tokens, depending on the filtering threshold (3 vs. 4).

We also augment the datasets by filtering the English text subset of InfiMM-WebMath-40B with our math classifier and adding it to FineMath.
December 19, 2024 at 3:55 PM
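The two thresholds above amount to splitting pages by classifier score; a hypothetical sketch (the field name "score" and the list-based interface are assumptions):

```python
# Sketch: split pages into the two dataset variants by math-classifier score.
# Scores >= 3 go into the larger set, scores >= 4 into the stricter one.
def split_by_threshold(pages, lo=3, hi=4):
    finemath_3plus, finemath_4plus = [], []
    for page in pages:
        if page["score"] >= hi:
            finemath_4plus.append(page)
        if page["score"] >= lo:
            finemath_3plus.append(page)
    return finemath_3plus, finemath_4plus
```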
For the text extraction, we switched to Resiliparse with OWM’s pipeline.
We then trained a classifier on Llama3's annotations to find pages with math reasoning and applied it in two stages. This helped us identify key math domains and recall high-quality math data.

huggingface.co/HuggingFaceT...
December 19, 2024 at 3:55 PM
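The two-stage idea can be sketched as: score individual pages first, then keep everything from the domains that most often host high-scoring math content. All names and thresholds below are illustrative, not the actual pipeline:

```python
from collections import Counter
from urllib.parse import urlparse

# Stage 1: score pages and find the domains with the most math hits.
# Stage 2: recall *all* pages from those domains, even low-scoring ones.
def two_stage_recall(pages, score_fn, threshold=0.9, top_k=3):
    hits = [p for p in pages if score_fn(p["text"]) >= threshold]
    counts = Counter(urlparse(p["url"]).netloc for p in hits)
    math_domains = {d for d, _ in counts.most_common(top_k)}
    return [p for p in pages if urlparse(p["url"]).netloc in math_domains]
```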
💡It was time to re-extract the Common Crawl data directly.

We retrieved pages from FineWeb’s URLs to retain its high-quality data. Then, we added back the math pages that earlier FineWeb filters had removed, such as those containing curly braces ("{}"), a common LaTeX pattern.
December 19, 2024 at 3:55 PM
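A minimal sketch of that recall rule, checking for a `{...}` span as a LaTeX signal (e.g. `\frac{a}{b}`); the real filters are more involved than this:

```python
import re

# Hypothetical recall rule: flag pages an earlier filter dropped for
# containing curly braces, since {...} is a common LaTeX pattern.
LATEX_BRACES = re.compile(r"\{[^{}]*\}")

def looks_like_latex(text: str) -> bool:
    return bool(LATEX_BRACES.search(text))
```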
Turns out math formatting is very important: our FineWebMath data was worse than OWM.

The classifier was mostly retrieving academic papers because math forums weren’t properly extracted with Trafilatura, and most equations needed better formatting.
December 19, 2024 at 3:55 PM
For FineMath, we first tried starting directly from FineWeb. Although we didn’t tailor FineWeb’s text extraction for math, the data retained enough equations.

Then we trained a fastText classifier to retrieve OWM-like data.
December 19, 2024 at 3:55 PM
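fastText's supervised mode expects one example per line, prefixed with `__label__<class>`. A small sketch of preparing such a file (labels and path are illustrative; training itself would then be `fasttext.train_supervised(input=path)`):

```python
# Sketch: write training data in fastText's supervised format,
# where each line reads "__label__<class> <text>".
def write_fasttext_file(path, examples):
    with open(path, "w", encoding="utf-8") as f:
        for text, label in examples:
            clean = text.replace("\n", " ")  # fastText reads one example per line
            f.write(f"__label__{label} {clean}\n")
```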
For Llama3, the authors train a DistilRoBERTa classifier to target pages with math reasoning and deduction. The process resembles FineWeb-Edu, where we train classifiers on synthetic web annotations.

The authors also highlight a specialized math extractor for HTML pages that preserves the equations.
December 19, 2024 at 3:55 PM
First let’s break down how AI labs curate math pre-training datasets 🕵️

DeepSeekMath and QwenMath train a fastText classifier on data like OpenWebMath (OWM). They iteratively filter and recall math content from Common Crawl, focusing on the most relevant domains.
December 19, 2024 at 3:55 PM
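The iterative filter-and-recall loop can be sketched like this; every name below is hypothetical, in the spirit of the recipe rather than anyone's actual code:

```python
# Sketch: grow a positive set by repeatedly (1) fitting a classifier on the
# current positives and (2) recalling matching pages from the crawl.
def iterative_recall(seed_pages, crawl_pages, train_classifier, rounds=2):
    positives = list(seed_pages)
    for _ in range(rounds):
        is_math = train_classifier(positives)              # fit on current positives
        recalled = [p for p in crawl_pages if is_math(p)]  # recall from the crawl
        positives = list(seed_pages) + recalled            # grow the positive set
    return positives
```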
Repo: github.com/huggingface/...

Here's how we use it for SmolLM 🤏
github.com/huggingface/...
smollm/evaluation at main · huggingface/smollm
Everything about the SmolLM & SmolLM2 family of models - huggingface/smollm
github.com
November 25, 2024 at 5:24 PM