Collection: huggingface.co/collections/...
Evaluation: github.com/huggingface/...
We also augment the datasets by filtering the English text subset of InfiMM-WebMath-40B with our math classifier and adding it to FineMath.
We then trained a classifier on Llama3's annotations to find pages with math reasoning and applied it in two stages. This helped us identify key math domains and recall high-quality math data.
huggingface.co/HuggingFaceT...
We retrieved pages from FineWeb’s URLs to retain its high-quality data. Then, we added back the math pages that earlier FineWeb filters had removed, such as those containing curly braces (“{}”), a common LaTeX pattern.
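As a rough illustration of the curly-brace issue (a minimal sketch, not the actual FineWeb or FineMath code): a blanket filter that rejects any page containing a brace also rejects LaTeX. The function names and the regex below are hypothetical and chosen only for this example.

```python
import re

# Assumption: treat "\command{" or "_{" / "^{" as a sign of LaTeX markup.
LATEX_PATTERN = re.compile(r"\\[a-zA-Z]+\s*\{|[_^]\{")

def passes_blanket_brace_filter(text: str) -> bool:
    """Hypothetical quality filter: reject any page that contains a curly brace."""
    return "{" not in text and "}" not in text

def looks_like_latex(text: str) -> bool:
    """Heuristic check for LaTeX-style brace usage."""
    return LATEX_PATTERN.search(text) is not None

def recall_math_pages(pages):
    """Return pages the blanket filter would drop but that look like LaTeX."""
    return [
        page for page in pages
        if not passes_blanket_brace_filter(page) and looks_like_latex(page)
    ]
```

Here a page with `\frac{1}{2}` would be recalled, while a page with JavaScript-style braces would stay filtered out.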
The classifier was mostly retrieving academic papers because math forums weren’t properly extracted with Trafilatura, and most equations needed better formatting.
Then we trained a fastText classifier to retrieve OWM-like data.
The authors highlight a specialized math extractor from HTML pages to preserve the equations.
DeepSeekMath and QwenMath train a fastText classifier on data like OpenWebMath (OWM). They iteratively filter and recall math content from Common Crawl, focusing on the most relevant domains.
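The domain-focused recall step can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the pipeline from either paper: `is_math` stands in for a trained classifier (for example a fastText model), and the `min_fraction` and `min_pages` thresholds are made-up defaults.

```python
from collections import defaultdict
from urllib.parse import urlparse

def math_domains(pages, is_math, min_fraction=0.1, min_pages=2):
    """Pick domains where a large enough share of pages is classified as math.

    pages: iterable of (url, text) pairs.
    is_math: callable text -> bool, a stand-in for a trained classifier.
    """
    total = defaultdict(int)
    hits = defaultdict(int)
    for url, text in pages:
        domain = urlparse(url).netloc
        total[domain] += 1
        if is_math(text):
            hits[domain] += 1
    return {
        domain for domain in total
        if total[domain] >= min_pages
        and hits[domain] / total[domain] >= min_fraction
    }

def recall_from_domains(pages, domains):
    """Keep every page from a selected domain, including ones the classifier missed."""
    return [(url, text) for url, text in pages if urlparse(url).netloc in domains]
```

In an iterative setup, the recalled pages could then be used to retrain the classifier before the next pass.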