Fan Zhou
@fzhou99.bsky.social
PhD Student. 🧑‍🍳 LLM.
🧑‍💻 ONE MORE THING. Please bear with my endless hype: we are still working hard to deliver something more! Please stay tuned!
April 11, 2025 at 6:36 PM
⛽️ It may seem very straightforward, but I swear it is not as easy as you would imagine. We learned a lot from this journey, and there were many bitter lessons. We cover these details in our report👇

Feel free to check:
🤗 hf.co/datasets/LLM...
📝 hf.co/papers/2504....
💻 github.com/LLM360/MegaM...
LLM360/MegaMath · Datasets at Hugging Face
April 11, 2025 at 6:36 PM
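For anyone who just wants to poke at the data, here is a minimal sketch of streaming it with the 🤗 datasets library. The repo ID LLM360/MegaMath comes from the link above, but the config name and the "text" field are assumptions; check the dataset card for the actual layout.

```python
# Minimal sketch: stream MegaMath from the Hugging Face Hub.
# The repo ID "LLM360/MegaMath" comes from the link above; the config name
# ("megamath-web-pro") is a guess -- see the dataset card for real configs.
from datasets import load_dataset

ds = load_dataset(
    "LLM360/MegaMath",
    name="megamath-web-pro",   # hypothetical config name
    split="train",
    streaming=True,            # avoid downloading hundreds of GB up front
)

for i, example in enumerate(ds):
    print(example["text"][:200])  # assumes a "text" field, typical for pre-training corpora
    if i >= 2:
        break
```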
💎 Talk is cheap. Show me the numbers.

Head-to-head comparisons with open datasets show that MegaMath leads in both quality and scale: MegaMath-Web-Pro outperforms FineMath-4plus by an absolute 4%! Plus, MegaMath-Llama-1B and 3B beat their base counterparts by up to 20% across 10 benchmarks.
April 11, 2025 at 6:36 PM
For the code domain, we applied model-based retrieval to construct a high-quality, math-related code corpus. For synthetic data, we explored the best prompting strategies, improved existing pipelines, and ultimately generated over 60B tokens of synthetic math data.
April 11, 2025 at 6:36 PM
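The post leaves "model-based retrieval" abstract, so here is a hedged sketch of one plausible reading, not the actual MegaMath recipe: embed candidate code snippets alongside a few math seed queries and keep the snippets whose similarity to any seed clears a cutoff. The embedding model, seed queries, and threshold are all illustrative assumptions.

```python
# Hedged sketch of model-based retrieval for math-related code.
# Model choice, seed queries, and threshold are assumptions for illustration;
# the actual MegaMath pipeline is described in the report and repo.
from sentence_transformers import SentenceTransformer, util

seed_queries = [
    "numerical integration and differential equation solvers",
    "symbolic algebra and theorem proving",
    "probability, statistics, and linear algebra routines",
]

candidate_snippets = [
    "def simpson(f, a, b, n): ...  # composite Simpson's rule",
    "app.get('/login', handler)    # web routing boilerplate",
]

model = SentenceTransformer("all-MiniLM-L6-v2")   # small general-purpose embedder
seed_emb = model.encode(seed_queries, convert_to_tensor=True)
cand_emb = model.encode(candidate_snippets, convert_to_tensor=True)

# Score each snippet by its best similarity to any math seed query.
scores = util.cos_sim(cand_emb, seed_emb).max(dim=1).values

THRESHOLD = 0.3  # illustrative cutoff; would be tuned on held-out labels
math_code = [s for s, sc in zip(candidate_snippets, scores) if sc >= THRESHOLD]
print(math_code)
```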
❓How did we do it? 1/2

For the web domain, we downloaded all of Common Crawl, developed an optimized HTML reformatting pipeline, trained a robust fastText-based filter, performed two-stage extraction, and explored optimal deduplication strategies.
April 11, 2025 at 6:36 PM
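As a rough illustration of the fastText filtering step, here is a hedged sketch: load a trained classifier and keep only documents it labels as math with high confidence. The model file, label name, and threshold are placeholders; the actual filter and thresholds are documented in the report and repo.

```python
# Hedged sketch of a fastText relevance filter over extracted web text.
# The model file, label, and threshold are placeholders, not MegaMath artifacts.
import fasttext

clf = fasttext.load_model("math_filter.bin")  # hypothetical trained filter

def keep(doc_text: str, threshold: float = 0.8) -> bool:
    """Return True if the classifier tags the document as math with high confidence."""
    # fastText expects a single line of text, so collapse newlines first.
    labels, probs = clf.predict(doc_text.replace("\n", " "), k=1)
    return labels[0] == "__label__math" and probs[0] >= threshold

docs = [
    "We prove that the series converges by the ratio test ...",
    "Best summer deals on flights and hotels!",
]
math_docs = [d for d in docs if keep(d)]
print(math_docs)
```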
🔍 Why MegaMath?

We’ve got numerous general pre-training datasets, but we still lack a large math pre-training dataset that scales to >100B tokens. Here it is: MegaMath covers diverse math categories and suits pre-training, continual training, and mid-training alike, flexible enough for a range of needs.
April 11, 2025 at 6:36 PM