Fan Zhou
@fzhou99.bsky.social
PhD Student. 🧑‍🍳 LLM.
💎 Talk is cheap. Show me the numbers.

Head-to-head comparisons with open datasets show that MegaMath leads in both quality and scale: MegaMath-Web-Pro outperforms FineMath-4plus by an absolute 4%! Plus, MegaMath-Llama-1B and 3B beat their base counterparts by up to 20% across 10 benchmarks.
April 11, 2025 at 6:36 PM
❓How did we do it? 2/2

For the code domain, we applied model-based retrieval to construct a high-quality corpus of math-related code. For synthetic data, we explored the best prompting strategies, improved existing pipelines, and ultimately generated over 60B tokens of synthetic math data.
April 11, 2025 at 6:36 PM
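For a flavor of what model-based retrieval over a code corpus can look like, here is a minimal sketch that ranks code files by embedding similarity to math-themed seed queries. The encoder name, seed queries, and cutoff are illustrative assumptions, not the actual MegaMath pipeline.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative setup: the encoder, seed queries, and cutoff are assumptions,
# not the configuration used to build MegaMath.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
seed_queries = [
    "numerical integration of a differential equation",
    "symbolic algebra and equation solving",
    "prime factorization and number theory utilities",
]
# Pre-compute normalized embeddings for the seed queries.
seed_emb = encoder.encode(seed_queries, normalize_embeddings=True)

def math_relevance(code_snippet: str) -> float:
    """Score a code file by its best cosine similarity to any seed query."""
    emb = encoder.encode(code_snippet, normalize_embeddings=True)
    return float(util.cos_sim(emb, seed_emb).max())

# Toy corpus: keep files scoring above the (illustrative) cutoff.
corpus = [
    "def gcd(a, b):\n    while b:\n        a, b = b, a % b\n    return a",
    "def render_navbar(items):\n    return '<ul>' + ''.join(items) + '</ul>'",
]
math_code = [f for f in corpus if math_relevance(f) > 0.2]
```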
❓How did we do it? 1/2

For the web domain, we downloaded all of Common Crawl, developed an optimized HTML reformatting pipeline, trained a robust fastText-based filter, performed two-stage extraction, and explored optimal deduplication strategies.
April 11, 2025 at 6:36 PM
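As a rough illustration of the fastText-based filtering step described above, the sketch below trains a binary math/non-math classifier and keeps only high-confidence documents. The training-file name, labels, hyperparameters, and threshold are all assumptions for illustration.

```python
import fasttext

# Train a supervised classifier from a labeled file, one example per line:
#   __label__math  <document text>
#   __label__other <document text>
# File name, labels, and hyperparameters are illustrative assumptions.
model = fasttext.train_supervised(
    input="math_filter_train.txt",
    epoch=5,
    wordNgrams=2,
    dim=128,
)

def keep_document(text: str, threshold: float = 0.9) -> bool:
    """Keep a page only if the classifier calls it math with high confidence."""
    # fastText predicts on a single line, so flatten newlines first.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__math" and probs[0] >= threshold
```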
🔍 Why MegaMath?

We’ve got plenty of general pre-training datasets, but we still lack a large math pre-training dataset that scales to >100B tokens. Here it is: MegaMath covers diverse math categories and suits pre-training, continual training, and mid-training alike, staying flexible across a range of training needs.
April 11, 2025 at 6:36 PM
🥁🥁
Happy to share our latest effort on math pre-training data: the MegaMath dataset! This nine-month project started in summer 2024, and we finally deliver the largest math pre-training dataset to date, containing 💥370B💥 tokens of web, code, and synthetic data!
April 11, 2025 at 6:36 PM