Fan Zhou
@fzhou99.bsky.social
PhD Student. 🧑‍🍳 LLM.
💎 Talk is cheap. Show me the numbers.

Head-to-head comparisons with open datasets show that MegaMath leads in both quality and scale: MegaMath-Web-Pro outperforms FineMath-4plus by an absolute 4%! Plus, MegaMath-Llama-1B and 3B beat their base counterparts by up to 20% across 10 benchmarks.
April 11, 2025 at 6:36 PM
❓How did we do it? 2/2

For the code domain, we applied model-based retrieval to construct a high-quality corpus of math-related code. For synthetic data, we explored the best prompting strategies, improved existing pipelines, and ultimately generated over 60B tokens of synthetic math data.
April 11, 2025 at 6:36 PM
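For a flavor of what model-based retrieval over a code corpus can look like, here is a minimal sketch that ranks code files by embedding similarity to math-themed seed queries. The encoder name, seed queries, and cutoff are illustrative assumptions, not the actual MegaMath pipeline.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative setup: the encoder, seed queries, and cutoff are assumptions,
# not the configuration used to build MegaMath.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
seed_queries = [
    "numerical integration of a differential equation",
    "symbolic algebra and equation solving",
    "prime factorization and number theory utilities",
]
# Pre-compute normalized embeddings for the seed queries.
seed_emb = encoder.encode(seed_queries, normalize_embeddings=True)

def math_relevance(code_snippet: str) -> float:
    """Score a code file by its best cosine similarity to any seed query."""
    emb = encoder.encode(code_snippet, normalize_embeddings=True)
    return float(util.cos_sim(emb, seed_emb).max())

# Toy corpus: keep files scoring above the (illustrative) cutoff.
corpus = [
    "def gcd(a, b):\n    while b:\n        a, b = b, a % b\n    return a",
    "def render_navbar(items):\n    return '<ul>' + ''.join(items) + '</ul>'",
]
math_code = [f for f in corpus if math_relevance(f) > 0.2]
```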
❓How did we do it? 1/2

For the web domain, we downloaded all of Common Crawl, developed an optimized HTML reformatting pipeline, trained a robust fastText-based filter, performed two-stage extraction, and explored optimal deduplication strategies.
April 11, 2025 at 6:36 PM
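As a rough illustration of the fastText-based filtering step described above, the sketch below trains a binary math/non-math classifier and keeps only high-confidence documents. The training-file name, labels, hyperparameters, and threshold are all assumptions for illustration.

```python
import fasttext

# Train a supervised classifier from a labeled file, one example per line:
#   __label__math  <document text>
#   __label__other <document text>
# File name, labels, and hyperparameters are illustrative assumptions.
model = fasttext.train_supervised(
    input="math_filter_train.txt",
    epoch=5,
    wordNgrams=2,
    dim=128,
)

def keep_document(text: str, threshold: float = 0.9) -> bool:
    """Keep a page only if the classifier calls it math with high confidence."""
    # fastText predicts on a single line, so flatten newlines first.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__math" and probs[0] >= threshold
```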
🔍 Why MegaMath?

We’ve got plenty of general pre-training datasets, but we still lack a large math pre-training dataset that scales to >100B tokens. Here it is: MegaMath covers diverse math categories and suits pre-training, continual training, and mid-training alike, staying flexible across a range of training needs.
April 11, 2025 at 6:36 PM
🥁🥁
Happy to share our latest effort on math pre-training data: the MegaMath dataset! This nine-month project started in summer 2024, and we finally deliver the largest math pre-training dataset to date, containing 💥370B💥 tokens of web, code, and synthetic data!
April 11, 2025 at 6:36 PM