Feel free to check:
🤗 hf.co/datasets/LLM...
📝 hf.co/papers/2504....
💻 github.com/LLM360/MegaM...
Head-to-head comparisons with open datasets show that MegaMath leads in both quality and scale: MegaMath-Web-Pro outperforms FineMath-4plus by 4% absolute! Plus, MegaMath-Llama-1B and 3B beat their base counterparts by up to 20% across 10 benchmarks.
For the web domain, we downloaded all of Common Crawl, developed an optimized HTML reformatting pipeline, trained a robust fastText-based filter, performed two-stage extraction, and explored optimal deduplication strategies.
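To make the filtering step concrete, here is a minimal sketch of training a fastText classifier to separate math from non-math web text. This is an illustration, not the exact MegaMath recipe: the file names, labels, and hyperparameters below are assumptions, and it uses the standard `fasttext` pip package.

```python
# Minimal sketch (NOT the authors' exact recipe) of a fastText-based math filter.
# Assumes a labeled training file in fastText format, one example per line:
#   __label__math   Let f be a continuous function on [0, 1] ...
#   __label__other  Top 10 travel destinations for the summer ...
import fasttext

# Hypothetical file name; swap in your own labeled data.
model = fasttext.train_supervised(
    input="math_filter_train.txt",
    epoch=5,
    lr=0.1,
    wordNgrams=2,   # bigrams help catch phrases like "prove that"
    dim=100,
)
model.save_model("math_filter.bin")

# Score a candidate web document: keep it only if the math probability is high.
labels, probs = model.predict("Let x be a real number such that x^2 = 2 ...")
if labels[0] == "__label__math" and probs[0] > 0.9:
    print("keep for the math corpus")
```

A fast linear classifier like this is the usual choice at Common Crawl scale, since it can score billions of documents cheaply before the heavier extraction and deduplication stages run.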
We’ve got plenty of general pre-training datasets, but the community still lacks a math pre-training dataset that scales beyond 100B tokens. Enter MegaMath: it covers diverse math categories and suits pre-training, continual training, and mid-training alike, flexible across various demands.
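If you want to try it, here is a minimal sketch of streaming the data with the Hugging Face `datasets` library. The repo id and the `text` field name are assumptions based on the links above; check the dataset card for the exact identifiers and available subsets.

```python
# Minimal sketch: stream a few examples from the dataset on the Hub.
# "LLM360/MegaMath" and the "text" field are assumed; see the dataset card.
from datasets import load_dataset

ds = load_dataset(
    "LLM360/MegaMath",   # assumed repo id
    split="train",
    streaming=True,      # stream: the corpus is too large to download eagerly
)

for i, example in enumerate(ds):
    print(example["text"][:200])  # peek at the first 200 characters
    if i == 2:
        break
```

Streaming mode avoids materializing hundreds of billions of tokens on disk, which is handy when you only want to inspect or subsample the corpus.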