Head-to-head comparisons with open datasets show that MegaMath leads in both quality and scale: MegaMath-Web-Pro outperforms FineMath-4plus by 4% absolute! Plus, MegaMath-Llama-1B and 3B beat their Base counterparts by up to 20% across 10 benchmarks.
For the web domain, we downloaded all of Common Crawl, developed an optimized HTML reformatting pipeline, trained a robust FastText-based filter, performed two-stage extraction, and explored optimal deduplication strategies.
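To make the deduplication step a bit more concrete, here is a minimal sketch of MinHash-based near-duplicate removal. This is only an illustration under our own assumptions, not the MegaMath pipeline itself: the function names, the 3-word shingles, the 64 hash functions, and the 0.8 threshold are all hypothetical choices.

```python
import hashlib


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in `text`."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(text, num_perm=64):
    """Build a MinHash signature: for each seeded hash, keep the minimum over all shingles."""
    doc_shingles = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in doc_shingles
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs, threshold=0.8):
    """Keep a document only if it is not a near-duplicate of one already kept.

    Hypothetical threshold; the pairwise comparison is quadratic and only meant for illustration.
    """
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept
```

At Common Crawl scale, comparing every pair of documents is not feasible, so a real pipeline would typically bucket signatures with locality-sensitive hashing instead of the simple loop shown here.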
We’ve got numerous general pre-training datasets, but we still lack a large math pre-training dataset that scales to >100B tokens. And here it is: MegaMath covers diverse math categories and is suitable for pre-training, continual training, and mid-training, so it stays flexible across various demands.
Happy to share our latest efforts on math pre-training data, the MegaMath dataset! This is a 9-month project that started in summer 2024, and we finally deliver the largest math pre-training dataset to date, containing 💥370B💥 tokens of web, code, and synthetic data!