The largest publicly available corpus sourced exclusively from PDFs, containing about 3 trillion tokens across 475 million documents in 1733 languages.
- Long context
- 3T tokens from high-demand domains like legal and science.
- Substantially improves over the state of the art (SoTA)