The paper introduces an interesting pre-training pipeline for handling long context, and the model was trained on 4.4T tokens: arxiv.org/pdf/2504.07491
• 256M delivers 80% of the performance of our 2.2B model.
• 500M hits 90%.
Both beat our SOTA 80B model from 17 months ago! 🎉
Efficiency 🤝 Performance
Explore the collection here: huggingface.co/collections/...
Blog: huggingface.co/blog/smolervlm
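If you want to try one of the small checkpoints yourself, here is a minimal sketch using the standard transformers vision-to-sequence interface. The repo id `HuggingFaceTB/SmolVLM-256M-Instruct` and the example image URL are assumptions for illustration; check the linked collection for the actual model names.

```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

# Assumed repo id for the 256M instruct checkpoint; see the collection for current names.
model_id = "HuggingFaceTB/SmolVLM-256M-Instruct"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Any image URL or local path works here; this one is a placeholder.
image = load_image("https://example.com/sample.jpg")

# Build a chat-style prompt with one image and one text turn.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(text=prompt, images=[image], return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```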