Training SmolLMs and curating high quality web and synthetic datasets ✨
https://loubnabnl.github.io/
Check out Anton’s thread to learn how we curated the best public math pre-training dataset.
Math remains challenging for LLMs, and by training on FineMath we see considerable gains over other math datasets, especially on GSM8K and MATH.
🤗 huggingface.co/datasets/Hug...
Here’s a breakdown 🧵
- Synthetic data is everywhere
- Model collapse, is the web polluted?
- 3B+ models running on your iPhone
- When and why use smol models?
🧵>>
• SmolLM2 nanotron checkpoints (with optimizer states) for easier continual pre-training
• Local inference demos (MLC, Transformers.js, MLX, llama.cpp)
• SmolVLM: Vision-language model built on SmolLM2
github.com/huggingface/...
Check @andimara.bsky.social's Smol Tools for summarization and rewriting. It uses SmolLM2 to summarize text and make it more friendly or professional, all running locally thanks to llama.cpp github.com/huggingface/...
Powered by 🤗 Transformers.js and ONNX Runtime Web!
How many tokens/second do you get? Let me know! 👇
My notes here: simonwillison.net/2024/Nov/29/...
Powered by MLC Web-LLM & XGrammar ⚡
Define a JSON schema, input free text, and get structured data right in your browser - profit!!
US: apply.workable.com/huggingface/...
EMEA: apply.workable.com/huggingface/...
SmolVLM can be fine-tuned on a Google Colab and run on a laptop! Or process millions of documents with a consumer GPU!
* any dataset on the 🤗 Hub can become an eval task in a few lines of code: customize the prompt, metrics, parsing, few-shots, everything!
* model- and data-parallel inference
* auto batching with the new vLLM backend
By adding a few lines of code to the base Llama 3 tokenizer, he got a free boost in arithmetic performance 😮
[thread]
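The post does not say what the tokenizer change was. As an illustration only, here is a minimal sketch of one technique sometimes used for this purpose: splitting each run of digits into three-digit chunks counted from the right, so that chunk boundaries line up with place value. The function name `group_digits_right_to_left` and the chunk size are hypothetical, and this is plain Python, not the Llama 3 tokenizer or the code from the thread.

```python
import re


def group_digits_right_to_left(text: str, size: int = 3) -> list[str]:
    """Split text into pieces, chunking every digit run from the right.

    Hypothetical helper for illustration; not taken from any tokenizer library.
    """
    pieces: list[str] = []
    # re.split with a capture group alternates: non-digits, digits, non-digits, ...
    for index, part in enumerate(re.split(r"(\d+)", text)):
        if not part:
            continue
        if index % 2 == 1:  # odd positions are the captured digit runs
            head = len(part) % size  # leftover digits go in the leftmost chunk
            if head:
                pieces.append(part[:head])
            pieces.extend(part[i:i + size] for i in range(head, len(part), size))
        else:
            pieces.append(part)
    return pieces


print(group_digits_right_to_left("1234567"))    # ['1', '234', '567']
print(group_digits_right_to_left("12 + 3456"))  # ['12', ' + ', '3', '456']
```

With left-to-right chunking, "1234567" would split as 123 | 456 | 7, so the same digit position can land in different chunks depending on the number's length. Chunking from the right avoids that.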
Pre-training & evaluation code, synthetic data generation pipelines, post-training scripts, on-device tools & demos
Apache 2.0. V2 data mix coming soon!
Which tools should we add next?
The dataset for SmolLM2 was created by combining multiple existing datasets and generating new synthetic datasets, including MagPie Ultra v1.0, using distilabel.
Check out the dataset:
huggingface.co/datasets/Hug...
Unsurprisingly: data, data, data!
The SmolTalk dataset is open and available here: huggingface.co/datasets/Hug...