Lightnews — Scholar-powered news

Loubna Ben Allal

@loubnabnl.hf.co

We built code datasets, English datasets, and now it’s time for math! 🚀

Check out Anton’s thread to learn how we curated the best public math pre-training dataset.

Anton @anton-l.bsky.social · Dec 19

Introducing 📐FineMath: the best open math pre-training dataset with 50B+ tokens!

Math remains challenging for LLMs and by training on FineMath we see considerable gains over other math datasets, especially on GSM8K and MATH.

🤗 huggingface.co/datasets/Hug...

Here’s a breakdown 🧵

A plot showing increased performance of Llama-3.2-3B when pretrained on FineMath

December 19, 2024 at 4:02 PM

Loubna Ben Allal

@loubnabnl.hf.co

Sharing my slides on "Synthetic data and smol models in 2024" from yesterday's Latent Space event at NeurIPS: docs.google.com/presentation...

- Synthetic data is everywhere
- Model collapse, is the web polluted?
- 3B+ models running on your iPhone
- When and why use smol models?

Synthetic data & Smol models in 2024

Loubna Ben Allal Hugging Face Synthetic data and smol models in 2024 loubnabnl LoubnaBenAllal1

docs.google.com

December 12, 2024 at 2:54 PM

Reposted by Loubna Ben Allal

Tavis Rudd

@tavis.damnsimple.com

Another great talk at @latentspacepod.bsky.social NeurIPS: @loubnabnl.hf.co on Synthetic Data & Smol Models

December 11, 2024 at 10:45 PM

Reposted by Loubna Ben Allal

Ben Burtenshaw

@benburtenshaw.bsky.social

For anyone interested in fine-tuning or aligning LLMs, I’m running this free and open course called smol course. It’s not a big deal, it’s just smol.

🧵>>

December 3, 2024 at 9:21 AM

Reposted by Loubna Ben Allal

Caleb Fahlgren

@calebfahlgren.hf.co

The amazing, new Qwen2.5-Coder 32B model can now write SQL for any @hf.co dataset ✨

December 2, 2024 at 12:48 PM

Loubna Ben Allal

@loubnabnl.hf.co

We hit 1K ⭐ on our SmolLM repo—thank you! 🎉 New updates:

• SmolLM2 nanotron checkpoints (with optimizer states) for easier continual pre-training
• Local inference demos (MLC, Transformers.js, MLX, llama.cpp)
• SmolVLM: Vision-language model built on SmolLM2

github.com/huggingface/...

December 1, 2024 at 8:00 AM

Loubna Ben Allal

@loubnabnl.hf.co

📬 Summarize and rewrite your text/emails faster, and offline!

Check @andimara.bsky.social's Smol Tools for summarization and rewriting. It uses SmolLM2 to summarize text and make it more friendly or professional, all running locally thanks to llama.cpp github.com/huggingface/...

smollm/smol_tools at main · huggingface/smollm

Everything about the SmolLM & SmolLM2 family of models - huggingface/smollm

github.com

November 30, 2024 at 3:59 PM

Reposted by Loubna Ben Allal

Xenova

@xenova.bsky.social

WOW! 🤯 Language models are becoming smaller and more capable than ever! Here's SmolLM2 running 100% locally in-browser w/ WebGPU on a 6-year-old GPU. Just look at that speed! ⚡️😍

Powered by 🤗 Transformers.js and ONNX Runtime Web!

How many tokens/second do you get? Let me know! 👇

November 27, 2024 at 1:51 PM

Reposted by Loubna Ben Allal

Simon Willison

@simonwillison.net

This demo of structured data extraction running on an LLM that executes entirely in the browser (Chrome only for the moment since it uses WebGPU) is amazing

My notes here: simonwillison.net/2024/Nov/29/...

November 29, 2024 at 9:10 PM

Reposted by Loubna Ben Allal

vb

@reach-vb.hf.co

Fuck it! Structured Generation w/ SmolLM2 running in browser & WebGPU 🔥

Powered by MLC Web-LLM & XGrammar ⚡

Define a JSON schema, Input free text, get structured data right in your browser - profit!!

November 28, 2024 at 10:24 PM

Reposted by Loubna Ben Allal

Elie

@eliebak.hf.co

We’re looking for an intern to join our SmolLM team! If you’re excited about training LLMs and building high-quality datasets, we’d love to hear from you. 🤗

US: apply.workable.com/huggingface/...
EMEA: apply.workable.com/huggingface/...

ML Research Engineer Internship, SmolLMs pretraining and datasets - EMEA Remote - Hugging Face

Here at Hugging Face, we’re on a journey to advance good Machine Learning and make it more accessible. Along the way, we contribute to the development of technology for the better.We have built the fa...

apply.workable.com

November 27, 2024 at 10:20 AM

Reposted by Loubna Ben Allal

Andi

@andimara.bsky.social

Let's go! We are releasing SmolVLM, a smol 2B VLM built for on-device inference that outperforms all models at similar GPU RAM usage and tokens throughputs.

SmolVLM can be fine-tuned on a Google collab and be run on a laptop! Or process millions of documents with a consumer GPU!

November 26, 2024 at 3:57 PM

Reposted by Loubna Ben Allal

Anton

@anton-l.bsky.social

Check out how easy it is to do LLM evals with LightEval!

* any dataset on the 🤗 Hub can become an eval task in a few lines of code: customize the prompt, metrics, parsing, few-shots, everything!
* model- and data-parallel inference
* auto batching with the new vLLM backend

A screenshot of LightEval benchmarking results in a terminal

November 25, 2024 at 5:24 PM

Reposted by Loubna Ben Allal

Thomas Wolf

@thomwolf.bsky.social

It's Sunday morning so taking a minute for a nerdy thread (on math, tokenizers and LLMs) of the work of our intern Garreth

By adding a few lines of code to the base Llama 3 tokenizer, he got a free boost in arithmetic performance 😮

[thread]

November 24, 2024 at 11:05 AM

Loubna Ben Allal

@loubnabnl.hf.co

Making SmolLM2 more reproducible: open-sourcing our training & evaluation toolkit 🛠️ github.com/huggingface/...

Pre-training & evaluation code, synthetic data generation pipelines, post-training scripts, on-device tools & demos

Apache 2.0. V2 data mix coming soon!

Which tools should we add next?

GitHub - huggingface/smollm: Everything about the SmolLM & SmolLM2 family of models

Everything about the SmolLM & SmolLM2 family of models - GitHub - huggingface/smollm: Everything about the SmolLM & SmolLM2 family of models

github.com

November 24, 2024 at 7:16 AM

Reposted by Loubna Ben Allal

Gabriel Martín Blázquez

@gabrielmb.com

Excited to announce the SFT dataset used for @huggingface.bsky.social SmolLM2!

The dataset for SmolLM2 was created by combining multiple existing datasets and generating new synthetic datasets, including MagPie Ultra v1.0, using distilabel.

Check out the dataset:
huggingface.co/datasets/Hug...

HuggingFaceTB/smoltalk · Datasets at Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

huggingface.co

November 21, 2024 at 3:22 PM

Reposted by Loubna Ben Allal

Leandro von Werra

@lvwerra.bsky.social

What's the secret sauce of SmolLM2 to beat LLM titans like Llama3.2 and Qwen2.5?

Unsurprisingly: data, data, data!

The SmolTalk is open and available here: huggingface.co/datasets/Hug...

November 21, 2024 at 2:17 PM

Add to Home Screen

Light up
your news

Add to Home Screen

Light upyour news

Sign in to Lightnews

Sign up to start reading

Connect Bluesky

Connect with Bluesky

Light up
your news