David Berenstein
davidberenstein.bsky.social
ML & DevRel @ Giskard & Pruna | ex HF 🤗 | 👨🏽‍🍳 Cooking, 👨🏽‍💻 Coding, 🏆 Committing
🔥 Bespoke curator: Synthetic Data Curation for Post-Training & Structured Data Extraction

Create synthetic data pipelines with ease!
- Retries and caching included
- Inference via LiteLLM, vLLM, and popular batch APIs
- Asynchronous operations

🔗 URL: buff.ly/ajPRT1l
April 10, 2025 at 12:00 PM
🔥One > token > at > a > time < a < at < token < One 🔥

token-explorer is a simple tool that lets you explore different possible paths that an LLM might sample!

- Arrow keys to navigate, pop and append tokens
- View the token probabilities and entropies.

GitHub: buff.ly/FQgsczM
April 3, 2025 at 12:22 PM
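
For anyone wondering what the probabilities and entropies in the token-explorer post mean in practice, here is a minimal sketch. It is not token-explorer's code: the candidate tokens and logits are made up for illustration, and `softmax` and `entropy_bits` are hypothetical helper names. The idea is that low entropy means the model is nearly certain about the next token, while high entropy means many paths are about equally likely.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical next-token candidates and logits, made up for illustration.
candidates = ["Paris", "London", "Rome", "Berlin"]
confident = softmax([9.0, 1.0, 0.5, 0.2])  # one token dominates
uncertain = softmax([1.0, 1.0, 1.0, 1.0])  # all tokens equally likely

for name, probs in [("confident", confident), ("uncertain", uncertain)]:
    print(name, [round(p, 3) for p in probs], round(entropy_bits(probs), 3))
```

With four equally likely candidates the entropy is exactly 2 bits, the maximum for four options; the confident distribution sits close to 0.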
🔥 The smolagents module has arrived in the agents course!

💻 Code agents optimised for software development
🔧 Tool calling agents that create modular, function-driven workflows
🔍 Retrieval agents designed to access and synthesise information

Course: https://buff.ly/4kcj6Ai
February 25, 2025 at 3:40 PM
🧑‍🏫 Awesome. My talk for PyCon Italy 2025 got accepted!

Got data problems? Relax. Synthetic data is here to help.

Talk: https://buff.ly/3QzoZKj
February 25, 2025 at 8:54 AM
🐳 Announcing Docker support to quickly set up your Synthetic Data Generator (Gradio + Ollama + Argilla)!

🔥 Build genuinely useful datasets using natural language!

⚖️ Scale however you need.

🔐 Use them privately or share them with the world!

🧑‍💻 GitHub: https://buff.ly/49IDSmd
February 21, 2025 at 8:00 AM
Image Generation has landed in Arena form 🎨🤖!

1. Describe your desired image🎨
2. Two anonymous models output images
3. Vote for the winner!

Images have been sourced from our Open Image Preference dataset!
Dataset: https://buff.ly/4il0du9
Arena: https://buff.ly/4142NwH
February 19, 2025 at 11:05 AM
Are you the top of the Agents class?!

We just released a bonus unit on function calling (FC).

You will learn:
⑴ What is FC?
⑵ Thought → Act → Observe Cycle in FC
⑶ Lightweight and efficient fine-tuning

Course: https://buff.ly/3Qn1DHB
February 18, 2025 at 4:14 PM
🚀 Find banger tools for your smolagents!

I created the Tools gallery, which makes tools specifically developed by/for smolagents searchable and visible. This will help with:
- inspiration
- best practices
- finding cool tools

Space: https://buff.ly/41cYctx
February 12, 2025 at 9:15 AM
🔥 Come and get those AI agents certificates!

Join the cohort of 66K students: https://buff.ly/4hxb6rK
February 10, 2025 at 2:38 PM
Documents or images to structured data using Vision Language Models

Outlines has an integration with transformers, which facilitates structured generation by restricting which tokens can be sampled at each step.

Blog: https://buff.ly/4jFHMkr
February 10, 2025 at 1:00 PM
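
The post above describes structured generation as restricting what the model is allowed to sample. Here is a rough sketch of that idea in plain Python. It is not the Outlines API: the vocabulary, logits, and the `masked_probs` helper are all made up for illustration. Tokens that would break the target structure get a logit of negative infinity, so their probability becomes exactly zero after normalisation.

```python
import math

def masked_probs(logits, allowed):
    """Return probabilities with every disallowed token forced to zero."""
    if not allowed:
        raise ValueError("at least one token must be allowed")
    masked = [x if i in allowed else -math.inf for i, x in enumerate(logits)]
    peak = max(masked)
    exps = [math.exp(x - peak) for x in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and logits, made up for illustration.
vocab = ["{", "}", "hello", "42", '"']
logits = [1.0, 0.5, 3.0, 2.0, 0.1]

# Assume the schema says the next token must open a JSON object or a string.
allowed = {vocab.index("{"), vocab.index('"')}
probs = masked_probs(logits, allowed)

for token, p in zip(vocab, probs):
    print(repr(token), round(p, 3))
```

Note that "hello" had the highest raw logit but can no longer be sampled; the remaining probability mass is shared between the two allowed tokens.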
Local Docker deployments for the synthetic data generator 🫱🏾‍🫲🏼

We would love to hear your thoughts!

PR: https://buff.ly/4hRMny6
February 10, 2025 at 10:13 AM
"Why 🚀?", you may wonder.

smolagents' effortlessness combined with the power of 400,000 AI tools available on the Hub!

library: https://buff.ly/4hj6PrJ
February 7, 2025 at 12:14 PM
WOW, this will rock the world! Hibiki is a model for simultaneous speech2speech translation.

And it actually works.

Available in French-English but super excited to see what the community will do.

Hub: https://buff.ly/3EtmM0f
Paper: https://buff.ly/4jIXNGd
February 6, 2025 at 3:06 PM
Anyone can create free hosted tools for their AI agents! 🔥

Agentic RAG stack part 2 - augment
Augmenting retrieval results with reranking improves the retrieved content without adding much latency.

part2: https://buff.ly/40HkB0x
part1: https://buff.ly/40XNIxM
code: https://buff.ly/4hEajpj
February 5, 2025 at 10:11 AM
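
To make the retrieve-then-rerank idea from the post above concrete, here is a toy sketch. It is not the code from the linked posts: both scoring functions are simple word-overlap stand-ins invented for illustration, where a real stack would more likely use vector search for the first stage and a dedicated reranking model for the second. The point it shows is that the more careful scorer only sees the few candidates that survive the first stage, which keeps the extra cost bounded.

```python
def tokens(text):
    """Lowercased set of whitespace-separated words."""
    return set(text.lower().split())

def cheap_score(query, doc):
    """First-stage stand-in: number of shared words (fast, coarse)."""
    return len(tokens(query) & tokens(doc))

def rerank_score(query, doc):
    """Second-stage stand-in: Jaccard overlap, which penalises long, unfocused text."""
    q, d = tokens(query), tokens(doc)
    union = q | d
    return len(q & d) / len(union) if union else 0.0

def retrieve_then_rerank(query, docs, k_retrieve=3, k_final=2):
    """Shortlist with the cheap score, then reorder only the shortlist."""
    shortlist = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:k_retrieve]
    return sorted(shortlist, key=lambda d: rerank_score(query, d), reverse=True)[:k_final]

# Made-up documents for illustration.
docs = [
    "reranking improves retrieval results for rag and many other search tasks in large production systems today",
    "reranking improves retrieval results",
    "cooking pasta at home",
    "retrieval results",
]
query = "reranking retrieval results"
result = retrieve_then_rerank(query, docs)
print(result)
```

The long first document ties for the top spot in the cheap stage but drops out after reranking, because most of its words have nothing to do with the query.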
Shit! 24B is the new small.

Mistral drops their new model on Hugging Face!

Great performance, and low latency.

Model: https://buff.ly/4hwAzBa
Code: https://buff.ly/3CEohrF
January 30, 2025 at 5:10 PM
Deploy a DeepSeek Web App with minimal code!

AI Gradio is a Python package that makes it easy for developers to create AI apps powered by various AI providers.

Code: https://buff.ly/40BDsde
Library: https://buff.ly/3CvOQ2n
January 30, 2025 at 4:44 PM
No data for fine-tuning retrieval models?

We help you generate it!

- Load from Hub
- Upload your own files
- Generate from a prompt

Space: https://buff.ly/3Y1S99z
Code: https://buff.ly/3PRg4TX
January 30, 2025 at 12:37 PM
⚡️ Embed 1 million records in <10 minutes

Load the data, use static embeddings, and reupload.

Ready for vector search but might require some reranking.

Library: https://buff.ly/42miwte
January 29, 2025 at 1:00 PM
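
Why can static embeddings be this fast? A rough sketch of the general idea, not the linked library's API: each token maps to a fixed vector in a lookup table, and a text embedding is just the average of its token vectors, so no neural network forward pass is needed. The three-dimensional `TABLE` below is made up for illustration; real tables are learned and far larger.

```python
import math

# Hypothetical lookup table, made up for illustration.
TABLE = {
    "vector": [1.0, 0.0, 0.0],
    "search": [0.0, 1.0, 0.0],
    "fast": [0.0, 0.0, 1.0],
}
DIM = 3

def embed(text):
    """Mean of the per-token vectors; tokens missing from the table are skipped."""
    vectors = [TABLE[t] for t in text.lower().split() if t in TABLE]
    if not vectors:
        return [0.0] * DIM
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

print(embed("vector search"))
print(cosine(embed("vector search"), embed("search vector")))
```

Averaging ignores word order, so "vector search" and "search vector" get the same embedding. That loss of nuance is one plausible reason a reranking step can still help afterwards.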
🔥 The synthetic data for SmolLM and open DeepSeek-R1 relies on this awesome package!

- 1.2K distilabel datasets on the Hub https://buff.ly/3PW46si
- Reproducible and shareable pipelines
- Any LLM provider
- Scale however you want

library: https://buff.ly/3MXAB8G
January 29, 2025 at 8:06 AM
Today, we are launching the integration of four awesome serverless Inference Providers – fal, Replicate, Sambanova, Together AI!

Want to know how it works?

Read the blog: https://buff.ly/3CreCES
January 28, 2025 at 2:29 PM
🐳 DeepSeek is on Hugging Face 🤗

Free for inference!
1K requests for free
20K requests with PRO

Code: https://buff.ly/4glAAa5
900 models more: https://buff.ly/40x1rua
January 28, 2025 at 1:00 PM
Let's uncover the post-training dataset from DeepSeek-R1 with Magpie!

Pass pre-query tokens `<|begin▁of▁sentence|>User: `, let the model generate the rest.

We get realistic examples!

Gist: https://buff.ly/40nPHu0
Library: https://buff.ly/3MXAB8G
January 27, 2025 at 8:35 AM
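
The trick from the post above, sketched in a few lines. This is not the code from the linked gist or library: `make_fake_generate` is a hypothetical stand-in for a real text-completion call, and its canned outputs are invented so the example runs offline. Only the pre-query string comes from the post. The model is given nothing but the prefix that normally precedes a user message, so whatever it writes next looks like a user query it saw during post-training.

```python
from itertools import cycle
from typing import Callable, List

# Pre-query prefix quoted in the post.
PRE_QUERY = "<|begin▁of▁sentence|>User: "

def make_fake_generate() -> Callable[[str], str]:
    """Hypothetical stand-in for a model call; cycles through canned completions."""
    canned = cycle([
        "How do I reverse a list in Python?",
        "Explain what entropy means in simple terms.",
    ])
    def generate(prompt: str) -> str:
        assert prompt == PRE_QUERY  # the model only ever sees the prefix
        return next(canned)
    return generate

def magpie_samples(generate: Callable[[str], str], n: int) -> List[str]:
    """Call the model n times with only the pre-query prefix and keep non-empty outputs."""
    samples = []
    for _ in range(n):
        completion = generate(PRE_QUERY).strip()
        if completion:
            samples.append(completion)
    return samples

samples = magpie_samples(make_fake_generate(), 3)
print(samples)
```

With a real model you would sample with some randomness so that repeated calls give different queries; the stub just cycles through fixed strings.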
You might have thought VLMs could not get smoller?

🐁 Hugging Face proves you wrong and launches SmolVLM 256M & 500M. You can fine-tune it on your laptop and run it on your toaster!
👇
🐘 Beats SOTA 80B from less than 2 years ago!

Model: https://buff.ly/4g9bGur
January 23, 2025 at 12:34 PM
ColPali and VLMs are great for multi-modal RAG with truly effective document retrieval.

Want to set up this pipeline yourself?

Read the blog: https://buff.ly/42rNPTG
January 23, 2025 at 10:41 AM
For a while, companies have been showing off their AI competence on the Hub with their datasets, models, and Spaces.

Now, you can do the same with more nuance by linking blogs to your organisation!

blog: https://buff.ly/3C3IzLe
January 22, 2025 at 1:00 PM