I'm working on ML and advocacy. So far, I've focused on building Argilla and distilabel, but I'll also share updates and cool stuff from the AI community, tools, and notebooks.
P.S. Maybe a bit of my 🐕 and 🎮 too!
1️⃣ Generate from your documents, dataset, or dataset description.
2️⃣ Configure it.
3️⃣ Generate the synthetic dataset.
4️⃣ Fine-tune the retrieval and reranking models.
5️⃣ Build a RAG pipeline.
Let me show you how EASY it is to export your annotated datasets from Argilla to the Hugging Face Hub. 🤩
Take a look at this quick demo 👇
💁‍♂️ More info about the release at github.com/argilla-io/a...
#AI #MachineLearning #OpenSource #DataScience #HuggingFace #Argilla
1️⃣ Use the Synthetic Data Generator to create your custom dataset
2️⃣ Use AutoTrain to train your model on the generated dataset
Check it out here: huggingface.co/blog/synthet...
Want to see how it works? Watch this quick video (www.youtube.com/watch?v=nXjV...) and get started here: t.co/hJ1b2TsMq0
data-is-better-together-fineweb-c.hf.space/share-your-p...
💫 Join us to build an impactful dataset for your language!
- Open-source dataset for text2image
- 10K samples manually evaluated by the HF community.
- Binarized format for SFT, DPO, or ORPO.
It comes with a nice blog post explaining the steps to pre-process and generate the data, along with the results.
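For anyone wondering what "binarized" means here, this is a minimal sketch and not the actual pipeline from the blog post. It assumes each evaluated sample compares two images for one prompt. The field names (`prompt`, `chosen`, `rejected`) and the vote counts are hypothetical, picked only for illustration.

```python
from typing import Optional


def binarize(prompt: str, image_a: str, image_b: str,
             votes_a: int, votes_b: int) -> Optional[dict]:
    """Turn one pairwise human evaluation into a chosen/rejected record.

    All field names are hypothetical; the real dataset schema may differ.
    """
    if votes_a == votes_b:
        # A tie carries no preference signal, so the pair is skipped.
        return None
    if votes_a > votes_b:
        chosen, rejected = image_a, image_b
    else:
        chosen, rejected = image_b, image_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


record = binarize("a cat on a sofa", "a.png", "b.png", votes_a=3, votes_b=1)
print(record)
```

The idea is that preference-based methods like DPO or ORPO can use both sides of each pair, while SFT would only need the `chosen` side.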
If there's already a Language Lead, stay tuned! Is this the start of a nice community?
docs.google.com/forms/d/e/1F...
✨Use the latest Argilla features
✨Improve human-in-the-loop workflows
✨Manage datasets, track progress, and coordinate your annotation team
> The results are very promising, beating o1-mini.
> However, they also have several limitations you might notice even in the demo (I ran into endless reasoning when it tried to count the 'r's in 🍓). So, let's see how they deal with them.
Hugging Face empowers everyone to use AI to create value and is against the monopolization of AI. It's a hosting platform above all.
Why is this a huge deal? Llama.cpp is well known for running very well on CPU. If you're running small models like Llama 1B or embedding models, this can save you tons of money 💰 💰
> Goal: Release an open-source image dataset, enabling the entire community to benefit from it.
> Requirements: All you need is a Hugging Face account and a willingness to contribute.
More in 🧵
At @huggingface.bsky.social we'll launch a huge community sprint soon to build high-quality training datasets for many languages.
We're looking for Language Leads to help with outreach.
Find your language and nominate yourself:
forms.gle/iAJVauUQ3FN8...
So much of AI is based on exploiting workers in precarious conditions 😔
www.cbsnews.com/news/labeler...
Thanks to everyone for the support! We'll continue shipping new updates for you to curate your data easily ✍️ And, for sure, your feedback is more than welcome 🙌
Pre-training & evaluation code, synthetic data generation pipelines, post-training scripts, on-device tools & demos
Apache 2.0. V2 data mix coming soon!
Which tools should we add next?
.
.
.
Yes, you're right! You can now add your Bluesky account and also check your recent activity 🙌
As @natolambert.bsky.social says, it sets the next era in open post-training.
My highlight? The data generation & open datasets
Want to deep dive into the data? Here's an Argilla @huggingface.bsky.social Space
huggingface.co/spaces/argil...
It uses a mix of public and synthetic data, including Magpie Ultra using distilabel 🚀
huggingface.co/datasets/Hug...
The dataset for SmolLM2 was created by combining multiple existing datasets and generating new synthetic datasets, including MagPie Ultra v1.0, using distilabel.
Check out the dataset:
huggingface.co/datasets/Hug...