Lightnews — Scholar-powered news

Reposted by Daniel Vila

Florent Daudens

@fdaudens.bsky.social

🚀 The open source community is unstoppable: 4M total downloads for DeepSeek models on @hf.co, with 3.2M coming from the 600+ models created by the community. That's 30% more than yesterday!

January 28, 2025 at 5:55 PM

Reposted by Daniel Vila

Sara Han

@sdiazlor.hf.co

💫 Generate RAG data with the Synthetic Data Generator to improve your RAG system!

1️⃣ Generate from your documents, dataset, or dataset description.
2️⃣ Configure it.
3️⃣ Generate the synthetic dataset.
4️⃣ Fine-tune the retrieval and reranking models.
5️⃣ Build a RAG pipeline.

January 20, 2025 at 4:42 PM

Reposted by Daniel Vila

Natalia

@nataliaelv.hf.co

New chapter in the Hugging Face NLP course! 🤗 🚀

We've added a new chapter about the very basics of Argilla to the Hugging Face NLP course. Learn how to set up an Argilla instance, load & annotate datasets, and export them to the Hub.

Any feedback for improvements welcome!

Screenshot of the Introduction to Argilla in Chapter 10 of the Hugging Face NLP course

January 17, 2025 at 10:02 AM

Reposted by Daniel Vila

Daniel van Strien

@danielvanstrien.bsky.social

🎉 50,000+ annotations reached! The FineWeb2-C community is helping build better language models one annotation at a time.

📊 Current stats:
- 115 languages represented
- 419 amazing contributors
- 24 languages with complete datasets

But we're not done yet! 🧵

Screenshot of this text: Total annotations submitted: 50,035 Languages with annotations: 115 Total contributors: 419

January 16, 2025 at 5:32 PM

Reposted by Daniel Vila

David Berenstein

@davidberenstein.bsky.social

High-quality data for fine-tuning language models for free and at the click of a button!

Prompt and wait for your dataset to push to Argilla or the Hub
Evaluate, review and fine-tune a model.

Blog:

Fine-tune a SmolLM on domain-specific synthetic data from a LLM

A Blog post by David Berenstein on Hugging Face

buff.ly

January 7, 2025 at 1:00 PM

Reposted by Daniel Vila

Daniel van Strien

@danielvanstrien.bsky.social

Was 2024 the year of datasets? Is 2025 the year for community-built datasets?

It's exciting to see the progress of many languages in FineWeb-C:
- Total annotations submitted: 41,577
- Languages with annotations: 106
- Total contributors: 363

January 3, 2025 at 12:00 PM

Reposted by Daniel Vila

Daniel van Strien

@danielvanstrien.bsky.social

The finish line is near! We're building FineWeb-Edu for many languages and need your help 🤗

Many FineWeb-C languages are close to 1,000 annotations!

Assamese is 99.4% done, French needs 64 more annotations, Tamil: 216.

Please help us reach the goal: huggingface.co/spaces/data-...

Progress bars showing remaining annotations needed for 15 languages in FineWeb-C dataset, ranging from 6 to 593 annotations needed

January 6, 2025 at 2:32 PM
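
The progress figures in the post above can be checked with a little arithmetic. This is a minimal sketch that assumes a target of 1,000 annotations per language (inferred from "close to 1,000 annotations"; the post does not state the exact goal). The function names and the `GOAL` constant are hypothetical and chosen only for illustration.

```python
# Assumed per-language target; inferred from the post, not confirmed by it.
GOAL = 1_000


def remaining_from_percent(percent_done: float, goal: int = GOAL) -> int:
    """Annotations still needed, given the percentage already completed."""
    done = round(goal * percent_done / 100)
    return goal - done


def percent_from_remaining(remaining: int, goal: int = GOAL) -> float:
    """Percentage completed, given the number of annotations still needed."""
    return round((goal - remaining) * 100 / goal, 1)


if __name__ == "__main__":
    print("Assamese remaining:", remaining_from_percent(99.4))
    print("French percent done:", percent_from_remaining(64))
    print("Tamil percent done:", percent_from_remaining(216))
```

Under that assumption, 99.4% done corresponds to 6 annotations remaining, which matches the low end of the 6 to 593 range in the progress-bar description.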

Daniel Vila

@dvilasuero.hf.co

💥 Ending 2024: A full data annotation journey on the Hugging Face Hub—from raw data to training-ready datasets!

With Argilla 2.6.0, push your data to the Hub from the UI

Let’s make 2025 the year anyone can build more transparent and accountable AI—no coding or model skills needed.

December 20, 2024 at 11:14 AM

Reposted by Daniel Vila

José Francisco Calvo

@jfcalvo.hf.co

🚀 Argilla v2.6.0 is here! 🎉

Let me show you how EASY it is to export your annotated datasets from Argilla to the Hugging Face Hub. 🤩

Take a look at this quick demo 👇

💁‍♂️ More info about the release at github.com/argilla-io/a...

#AI #MachineLearning #OpenSource #DataScience #HuggingFace #Argilla

December 19, 2024 at 12:39 PM

Reposted by Daniel Vila

David Berenstein

@davidberenstein.bsky.social

🔥 We got great feedback on this: "Synthetic Data Generator"

A no-code tool that makes creating datasets with LLMs a breeze, allowing ANYONE to create datasets and models in minutes.

Blog: https://buff.ly/4gybyoT
GitHub: https://buff.ly/49IDSmd
Space: https://buff.ly/3Y1S99z

Introducing the Synthetic Data Generator - Build Datasets with Natural Language

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

buff.ly

December 17, 2024 at 7:18 AM

Reposted by Daniel Vila

Ashvanth.S

@ashvanths.bsky.social

Well, around 10 percent of the initial goal is complete, and so far, it's been quite a one-man army effort. We're still in the hunt for more people to join and contribute to this open-source initiative.

@hf.co

data-is-better-together-fineweb-c.hf.space/share-your-p...

tam - தமிழ் - Tamil

Join and contribute to the dataset tam - தமிழ் - Tamil

data-is-better-together-fineweb-c.hf.space

December 14, 2024 at 7:33 AM

Reposted by Daniel Vila

Johannes

@johko.bsky.social

The sprint for crowdsourced annotations with Argilla is in full swing over at data-is-better-together-fineweb-c.hf.space

I've just contributed 100 examples to this dataset:
data-is-better-together-fineweb-c.hf.space/share-your-p...

Big thanks to @dvilasuero.hf.co, @nataliaelv.hf.co and team 🙌

nds - Neddersass’sch - Low German

Join and contribute to the dataset nds - Neddersass’sch - Low German

data-is-better-together-fineweb-c.hf.space

December 13, 2024 at 7:38 AM

Reposted by Daniel Vila

Moritz Laurer

@moritzlaurer.bsky.social

I've been building a small library for working with prompt templates on the @huggingface.bsky.social Hub: `pip install prompt-templates`. Motivation:

The community currently shares prompt templates in a wide variety of formats: in datasets, in model cards, as strings in .py files, as .txt/... 🧵

December 12, 2024 at 3:58 PM

Reposted by Daniel Vila

Ben Burtenshaw

@benburtenshaw.bsky.social

Desperate to contribute to the development of Scots language AI. I've just contributed 16 examples to this dataset:

data-is-better-together-fineweb-c.hf.space/share-your-p...

sco - Scots - Scots

Join and contribute to the dataset sco - Scots - Scots

data-is-better-together-fineweb-c.hf.space

December 12, 2024 at 1:44 PM

Daniel Vila

@dvilasuero.hf.co

I've just contributed 156 examples to the FineWeb 2 Spanish dataset:

data-is-better-together-fineweb-c.hf.space/share-your-p...

If you want to contribute, sign in with @hf.co and find your language

spa - español - Spanish

Join and contribute to the dataset spa - español - Spanish

data-is-better-together-fineweb-c.hf.space

December 12, 2024 at 1:24 PM

Daniel Vila

@dvilasuero.hf.co

Help shape the future of multilingual Open Source AI!

Join the FineWeb 2 Community Annotation Sprint to create an open training dataset with full transparency and human validation in many languages.

Review datasets in your language and help identify the best sources for training.

December 10, 2024 at 2:12 PM

Reposted by Daniel Vila

frascuchon.bsky.social

@frascuchon.bsky.social

✨ Argilla 2.5.0 is live and it comes with webhook listener support to supercharge your workflows! 🚀

#AI #MachineLearning #Webhooks #TechUpdate

December 3, 2024 at 10:46 AM

Reposted by Daniel Vila

David Berenstein

@davidberenstein.bsky.social

👐 Open Image Preferences is an Apache 2.0 licensed dataset for text-to-image generation by the @hf.co community. This dataset contains 10K text-to-image preference pairs across image generation categories, using different model families and prompt complexities.

Blog: huggingface.co/blog/image-p...

Open Preference Dataset for Text-to-Image Generation by the 🤗 Community

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

huggingface.co

December 9, 2024 at 3:30 PM

Reposted by Daniel Vila

Sara Han

@sdiazlor.hf.co

Open Image Preferences released! 🚀

- Open-source dataset for text2image
- 10K samples manually evaluated by the HF community.
- Binarized format for SFT, DPO, or ORPO.

It comes with a nice blog post explaining the steps to pre-process and generate the data, along with the results.

December 9, 2024 at 4:26 PM
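
The "binarized format" mentioned in the post above usually means that each human preference is reduced to a chosen/rejected pair, which is the shape that preference-tuning methods such as DPO and ORPO typically expect. The following is a minimal sketch of that idea, not the dataset's actual schema or preprocessing code: the `PreferencePair` class, the `binarize` function, the `"a"`/`"b"`/`"tie"` labels, and the file names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PreferencePair:
    """One binarized example: a prompt with a preferred and a rejected output."""
    prompt: str
    chosen: str
    rejected: str


def binarize(prompt: str, image_a: str, image_b: str,
             preference: str) -> Optional[PreferencePair]:
    """Turn an annotator's choice into a chosen/rejected pair.

    `preference` is assumed to be "a", "b", or "tie" (hypothetical labels).
    Ties carry no preference signal, so they are dropped (None is returned).
    """
    if preference == "a":
        return PreferencePair(prompt, chosen=image_a, rejected=image_b)
    if preference == "b":
        return PreferencePair(prompt, chosen=image_b, rejected=image_a)
    if preference == "tie":
        return None
    raise ValueError(f"unknown preference label: {preference!r}")


if __name__ == "__main__":
    # Hypothetical annotations; the image values stand in for paths or URLs.
    raw = [
        ("a red bicycle", "img_001_a.png", "img_001_b.png", "b"),
        ("a snowy forest", "img_002_a.png", "img_002_b.png", "tie"),
    ]
    pairs = [p for p in (binarize(*row) for row in raw) if p is not None]
    for pair in pairs:
        print(pair)
```

Dropping ties is one design choice among several; another option would be to keep them in a separate split for supervised fine-tuning, where no rejected example is needed.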

Daniel Vila

@dvilasuero.hf.co

Announcing Global-MMLU - an improved MMLU Open dataset with evaluation coverage across 42 languages.

The result of months of work with the goal of advancing Multilingual LLM evaluation.

Built together with the community and amazing collaborators at Cohere4AI, MILA, MIT, and many more.

December 6, 2024 at 8:59 AM

Daniel Vila

@dvilasuero.hf.co

We're about to launch the biggest collaboration effort since the Open Assistant.

Let's get the highest quality data for open foundation models with all the nuances & diversity of each language, all with data provenance and transparency

Join us as language lead:
docs.google.com/forms/d/10XI...

Language Lead sign-up

At Hugging Face 🤗, we're launching a big community initiative to improve LLM training for many languages. We're looking for Language Leads to help us cultivate specific languages during this initiativ...

docs.google.com

December 3, 2024 at 4:53 PM

Reposted by Daniel Vila

Natalia

@nataliaelv.hf.co

Next week we're launching a collaborative annotation effort to build a big multilingual dataset, so you can have high-quality data in your language.

We are really close to getting leads for 100 languages! Can you help us cover the remaining 200?

Screenshot of a dashboard showing the number of languages with a lead and languages without a lead

December 3, 2024 at 12:45 PM

Reposted by Daniel Vila

Ben Burtenshaw

@benburtenshaw.bsky.social

For anyone interested in fine-tuning or aligning LLMs, I’m running this free and open course called smol course. It’s not a big deal, it’s just smol.

🧵>>

December 3, 2024 at 9:21 AM

Reposted by Daniel Vila

José Francisco Calvo

@jfcalvo.hf.co

🙌 I just wanted to share a few thoughts about the latest Argilla release, 2.5.0, as it's a pretty big one!

Argilla now has full support for webhooks, which means you can do some pretty cool stuff, like model training on the fly as annotations are created. 🤯

#MachineLearning #NLP #DataLabeling

December 2, 2024 at 11:14 AM

Reposted by Daniel Vila

Ben Burtenshaw

@benburtenshaw.bsky.social

[SATURDAY THREAD] ☕️ 🧑‍🎓

In case you spent the week reading GDPR legislation and missed everything. It’s all about vision language models and image preference datasets.

>> 🧵 Here are the models and datasets you can use in your projects.

November 30, 2024 at 7:40 AM
