Quentin Lhoest 🤗
@lhoestq.hf.co
Datasets @ Hugging Face | Open Source + HF Dataset Hub
New blog post 🚨 Every data engineer should read it

@kszucs.bsky.social (Apache Arrow PMC member) explains how to drastically speed up Parquet file uploads and downloads via deduplication.

Best part: the feature enabling this is open source!
huggingface.co/blog/parquet...
Parquet Content-Defined Chunking
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
huggingface.co
July 25, 2025 at 4:06 PM
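
For readers new to the idea, here is a minimal sketch of content-defined chunking in plain Python. It is not the Parquet or PyArrow implementation: the function name `cdc_chunks`, the toy rolling hash, and the size parameters are all assumptions chosen for illustration. The point it tries to show is that chunk boundaries follow the content rather than fixed offsets, so an insertion near the start of a file leaves most later chunks byte-identical and therefore deduplicable.

```python
import hashlib
import random


def cdc_chunks(data, mask=0x3F, min_size=16, max_size=256):
    """Split bytes at boundaries picked by a hash of recent content (toy version)."""
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Toy rolling hash: the low bits tested below depend only on the last few bytes.
        h = ((h << 1) + byte) & 0xFFFFFFFF
        size = i - start + 1
        # Cut when the hash hits the pattern (after min_size) or the chunk gets too long.
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


# Hypothetical demo data: random bytes, then the same bytes with 10 bytes inserted in front.
rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20_000))
edited = b"new header" + original

# A store keyed by chunk hash only needs to upload chunks it has not seen before.
stored = {hashlib.sha256(c).hexdigest() for c in cdc_chunks(original)}
new_digests = [hashlib.sha256(c).hexdigest() for c in cdc_chunks(edited)]
reused = sum(d in stored for d in new_digests)
print(f"{reused} of {len(new_digests)} chunks already stored")
```

With fixed-size chunks, the same 10-byte insertion would shift every later boundary and no chunk would match.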
CDC Parquet writer is out in PyArrow nightlies 🔥🔥

$ pip install \
-i pypi.anaconda.org/scientific-p... \
"pyarrow>=21.0.0.dev0"

it's changing the way I view data versioning👇
May 16, 2025 at 3:38 PM
Reposted by Quentin Lhoest 🤗
🤖 We are thrilled to announce AgiBot World, the first large-scale robotic learning dataset designed to advance multi-purpose humanoid policies!

Github:
github.com/OpenDriveLab...

HuggingFace:
huggingface.co/agibot-world
December 30, 2024 at 10:48 AM
Reposted by Quentin Lhoest 🤗
SuperCharged Euclid is on 🤗 Hugging Face

Also, this is the best paper heading I’ve seen in quite some time. The 'en tête' looks fantastic.

(⚡Llama 3.3) Chat with the paper: huggingface.co/spaces/hugg...
🤗 Model: huggingface.co/euclid-mult...
🤗 Dataset: huggingface.co/datasets/eu...
December 13, 2024 at 5:51 PM
Reposted by Quentin Lhoest 🤗
We outperform Llama 70B with Llama 3B on hard math by scaling test-time compute 🔥

How? By combining step-wise reward models with tree search algorithms :)

We're open sourcing the full recipe and sharing a detailed blog post 👇
December 16, 2024 at 5:08 PM
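
As a rough illustration of the idea in the post above, the sketch below runs a beam search (one simple tree search) in which every partial solution is scored after each step. Everything in it is a hypothetical stand-in: `step_reward` is a hand-written distance function rather than a learned reward model, and the "steps" are arithmetic operations rather than text sampled from an LLM. It is not the recipe from the blog post.

```python
STEPS = [
    ("+3", lambda x: x + 3),
    ("*2", lambda x: x * 2),
    ("-1", lambda x: x - 1),
]


def step_reward(value, target):
    # Hypothetical stand-in for a step-wise reward model:
    # partial solutions closer to the target score higher.
    return -abs(target - value)


def beam_search(start, target, beam_width=2, max_depth=8):
    # Each beam entry is (current value, names of the steps taken so far).
    beam = [(start, [])]
    for _ in range(max_depth):
        candidates = []
        for value, path in beam:
            for name, fn in STEPS:
                candidates.append((fn(value), path + [name]))
        # Score every partial solution after its newest step and keep the best few.
        candidates.sort(key=lambda c: step_reward(c[0], target), reverse=True)
        beam = candidates[:beam_width]
        if beam[0][0] == target:
            break
    return beam[0]


value, path = beam_search(start=1, target=26)
print(value, " ".join(path))
```

Widening `beam_width` or raising `max_depth` spends more compute at inference time to explore more of the tree, which is the trade-off the post refers to.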
Reposted by Quentin Lhoest 🤗
In-place Assistants > Chat windows!
Hugging Face's integration of an "AI Query" overlay in their SQL console exemplifies this. Users input natural language, AI suggests SQL queries—streamlining data exploration seamlessly. Probably the best showcase of this pattern in a freely accessible product.
December 5, 2024 at 11:20 AM
Reposted by Quentin Lhoest 🤗
Spreadsheet folk are welcome on the @hf.co hub too!

@lhoestq.hf.co
https://buff.ly/3VAEYKW
Dataset Spreadsheets - a Hugging Face Space by lhoestq
Discover amazing ML apps made by the community
huggingface.co
December 14, 2024 at 11:00 AM
Reposted by Quentin Lhoest 🤗
🚀 Introducing INCLUDE 🌍: A multilingual LLM evaluation benchmark spanning 44 languages!

Contains *newly-collected* data, prioritizing *regional knowledge*.
Setting the stage for truly global AI evaluation.
Ready to see how your model measures up?
#AI #Multilingual #LLM #NLProc
December 2, 2024 at 3:53 PM
Reposted by Quentin Lhoest 🤗
The AT Protocol unlocks exciting possibilities:
- Building custom feeds using ML
- Creating dashboards for data exploration
- Developing custom models for Bluesky
To gather @bsky.app resources on @huggingface.bsky.social, I've established a community org 🤗 huggingface.co/bluesky-comm...
bluesky-community (Bluesky Community)
Tools for Bluesky 🦋
huggingface.co
November 25, 2024 at 3:59 PM
Reposted by Quentin Lhoest 🤗
First dataset for the new @huggingface.bsky.social @bsky.app community organisation: one-million-bluesky-posts 🦋

📊 1M public posts from Bluesky's firehose API
🔍 Includes text, metadata, and language predictions
🔬 Perfect to experiment with using ML for Bluesky 🤗

huggingface.co/datasets/blu...
bluesky-community/one-million-bluesky-posts · Datasets at Hugging Face
huggingface.co
November 26, 2024 at 1:50 PM
Reposted by Quentin Lhoest 🤗
NuminaMath dataset now under Apache 2.0 license.
AI-MO/NuminaMath-CoT · Datasets at Hugging Face
huggingface.co
November 25, 2024 at 9:31 AM
Reposted by Quentin Lhoest 🤗
Open Source Post Training is going strong! In the last 2 weeks, we got data or recipes released for OpenCoder, SmolLM-2, Orca Agent Instruct, and Tülu 3. Read it, learn, and iterate:
November 23, 2024 at 7:45 AM