Quentin Lhoest 🤗
lhoestq.hf.co
@lhoestq.hf.co
Datasets @ Hugging Face | Open Source + HF Dataset Hub
cc @julien.ledem.net the blog post is quite cool imo :)
July 25, 2025 at 4:07 PM
It also speeds up file downloads and uploads, since you now only need to move the data that changed :)

Find out more about Xet here: huggingface.co/blog/xet-on-...
May 16, 2025 at 3:38 PM
This writer outputs Parquet files that are robust to insertions/deletions/edits

Which means each new version of a dataset costs only a fraction of its original storage! 🔥🤯

e.g. if you store with Xet, which deduplicates files by chunk

cc @julien.ledem.net FYI
May 16, 2025 at 3:38 PM
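To see why content-defined chunk boundaries matter for deduplication, here is a toy sketch using only the Python standard library. It is not PyArrow's or Xet's actual chunker: the `GEAR` table, the 6-bit mask, the 64-byte fixed chunk size, and the helper names are all assumptions made for illustration. It prepends three bytes to a random buffer and counts how many bytes a chunk-deduplicating store would still need to add.

```python
import hashlib
import random

rng = random.Random(0)
# Hypothetical "gear" table: one pseudo-random 32-bit value per byte value.
GEAR = [rng.getrandbits(32) for _ in range(256)]


def fixed_chunks(data: bytes, size: int = 64) -> list:
    """Cut every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def cdc_chunks(data: bytes, mask: int = 0x3F) -> list:
    """Cut after any byte where the rolling hash's low 6 bits are all zero.

    Those bits depend only on the previous 6 bytes, so boundaries follow
    the content rather than the byte offset (roughly one cut per 64 bytes).
    """
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if (h & mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks


def new_bytes(old: list, new: list) -> int:
    """Bytes a chunk-deduplicating store would still have to add for `new`."""
    seen = {hashlib.sha256(c).digest() for c in old}
    total = 0
    for c in new:
        digest = hashlib.sha256(c).digest()
        if digest not in seen:
            seen.add(digest)
            total += len(c)
    return total


original = bytes(rng.getrandbits(8) for _ in range(20_000))
edited = b"new" + original  # a small insertion at the very start

fixed_new = new_bytes(fixed_chunks(original), fixed_chunks(edited))
cdc_new = new_bytes(cdc_chunks(original), cdc_chunks(edited))
print(f"fixed-size chunks: {fixed_new} new bytes to store")
print(f"content-defined chunks: {cdc_new} new bytes to store")
```

With fixed-size chunks the insertion shifts every boundary, so essentially nothing deduplicates. With content-defined boundaries, only the chunk or chunks around the edit change, and the rest are found in the store.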
you can define it this way:

>>> import pyarrow.parquet as pq
>>> writer = pq.ParquetWriter(
...     out, schema,
...     use_content_defined_chunking=True,
... )
May 16, 2025 at 3:38 PM
Reposted by Quentin Lhoest 🤗
Spreadsheet folk are welcome on the @hf.co hub too!

@lhoestq.hf.co
https://buff.ly/3VAEYKW
Dataset Spreadsheets - a Hugging Face Space by lhoestq
December 14, 2024 at 11:00 AM