Quentin Lhoest 🤗
@lhoestq.hf.co
Datasets @ Hugging Face | Open Source + HF Dataset Hub
It also speeds up file downloads and uploads, since you now only need to move the data that changed :)

Find out more about Xet here: huggingface.co/blog/xet-on-...
May 16, 2025 at 3:38 PM
This writer outputs Parquet files that are robust to insertions/deletions/edits

Which means each new version of a dataset costs only a fraction of its original storage! 🔥🤯

E.g. if you store them with Xet, which deduplicates files by chunk.

cc @julien.ledem.net FYI
May 16, 2025 at 3:38 PM
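
To see why chunk-level dedup likes content-defined boundaries, here is a toy sketch in plain Python. It is not the PyArrow writer and not Xet's actual algorithm: the gear-style rolling hash, the `MASK`, the 256-byte sizes and all function names are assumptions made up for illustration. It inserts a few bytes near the start of a blob and counts how many bytes land in chunks that are not already stored, once with fixed-size chunks and once with content-defined ones.

```python
import hashlib
import random

# Toy parameters (assumptions for illustration only).
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]  # one random value per byte value
MASK = 0xFF        # cut when the low 8 bits of the hash are zero (~256-byte chunks)
FIXED_SIZE = 256   # chunk size for the fixed-size baseline


def cdc_chunks(data: bytes) -> list:
    """Cut wherever a rolling hash of the most recent bytes matches a pattern."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Older bytes are shifted out of the masked bits, so a cut point
        # depends only on the few bytes just before it.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if h & MASK == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def fixed_chunks(data: bytes) -> list:
    """Baseline: cut every FIXED_SIZE bytes regardless of content."""
    return [data[i:i + FIXED_SIZE] for i in range(0, len(data), FIXED_SIZE)]


def new_bytes(old_chunks: list, new_chunks: list) -> int:
    """Bytes in chunks of the new version whose hash is not already stored."""
    stored = {hashlib.sha256(c).digest() for c in old_chunks}
    return sum(len(c) for c in new_chunks
               if hashlib.sha256(c).digest() not in stored)


# Version 1: deterministic pseudo-random bytes. Version 2: a small insertion.
data_rng = random.Random(42)
v1 = bytes(data_rng.getrandbits(8) for _ in range(50_000))
v2 = v1[:1000] + b"inserted row" + v1[1000:]

new_fixed = new_bytes(fixed_chunks(v1), fixed_chunks(v2))
new_cdc = new_bytes(cdc_chunks(v1), cdc_chunks(v2))

print(f"fixed-size chunks: {new_fixed} new bytes out of {len(v2)}")
print(f"content-defined chunks: {new_cdc} new bytes out of {len(v2)}")
```

With fixed-size chunks, everything after the insertion shifts, so almost every chunk looks new. With content-defined chunks, the cut points move with the data, so only the chunks around the edit need to be stored or transferred again.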
The CDC (content-defined chunking) Parquet writer is out in PyArrow nightlies 🔥🔥

$ pip install \
-i pypi.anaconda.org/scientific-p... \
"pyarrow>=21.0.0.dev0"

it's changing the way I view data versioning👇
May 16, 2025 at 3:38 PM