Luis
@lusxvr.bsky.social
Research @HuggingFace | CS @ TUM
We're open-sourcing this simple pipeline; check out the repo for the full implementation. It’s designed to be plug-and-play with any Hugging Face dataset.

github.com/huggingface/...
GitHub - huggingface/large-scale-image-deduplication
July 2, 2025 at 2:08 PM
Bonus - Clustering & Labeling:

We also built a simple clustering pipeline that groups similar embeddings using UMAP and DBSCAN, then generates semantic labels for each cluster with a VLM.

This helps reveal the types of concepts present in each dataset.
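
A rough sketch of what that step can look like (function name and parameters here are illustrative, not the repo's actual API):

```python
# Reduce SSCD embeddings with UMAP, then group them with DBSCAN.
# Labeling each cluster with a VLM would run on a few sample images per cluster.
import numpy as np
import umap
from sklearn.cluster import DBSCAN

def cluster_embeddings(embeddings: np.ndarray) -> np.ndarray:
    """Return a cluster id per embedding (-1 = noise). Parameters are illustrative."""
    reduced = umap.UMAP(n_components=2, metric="cosine").fit_transform(embeddings)
    labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(reduced)
    return labels
```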
July 2, 2025 at 2:08 PM
Performance:

Our current implementation is already fast, but for even better performance you can use an optimized library like FAISS for the similarity search.

As the number of images increases, similarity search becomes the primary bottleneck instead of the embedding computation.
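
For example, a FAISS-based search over L2-normalized embeddings could look like this (index type and threshold are assumptions, not necessarily what the repo uses):

```python
import faiss
import numpy as np

def build_index(test_emb: np.ndarray) -> faiss.Index:
    """Exact inner-product index; with L2-normalized embeddings,
    inner product equals cosine similarity."""
    index = faiss.IndexFlatIP(test_emb.shape[1])
    index.add(test_emb.astype(np.float32))
    return index

def query_duplicates(index: faiss.Index, new_emb: np.ndarray, threshold: float = 0.5):
    """Return indices and scores of new images whose nearest test-set
    neighbor exceeds the similarity threshold (threshold is illustrative)."""
    scores, _ = index.search(new_emb.astype(np.float32), k=1)
    hits = scores[:, 0] >= threshold
    return np.nonzero(hits)[0], scores[hits, 0]
```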
July 2, 2025 at 2:08 PM
Step 2 - Fast Deduplication:

For each new dataset we add to our training collection, we run the deduplication pipeline that:
> Embeds the new dataset using SSCD
> Computes the cosine similarity between each image and our test set embeddings
> Returns duplicate indices + similarity scores
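
With L2-normalized embeddings, the similarity step is just a matrix product; a minimal sketch (illustrative names and threshold, not the repo's exact API):

```python
import torch

def find_duplicates(new_emb: torch.Tensor,
                    test_emb: torch.Tensor,
                    threshold: float = 0.5):
    """Return (indices, scores) of new images whose best match in the
    test-set embeddings exceeds the similarity threshold."""
    # (N_new, D) @ (D, N_test) -> (N_new, N_test) cosine similarities,
    # assuming both embedding matrices are L2-normalized.
    sims = new_emb @ test_emb.T
    best_scores, _ = sims.max(dim=1)   # best test-set match per new image
    dup_mask = best_scores >= threshold
    dup_indices = torch.nonzero(dup_mask, as_tuple=False).squeeze(1)
    return dup_indices, best_scores[dup_mask]
```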
July 2, 2025 at 2:08 PM
This results in approximately 700,000 embeddings across 66 datasets, which gives us a comprehensive reference set to check for duplicates.
July 2, 2025 at 2:08 PM
Step 1 – Test Set Indexing:

We indexed all image test datasets from lmms_lab (used by the lmms-eval benchmark) using SSCD (github.com/facebookrese...), a model that generates descriptor embeddings designed specifically for image copy detection.
GitHub - facebookresearch/sscd-copy-detection: Open source implementation of "A Self-Supervised Descriptor for Image Copy Detection" (SSCD).
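
A minimal sketch of this embedding step, assuming one of the TorchScript checkpoints released in the SSCD repo and standard ImageNet preprocessing (helper name is illustrative):

```python
import torch
import torch.nn.functional as F
from torchvision import transforms
from PIL import Image

# Checkpoint filename from facebookresearch/sscd-copy-detection
# (assumption: any of the released TorchScript models works here).
model = torch.jit.load("sscd_disc_mixup.torchscript.pt").eval()

# Preprocessing roughly matching the SSCD inference recipe (ImageNet stats).
preprocess = transforms.Compose([
    transforms.Resize((288, 288)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed_images(paths, batch_size=64):
    """Return L2-normalized SSCD descriptors for a list of image paths."""
    embeddings = []
    for i in range(0, len(paths), batch_size):
        batch = torch.stack([
            preprocess(Image.open(p).convert("RGB"))
            for p in paths[i:i + batch_size]
        ])
        emb = model(batch)                      # (B, 512) descriptors
        embeddings.append(F.normalize(emb, dim=1))
    return torch.cat(embeddings)
```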
July 2, 2025 at 2:08 PM
Since manually checking millions of images is impossible, we needed an automated pipeline.

Ideally, it should be fast enough to run as a final step before uploading the processed dataset to the Hub.
July 2, 2025 at 2:08 PM
The Problem:

When building training datasets, it's critical to ensure that no images from any evaluation benchmark's test set leak in.
July 2, 2025 at 2:08 PM