Luis
@lusxvr.bsky.social
Research @HuggingFace | CS @ TUM
We're open-sourcing this simple pipeline; check out the repo for the full implementation. It’s designed to be plug-and-play with any Hugging Face dataset.

github.com/huggingface/...
GitHub - huggingface/large-scale-image-deduplication
July 2, 2025 at 2:08 PM
Bonus - Clustering & Labeling:

We also built a simple clustering pipeline that groups similar embeddings using UMAP and DBSCAN, then generates semantic labels for each cluster with a VLM.

This helps reveal the types of concepts present in each dataset.
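
A rough sketch of what that step can look like (function name and parameters here are illustrative, not the repo's actual API):

```python
# Reduce SSCD embeddings with UMAP, then group them with DBSCAN.
# Labeling each cluster with a VLM would run on a few sample images per cluster.
import numpy as np
import umap
from sklearn.cluster import DBSCAN

def cluster_embeddings(embeddings: np.ndarray) -> np.ndarray:
    """Return a cluster id per embedding (-1 = noise). Parameters are illustrative."""
    reduced = umap.UMAP(n_components=2, metric="cosine").fit_transform(embeddings)
    labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(reduced)
    return labels
```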
July 2, 2025 at 2:08 PM
Performance:

Our current implementation is already fast, but for even better performance you can use an optimized library like FAISS for the similarity search.

As the number of images increases, similarity search becomes the primary bottleneck instead of the embedding computation.
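
For example, a FAISS-based search over L2-normalized embeddings could look like this (index type and threshold are assumptions, not necessarily what the repo uses):

```python
import faiss
import numpy as np

def build_index(test_emb: np.ndarray) -> faiss.Index:
    """Exact inner-product index; with L2-normalized embeddings,
    inner product equals cosine similarity."""
    index = faiss.IndexFlatIP(test_emb.shape[1])
    index.add(test_emb.astype(np.float32))
    return index

def query_duplicates(index: faiss.Index, new_emb: np.ndarray, threshold: float = 0.5):
    """Return indices and scores of new images whose nearest test-set
    neighbor exceeds the similarity threshold (threshold is illustrative)."""
    scores, _ = index.search(new_emb.astype(np.float32), k=1)
    hits = scores[:, 0] >= threshold
    return np.nonzero(hits)[0], scores[hits, 0]
```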
July 2, 2025 at 2:08 PM
Step 2 - Fast Deduplication:

For each new dataset we add to our training collection, we run the deduplication pipeline that:
> Embeds the new dataset using SSCD
> Computes the cosine similarity between each image and our test set embeddings
> Returns duplicate indices + similarity scores
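
With L2-normalized embeddings, the similarity step is just a matrix product; a minimal sketch (illustrative names and threshold, not the repo's exact API):

```python
import torch

def find_duplicates(new_emb: torch.Tensor,
                    test_emb: torch.Tensor,
                    threshold: float = 0.5):
    """Return (indices, scores) of new images whose best match in the
    test-set embeddings exceeds the similarity threshold."""
    # (N_new, D) @ (D, N_test) -> (N_new, N_test) cosine similarities,
    # assuming both embedding matrices are L2-normalized.
    sims = new_emb @ test_emb.T
    best_scores, _ = sims.max(dim=1)   # best test-set match per new image
    dup_mask = best_scores >= threshold
    dup_indices = torch.nonzero(dup_mask, as_tuple=False).squeeze(1)
    return dup_indices, best_scores[dup_mask]
```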
July 2, 2025 at 2:08 PM
This results in approximately 700,000 embeddings across 66 datasets, which gives us a comprehensive reference set to check for duplicates.
July 2, 2025 at 2:08 PM
Step 1 – Test Set Indexing:

We indexed all image test datasets from lmms_lab (used by the lmms-eval benchmark) using SSCD (github.com/facebookrese...), a model that generates descriptor embeddings designed specifically for image copy detection.
GitHub - facebookresearch/sscd-copy-detection: Open source implementation of "A Self-Supervised Descriptor for Image Copy Detection" (SSCD).
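
A minimal sketch of this embedding step, assuming one of the TorchScript checkpoints released in the SSCD repo and standard ImageNet preprocessing (helper name is illustrative):

```python
import torch
import torch.nn.functional as F
from torchvision import transforms
from PIL import Image

# Checkpoint filename from facebookresearch/sscd-copy-detection
# (assumption: any of the released TorchScript models works here).
model = torch.jit.load("sscd_disc_mixup.torchscript.pt").eval()

# Preprocessing roughly matching the SSCD inference recipe (ImageNet stats).
preprocess = transforms.Compose([
    transforms.Resize((288, 288)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed_images(paths, batch_size=64):
    """Return L2-normalized SSCD descriptors for a list of image paths."""
    embeddings = []
    for i in range(0, len(paths), batch_size):
        batch = torch.stack([
            preprocess(Image.open(p).convert("RGB"))
            for p in paths[i:i + batch_size]
        ])
        emb = model(batch)                      # (B, 512) descriptors
        embeddings.append(F.normalize(emb, dim=1))
    return torch.cat(embeddings)
```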
July 2, 2025 at 2:08 PM
Since manually checking millions of images is impossible, we needed an automated pipeline.

Ideally, it should be fast enough to run as a final step before uploading the processed dataset to the Hub.
July 2, 2025 at 2:08 PM
The Problem:

When building training datasets, it's critical to ensure that no images from any evaluation benchmark's test set leak in.
July 2, 2025 at 2:08 PM