github.com/huggingface/...
We also build a simple clustering pipeline that groups similar embeddings using UMAP and DBSCAN, then generates semantic labels with a VLM.
This helps reveal the types of concepts present in each dataset.
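The grouping step can be sketched with scikit-learn alone. PCA stands in here for UMAP (from the umap-learn package) so the example stays dependency-light, and the embeddings are synthetic; this is a minimal sketch of the idea, not the actual pipeline:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Fake 256-d embeddings: two well-separated groups (stand-in for SSCD outputs).
group_a = rng.normal(0.0, 0.05, size=(40, 256)) + 1.0
group_b = rng.normal(0.0, 0.05, size=(40, 256)) - 1.0
embeddings = np.vstack([group_a, group_b])

# Reduce dimensionality, then density-cluster.
# PCA is a lightweight stand-in for UMAP here; swap in
# umap.UMAP(n_components=2).fit_transform(embeddings) for the real thing.
reduced = PCA(n_components=2).fit_transform(embeddings)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(reduced)  # -1 would mean noise
```

A few exemplar images from each resulting cluster can then be sent to a VLM to produce the semantic label for that cluster.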
Our current implementation is already fast, but for even better performance you can use an optimized library such as FAISS for the similarity search.
As the number of images grows, the similarity search, rather than the embedding computation, becomes the primary bottleneck.
For each new dataset we add to our training collection, we run the deduplication pipeline that:
> Embeds the new dataset using SSCD
> Computes the cosine similarity between each image and our test set embeddings
> Returns duplicate indices + similarity scores
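Assuming L2-normalized SSCD descriptors, the steps above reduce to one matrix product plus a threshold. The 0.5 cutoff and the helper name below are illustrative choices, not values from this thread:

```python
import numpy as np

def find_duplicates(new_emb, test_emb, threshold=0.5):
    """Return (indices into new_emb, best-matching test indices, similarities)
    for images whose max cosine similarity to the test set exceeds threshold."""
    # L2-normalize both sides so the matrix product gives cosine similarities.
    new_n = new_emb / np.linalg.norm(new_emb, axis=1, keepdims=True)
    test_n = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)
    sims = new_n @ test_n.T                    # (n_new, n_test) similarity matrix
    best = sims.max(axis=1)                    # best test-set match per new image
    dup_idx = np.where(best > threshold)[0]
    return dup_idx, sims.argmax(axis=1)[dup_idx], best[dup_idx]

rng = np.random.default_rng(0)
test_emb = rng.normal(size=(100, 128))
new_emb = rng.normal(size=(20, 128))
new_emb[3] = test_emb[7]  # plant a leaked test image

dup_idx, match_idx, scores = find_duplicates(new_emb, test_emb)
```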
We indexed all image test datasets from lmms_lab (used by the lmms-eval benchmark) using SSCD (github.com/facebookrese...), a model that generates a descriptor embedding specifically for image copy detection.
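The indexing pass looks roughly like this. A random linear layer stands in for the real SSCD model (in practice you would `torch.jit.load` one of the TorchScript checkpoints from the facebookresearch repo); the point is just the shape of the loop — embed, L2-normalize, store:

```python
import torch
import torch.nn.functional as F

# Stand-in backbone; the real pipeline loads an SSCD TorchScript checkpoint instead.
backbone = torch.nn.Linear(3 * 224 * 224, 512, bias=False)

@torch.no_grad()
def embed(batch):
    """Map a batch of (3, 224, 224) images to L2-normalized copy-detection descriptors."""
    feats = backbone(batch.flatten(start_dim=1))
    # Unit norm, so plain dot products between descriptors are cosine similarities.
    return F.normalize(feats, dim=1)

images = torch.rand(8, 3, 224, 224)  # stand-in for preprocessed test-set images
descriptors = embed(images)
```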
Ideally this should be fast enough to run as a final step before uploading the processed dataset to the Hub.
When building training datasets, it's critical to ensure that no images from the test sets of any evaluation benchmarks leak in.