Vladan Stojnić
@stojnicv.xyz
Ph.D. student at Visual Recognition Group, Czech Technical University in Prague

🔗 https://stojnicv.xyz
Are you at @iccv.bsky.social #ICCV2025? Come by our poster.

📅 October 22, 2025, 14:30 – 16:30 HST
📍 Location: Exhibit Hall I, Poster #207
October 21, 2025 at 6:15 PM
I agree that the terminology is confusing. However, I wouldn't agree that CLIP is an SSL method. It uses a contrastive loss, but not with self-supervised labels. The DINOv2 and DINOv3 papers classify it as weakly supervised, since its labels come from the paired text.
August 18, 2025 at 3:05 PM
The same pattern can be observed for the acquisition parameters in the task of near-duplicate retrieval. If the negatives are captured with the same camera as the query, the task becomes harder for some models than when they are captured with a different camera.
August 18, 2025 at 10:48 AM
The impact on semantic performance is again most pronounced for contrastive VLMs and least pronounced for SSL models.

Here, we show kNN classification in several settings, depending on whether the semantic positives and negatives share the processing parameters of the test image.
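A minimal sketch of what such a metadata-conditioned kNN evaluation could look like, assuming precomputed frozen-model embeddings together with semantic and processing-parameter labels; all data and names below are illustrative placeholders, not our actual code:

```python
# Illustrative kNN-evaluation sketch (not the paper's code). It contrasts two
# galleries for the same query: one whose images share the query's processing
# parameters and one whose images do not. Random arrays stand in for
# frozen-model embeddings, semantic classes, and a processing-parameter tag.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
feats = rng.normal(size=(2000, 512))    # frozen-model embeddings (placeholder)
sem = rng.integers(0, 10, size=2000)    # semantic class labels (placeholder)
proc = rng.integers(0, 4, size=2000)    # processing-parameter label (placeholder)

query_feat, query_proc = feats[0], proc[0]
gallery = np.arange(1, 2000)

def knn_predict(idx, k=20):
    """Predict the query's semantic class using only the given gallery subset."""
    clf = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    clf.fit(feats[idx], sem[idx])
    return clf.predict(query_feat[None])[0]

same_proc = gallery[proc[gallery] == query_proc]  # gallery shares processing params
diff_proc = gallery[proc[gallery] != query_proc]  # gallery differs in processing params

print("same-processing gallery prediction:", knn_predict(same_proc))
print("different-processing gallery prediction:", knn_predict(diff_proc))
```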
August 18, 2025 at 10:48 AM
This impact is especially pronounced when there is a strong correlation or anticorrelation between the semantic and metadata labels, e.g., when the semantic positives share the query image's processing parameters while the semantic negatives do not.
August 18, 2025 at 10:48 AM
More strikingly, we show that traces of these metadata labels (processing and acquisition parameters) can significantly impact a model's semantic recognition abilities.
August 18, 2025 at 10:48 AM
A similar pattern is observed for the acquisition parameters, although, in general, all models have a harder time predicting these parameters than the processing ones.
August 18, 2025 at 10:48 AM
Image processing parameters can be recovered from the representations of frozen models by training a linear layer on top. This ability is especially pronounced for contrastive VLMs (e.g., CLIP). Some supervised models perform strongly as well, while SSL models perform the worst.
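A minimal sketch of this linear-probing setup, assuming features have already been extracted from a frozen backbone (e.g., CLIP or DINOv2 image embeddings) and that each image carries a discrete processing-parameter label such as a JPEG-quality bin; the data and names below are placeholders, not our actual pipeline:

```python
# Minimal linear-probe sketch (illustrative, not the paper's code).
# A single linear layer on top of frozen features is equivalent to
# multinomial logistic regression on those features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 256))   # placeholder for frozen-model features
y = rng.integers(0, 5, size=1000)  # placeholder for JPEG-quality bins

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000)  # the trained linear layer
probe.fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))
```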
August 18, 2025 at 10:48 AM
Have you ever asked yourself how much your favorite vision model knows about image capture parameters (e.g., the amount of JPEG compression, the camera model, etc.)? Furthermore, could these parameters influence its semantic recognition abilities?
August 18, 2025 at 10:48 AM
Are you at @cvprconference.bsky.social #CVPR2025 ? Come and check out LPOSS.

We show how graph-based label propagation can be used to improve weak, patch-level predictions from VLMs for open-vocabulary semantic segmentation (a rough sketch of the idea follows below).

📅 June 13, 2025, 16:00 – 18:00 CDT
📍 Location: ExHall D, Poster #421
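A rough sketch of the general graph-based label-propagation idea (in the closed-form style of Zhou et al., 2003), not the actual LPOSS implementation; the patch features and initial class scores below are random placeholders:

```python
# Rough sketch of label propagation over patch-level predictions
# (illustrative only, not the LPOSS code).
import numpy as np

rng = np.random.default_rng(0)
n_patches, dim, n_classes = 196, 64, 5
feats = rng.normal(size=(n_patches, dim))               # patch embeddings (placeholder)
init_scores = rng.normal(size=(n_patches, n_classes))   # weak patch-level class scores

# Build a symmetric kNN affinity graph over patches from cosine similarity.
feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
sim = feats @ feats.T
k = 10
W = np.zeros_like(sim)
idx = np.argsort(-sim, axis=1)[:, 1:k + 1]              # skip self-similarity
rows = np.repeat(np.arange(n_patches), k)
W[rows, idx.ravel()] = np.clip(sim[rows, idx.ravel()], 0, None)
W = np.maximum(W, W.T)

# Closed-form propagation: F = (I - alpha * S)^{-1} Y with S = D^{-1/2} W D^{-1/2}.
d = W.sum(axis=1) + 1e-12
S = W / np.sqrt(d)[:, None] / np.sqrt(d)[None, :]
alpha = 0.9
refined = np.linalg.solve(np.eye(n_patches) - alpha * S, init_scores)

print("refined patch labels:", refined.argmax(axis=1)[:10])
```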
June 13, 2025 at 12:03 PM
I agree. However, be careful, especially if you go to higher downsampling factors. Those numbers are a bit too optimistic, as the behavior can be very dataset-dependent. They were reported on PASCAL, which mostly contains images with one or two objects against a background.
April 14, 2025 at 11:54 AM
Be sure to check out the demo on @hf.co spaces and play with it. 🙂

huggingface.co/spaces/stojn...
March 27, 2025 at 2:02 PM
They obtain especially strong performance around object boundaries as measured by the Boundary IoU metric.
March 27, 2025 at 2:02 PM
LPOSS and LPOSS+ obtain strong performance across a wide range of benchmarks in terms of the commonly used mIoU metric.
March 27, 2025 at 2:02 PM
LPOSS and LPOSS+ perform inference over the entire image, avoiding window-based processing and thereby capturing contextual interactions across the full image.
March 27, 2025 at 2:02 PM
We are happy to share LPOSS: Label Propagation Over Patches and Pixels for Open-vocabulary Semantic Segmentation.

LPOSS is a training-free method for open-vocabulary semantic segmentation using Vision-Language Models.
March 27, 2025 at 2:02 PM
SigLIP could be an interesting choice. However, note that in ILIAS we observe that DINOv2 mostly outperforms SigLIP on landmarks, which is probably our closest category to localization images.
February 26, 2025 at 6:06 PM