Vladan Stojnić
@stojnicv.xyz
Ph.D. student at Visual Recognition Group, Czech Technical University in Prague

🔗 https://stojnicv.xyz
October 21, 2025 at 6:36 PM
We show that representations from some foundation models, especially CVLs like CLIP, encode information about image metadata. More surprisingly, we show that such metadata traces can even affect performance on semantic downstream tasks.
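To make the probing idea concrete, here is a minimal sketch: re-encode images at different JPEG qualities, embed them with a frozen CLIP image encoder, and fit a linear probe on the quality label. The model choice, the synthetic data, and the JPEG-quality label are illustrative assumptions, not the paper's exact protocol.

```python
# Rough, self-contained sketch (not the paper's protocol): save random images
# at two JPEG qualities, embed with frozen CLIP, linearly probe the quality.
import io

import numpy as np
import torch
import open_clip
from PIL import Image
from sklearn.linear_model import LogisticRegression

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)
model.eval()

def jpeg_version(img, quality):
    """Round-trip an image through JPEG compression at a given quality."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

@torch.no_grad()
def embed(images):
    batch = torch.stack([preprocess(im) for im in images])
    feats = model.encode_image(batch)
    return torch.nn.functional.normalize(feats, dim=-1).numpy()

# Synthetic stand-in data: random RGB images, each saved at quality 20 and 95.
rng = np.random.default_rng(0)
base = [Image.fromarray(rng.integers(0, 256, (224, 224, 3), dtype=np.uint8))
        for _ in range(40)]
images = [jpeg_version(im, q) for im in base for q in (20, 95)]
labels = np.array([q for _ in base for q in (20, 95)])

X = embed(images)
split = 60  # first 30 base images for training, last 10 for testing
probe = LogisticRegression(max_iter=1000).fit(X[:split], labels[:split])
print("JPEG-quality probe accuracy:", probe.score(X[split:], labels[split:]))
```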
October 21, 2025 at 6:11 PM
I second the Hyperion
October 20, 2025 at 8:46 AM
As for the term CVL, we chose it specifically to distinguish CLIP-like VLMs from VLMs that can generate text, since the term VLM is overused and means many different things in different papers. To an extent, it also follows the naming from arxiv.org/pdf/2405.17247
August 18, 2025 at 3:05 PM
I agree that the terminology is confusing. However, I wouldn't agree that CLIP is an SSL method. It uses a contrastive loss, but not with self-supervised labels. DINOv2 and v3 classify it as weakly supervised, since its labels come from the text.
August 18, 2025 at 3:05 PM
Many thanks to the amazing collaborators: @ryan-ramos.bsky.social, @gkordo.bsky.social, Yuta Nakashima, @gtolias.bsky.social, @noagarciad.bsky.social
August 18, 2025 at 10:48 AM
If this caught your attention, check out our new paper.

Processing and acquisition traces in visual encoders: What does CLIP know about your camera?

arxiv.org/abs/2508.10637

To be presented at #ICCV2025 (highlight). @iccv.bsky.social
August 18, 2025 at 10:48 AM
The same pattern can be observed for the acquisition parameters in the task of near-duplicate retrieval. If the negatives are captured with the same camera as the query, the task becomes harder for some models than when they are captured with a different camera.
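A minimal sketch of this retrieval comparison, assuming you already have L2-normalized embeddings: rank the near-duplicate positive against negatives split by whether they share the query's camera. The random features and camera IDs below are synthetic stand-ins; only the evaluation harness is meant literally.

```python
# Sketch: rank a near-duplicate positive against negatives that either share
# the query's camera or not. Features and camera IDs are random placeholders.
import numpy as np

def l2n(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def positive_rank(q, pos, negs):
    """1-based rank of the positive among positive + negatives, by cosine sim."""
    return 1 + int((negs @ q >= pos @ q).sum())

rng = np.random.default_rng(0)
q = l2n(rng.normal(size=512))
pos = l2n(q + 0.3 * rng.normal(size=512))    # near-duplicate of the query
negs = l2n(rng.normal(size=(200, 512)))
neg_cams = rng.integers(0, 4, size=200)      # hypothetical camera IDs
q_cam = 0

print("rank vs same-camera negatives:",
      positive_rank(q, pos, negs[neg_cams == q_cam]))
print("rank vs different-camera negatives:",
      positive_rank(q, pos, negs[neg_cams != q_cam]))
```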
August 18, 2025 at 10:48 AM
The impact on semantic performance is again most pronounced for contrastive VLMs and least pronounced for SSL models.

Here, we show kNN classification in several settings, depending on whether the semantic positives and negatives share the same processing parameters as the test image.
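A rough sketch of such a controlled kNN evaluation, under the assumption that each gallery item carries a class label and a processing-parameter ID: same-class items either share the test image's processing parameters ("aligned") or deliberately differ. All names and the synthetic data are placeholders, not the paper's setup.

```python
# Sketch of a controlled kNN evaluation: build a gallery where same-class
# items share (or do not share) the test image's processing parameters,
# then classify by majority vote over the nearest neighbors.
import numpy as np
from collections import Counter

def knn_predict(test_feat, gallery_feats, gallery_labels, k=5):
    """Majority vote over the k most cosine-similar gallery items."""
    top = np.argsort(-(gallery_feats @ test_feat))[:k]
    return Counter(gallery_labels[top].tolist()).most_common(1)[0][0]

def controlled_gallery(feats, labels, procs, test_label, test_proc, aligned):
    """aligned=True: same-class items share the test image's processing
    parameters while other-class items do not; aligned=False: the reverse."""
    same_proc = procs == test_proc
    keep = np.where(labels == test_label, same_proc == aligned, same_proc != aligned)
    return feats[keep], labels[keep]

# Synthetic placeholder gallery: 200 items, 5 classes, 2 processing settings.
rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 64))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
labels = rng.integers(0, 5, size=200)
procs = rng.integers(0, 2, size=200)

test_feat, test_label, test_proc = feats[0], labels[0], procs[0]
for aligned in (True, False):
    g_feats, g_labels = controlled_gallery(feats[1:], labels[1:], procs[1:],
                                           test_label, test_proc, aligned)
    pred = knn_predict(test_feat, g_feats, g_labels)
    print(f"aligned={aligned}: predicted class {pred} (true {test_label})")
```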
August 18, 2025 at 10:48 AM
This impact is especially pronounced when there is a strong correlation or anticorrelation between the semantic and metadata labels, e.g., when the semantic positives share the same processing parameters as the query image while the negatives do not.
August 18, 2025 at 10:48 AM
More strikingly, we show that traces of these metadata labels (processing and acquisition parameters) can significantly impact semantic recognition abilities.
August 18, 2025 at 10:48 AM
A similar pattern is observed for the acquisition parameters, although all models generally have a harder time predicting these parameters than the processing ones.
August 18, 2025 at 10:48 AM