Antonio Tejero-de-Pablos
@toni-tiler.bsky.social
Research scientist in computer vision / Samurai / Rapper
Am I the only one who feels like "the LLM whisperer" when vibe-coding?
November 17, 2025 at 8:42 AM
In the multi-object setting, the CLIP encoder prioritizes objects that appear earlier in the caption or that are larger in the image.
arxiv.org/abs/2502.19842
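A minimal probe sketch of this bias, assuming the off-the-shelf Hugging Face CLIP and a placeholder image path (this is not the paper's evaluation protocol):

```python
# Minimal probe sketch: does mention order in the caption change
# CLIP's image-text similarity? "photo.jpg" is a placeholder path.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # e.g., a scene with a dog and a ball
captions = [
    "a dog and a ball on the grass",  # dog mentioned first
    "a ball and a dog on the grass",  # ball mentioned first
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # (1 image, 2 captions)

# A consistent gap between the two scores across many images would
# suggest sensitivity to mention order.
print(logits.squeeze().tolist())
```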
April 20, 2025 at 2:38 PM
This well-known problem is not being dealt with properly. A tutorial at SIGIR 2024 proposed a framework in which a model masters fundamental reasoning tasks from basic curated data, instead of being made to learn tons of Internet sh*t. Same as humans do at school. The rest is information retrieval.
February 17, 2025 at 2:04 AM
This benchmark paper on Multimodal Retrieval-Augmented Multimodal Generation provides interesting insights into the task:
- Challenges in long-text, high-image-density tasks
- The image ordering constraint proves to be an unsolved challenge
February 7, 2025 at 8:45 AM
Reducing the mutual information between the multimodal embeddings of different classes yields disentanglement and improves accuracy

Disentangling CLIP Features for Enhanced Localized Understanding (arxiv.org/abs/2502.02977)
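A hedged sketch of the general idea, using a cross-correlation proxy for mutual information (not the paper's exact objective):

```python
# Cross-correlation between class-wise embedding batches as a cheap
# proxy for mutual information; driving it toward zero discourages
# the two classes from sharing (entangled) features.
import torch

def decorrelation_penalty(emb_a: torch.Tensor, emb_b: torch.Tensor) -> torch.Tensor:
    """emb_a, emb_b: (N, D) multimodal embedding batches for two classes."""
    a = (emb_a - emb_a.mean(0)) / (emb_a.std(0) + 1e-6)  # standardize per feature
    b = (emb_b - emb_b.mean(0)) / (emb_b.std(0) + 1e-6)
    cross = a.T @ b / emb_a.shape[0]  # (D, D) cross-correlation matrix
    return (cross ** 2).sum()

loss = decorrelation_penalty(torch.randn(32, 512), torch.randn(32, 512))
```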
February 6, 2025 at 7:25 AM
A better way of encoding images and text in Large Vision-Language Models (LVLMs)
arxiv.org/abs/2502.01906

By the way, do you say LVLM or Multimodal Large Language Model (MLLM)? I don't think there's a clear naming convention 🤷
February 5, 2025 at 7:07 AM
Today is RAG (retrieval-augmented generation) day!
- SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Models (arxiv.org/html/2501.18...)
- RealRAG: Retrieval-augmented Realistic Image Generation via Self-reflective Contrastive Learning (arxiv.org/html/2502.00...)
February 4, 2025 at 2:13 PM
A method for retrieval-augmented generation that adds robustness against irrelevant information:
1. Defects Detection: evaluating whether the retrieved results contain misinformation.
2. Utility Extraction: generating correct answers even from defective inputs.
arxiv.org/abs/2501.18365
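A schematic of the two-stage idea, with a hypothetical `llm(prompt)` call standing in for any text-generation backend (not the paper's implementation):

```python
# Two-stage robust RAG sketch: flag suspect passages, then generate
# while telling the model which passages were flagged.
from typing import Callable, List

def robust_rag_answer(question: str, retrieved: List[str],
                      llm: Callable[[str], str]) -> str:
    # 1. Defects detection: ask the model to flag chunks that look
    #    misleading or irrelevant to the question.
    verdicts = [
        llm(f"Question: {question}\nPassage: {chunk}\n"
            "Is this passage reliable and relevant? Answer yes or no.")
        for chunk in retrieved
    ]
    # 2. Utility extraction: keep all chunks, but mark the suspect
    #    ones so the generator can still mine useful bits from them.
    context = "\n".join(
        f"[{'OK' if 'yes' in v.lower() else 'SUSPECT'}] {chunk}"
        for chunk, v in zip(retrieved, verdicts)
    )
    return llm(f"Context (some passages may be defective):\n{context}\n"
               f"Question: {question}\nAnswer using only reliable information.")
```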
February 3, 2025 at 3:43 AM
A simple but valid baseline for multimodal RAG (retrieval-augmented generation)
arxiv.org/abs/2501.15470
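For context, one generic shape such a baseline can take (hypothetical helpers, not necessarily the paper's pipeline): retrieve text and images in a shared CLIP-style embedding space, then hand both to a multimodal LLM:

```python
# Generic multimodal RAG baseline sketch; `mllm(prompt, images)` is a
# hypothetical multimodal LLM call, embeddings are precomputed.
import torch
import torch.nn.functional as F

def multimodal_rag(query_emb, text_embs, image_embs, texts, images, mllm, k=3):
    """query_emb: (D,); text_embs: (Nt, D); image_embs: (Ni, D)."""
    q = F.normalize(query_emb, dim=-1)
    top_txt = torch.topk(F.normalize(text_embs, dim=-1) @ q, k).indices
    top_img = torch.topk(F.normalize(image_embs, dim=-1) @ q, k).indices
    context = "\n".join(texts[i] for i in top_txt.tolist())
    return mllm(f"Context:\n{context}\nAnswer the query.",
                [images[i] for i in top_img.tolist()])
```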
January 30, 2025 at 5:51 AM
I've seen several papers recently focused on smoothly integrating images into RAG (retrieval-augmented generation) results for multimodal LLMs
- MuKA: aclanthology.org/2025.coling-...
- ImageRef-VL: arxiv.org/abs/2501.12418
January 27, 2025 at 1:40 PM
How to edit knowledge in a Multimodal LLM without modifying unrelated knowledge (arxiv.org/abs/2412.12821)
December 23, 2024 at 12:42 PM
Papers don't stop coming! Decomposing hierarchical captions to train vision-language models improves the accuracy of several downstream tasks (arxiv.org/abs/2412.08110)
December 12, 2024 at 12:51 PM
When using CLIP (or similar VLMs) as an embedder for a large multimodal model (LMM), the accuracy of in-context learning can be improved by aligning CLIP's ranking (the way it assigns relevance scores) with the LMM's ranking.
arxiv.org/abs/2412.07619
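A minimal sketch of one way to express such an alignment objective (the paper's formulation may differ): soften both score vectors into distributions over the candidate pool and minimize their KL divergence:

```python
# Ranking alignment sketch: match the retriever's score distribution
# to the LMM's over the same N candidates.
import torch
import torch.nn.functional as F

def ranking_alignment_loss(clip_scores, lmm_scores, tau=0.1):
    """clip_scores, lmm_scores: (N,) relevance scores over N candidates."""
    log_p_clip = F.log_softmax(clip_scores / tau, dim=-1)
    p_lmm = F.softmax(lmm_scores / tau, dim=-1)          # target ranking
    return F.kl_div(log_p_clip, p_lmm, reduction="sum")  # KL(p_lmm || p_clip)

loss = ranking_alignment_loss(torch.randn(8, requires_grad=True), torch.randn(8))
```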
December 12, 2024 at 12:33 PM
A paper explaining how, in order to successfully train a CLIP-like contrastive VL model, the alignment between the image and text encoders should be maintained
arxiv.org/abs/2412.04616
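For reference, one common way to quantify that alignment is the mean cosine similarity of paired embeddings; a minimal sketch (the paper may define it differently):

```python
# Cross-modal alignment as mean cosine similarity over matched pairs.
import torch
import torch.nn.functional as F

def cross_modal_alignment(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
    """img_emb, txt_emb: (N, D) embeddings of N matched image-text pairs."""
    return F.cosine_similarity(img_emb, txt_emb, dim=-1).mean()

score = cross_modal_alignment(torch.randn(64, 512), torch.randn(64, 512))
```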
December 10, 2024 at 6:55 AM
How to make multimodal large language models more robust to composite images (infographics, diagrams, etc.)
arxiv.org/abs/2412.05243
December 9, 2024 at 6:50 AM
Richer text in contrastive vision-language pretraining improves downstream performance. This was kind of already known, but the paper describes patterns for effective text augmentations:
arxiv.org/abs/2412.00440
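As a generic illustration of what caption enrichment can look like (not the paper's specific patterns; `llm(prompt)` is a hypothetical text-generation call):

```python
# Enrich short captions before contrastive pretraining by asking a
# text model to add detail; purely illustrative.
from typing import Callable, List

def enrich_captions(captions: List[str], llm: Callable[[str], str]) -> List[str]:
    return [
        llm("Rewrite this image caption with more detail about objects, "
            f"attributes, and spatial relations: {c}")
        for c in captions
    ]
```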

Also, the reason image augmentations are not used is probably this:
arxiv.org/abs/2405.187...
December 4, 2024 at 6:27 AM
A method to improve the accuracy of composed image retrieval by generating the image you want to search for (in some cases, simply using the generated image may be even better LOL)

arxiv.org/pdf/2411.16752
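A schematic sketch of the idea, with hypothetical `generate_image` and `embed_image` helpers (not the paper's code): synthesize the target image from the reference image plus the modification text, then retrieve by visual similarity:

```python
# Composed retrieval via a generated query image.
import torch
import torch.nn.functional as F

def composed_retrieval(ref_img, edit_text, gallery_embs, generate_image, embed_image):
    """gallery_embs: (N, D) precomputed gallery embeddings.
    generate_image / embed_image: hypothetical generator and image encoder."""
    target = generate_image(ref_img, edit_text)   # e.g., a diffusion model
    q = F.normalize(embed_image(target), dim=-1)  # (D,) query embedding
    sims = F.normalize(gallery_embs, dim=-1) @ q  # cosine similarities
    return torch.argsort(sims, descending=True)   # ranked gallery indices
```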
December 2, 2024 at 1:11 PM