Antonio Tejero-de-Pablos
@toni-tiler.bsky.social
Research scientist in computer vision / Samurai / Rapper
Am I the only one who feels like "the LLM whisperer" when vibe-coding?
November 17, 2025 at 8:42 AM
In the multi-object setting, the CLIP encoder prioritizes objects that appear earlier in the caption or that are larger in the image.
arxiv.org/abs/2502.19842
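A minimal probe sketch of this bias, assuming the off-the-shelf Hugging Face CLIP and a placeholder image path (this is not the paper's evaluation protocol):

```python
# Minimal probe sketch: does mention order in the caption change
# CLIP's image-text similarity? "photo.jpg" is a placeholder path.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # e.g., a scene with a dog and a ball
captions = [
    "a dog and a ball on the grass",  # dog mentioned first
    "a ball and a dog on the grass",  # ball mentioned first
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # (1 image, 2 captions)

# A consistent gap between the two scores across many images would
# suggest sensitivity to mention order.
print(logits.squeeze().tolist())
```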
April 20, 2025 at 2:38 PM
This well-known problem is not being dealt with properly. A tutorial at SIGIR 2024 proposed a framework in which a model masters fundamental reasoning tasks from basic curated data, instead of being made to learn tons of Internet sh*t. Same as humans do at school. The rest is information retrieval.
February 17, 2025 at 2:04 AM
This benchmark paper on Multimodal Retrieval-Augmented Multimodal Generation provides interesting insights into the task:
- Challenges in long-text, high-image-density tasks
- The image ordering constraint proves to be an unsolved challenge
February 7, 2025 at 8:45 AM
Reducing the mutual information between the multimodal embeddings of different classes yields disentanglement and improves accuracy

Disentangling CLIP Features for Enhanced Localized Understanding (arxiv.org/abs/2502.02977)
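A hedged sketch of the general idea, using a cross-correlation proxy for mutual information (not the paper's exact objective):

```python
# Cross-correlation between class-wise embedding batches as a cheap
# proxy for mutual information; driving it toward zero discourages
# the two classes from sharing (entangled) features.
import torch

def decorrelation_penalty(emb_a: torch.Tensor, emb_b: torch.Tensor) -> torch.Tensor:
    """emb_a, emb_b: (N, D) multimodal embedding batches for two classes."""
    a = (emb_a - emb_a.mean(0)) / (emb_a.std(0) + 1e-6)  # standardize per feature
    b = (emb_b - emb_b.mean(0)) / (emb_b.std(0) + 1e-6)
    cross = a.T @ b / emb_a.shape[0]  # (D, D) cross-correlation matrix
    return (cross ** 2).sum()

loss = decorrelation_penalty(torch.randn(32, 512), torch.randn(32, 512))
```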
February 6, 2025 at 7:25 AM
A better way of encoding images and text in Large Vision-Language Models (LVLMs)
arxiv.org/abs/2502.01906

By the way, do you say LVLM or Multimodal Large Language Model (MLLM)? I don't think there's a clear naming convention 🤷
February 5, 2025 at 7:07 AM
Today is RAG (retrieval-augmented generation) day!
- SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Models (arxiv.org/html/2501.18...)
- RealRAG: Retrieval-augmented Realistic Image Generation via Self-reflective Contrastive Learning (arxiv.org/html/2502.00...)
February 4, 2025 at 2:13 PM
A method for retrieval-augmented generation that adds robustness against irrelevant information:
1. Defects Detection: evaluating whether the retrieved results contain misinformation.
2. Utility Extraction: generating correct answers even from defective inputs.
arxiv.org/abs/2501.18365
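A schematic of the two-stage idea, with a hypothetical `llm(prompt)` call standing in for any text-generation backend (not the paper's implementation):

```python
# Two-stage robust RAG sketch: flag suspect passages, then generate
# while telling the model which passages were flagged.
from typing import Callable, List

def robust_rag_answer(question: str, retrieved: List[str],
                      llm: Callable[[str], str]) -> str:
    # 1. Defects detection: ask the model to flag chunks that look
    #    misleading or irrelevant to the question.
    verdicts = [
        llm(f"Question: {question}\nPassage: {chunk}\n"
            "Is this passage reliable and relevant? Answer yes or no.")
        for chunk in retrieved
    ]
    # 2. Utility extraction: keep all chunks, but mark the suspect
    #    ones so the generator can still mine useful bits from them.
    context = "\n".join(
        f"[{'OK' if 'yes' in v.lower() else 'SUSPECT'}] {chunk}"
        for chunk, v in zip(retrieved, verdicts)
    )
    return llm(f"Context (some passages may be defective):\n{context}\n"
               f"Question: {question}\nAnswer using only reliable information.")
```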
February 3, 2025 at 3:43 AM
A simple but valid baseline for multimodal RAG (retrieval-augmented generation)
arxiv.org/abs/2501.15470
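For context, one generic shape such a baseline can take (hypothetical helpers, not necessarily the paper's pipeline): retrieve text and images in a shared CLIP-style embedding space, then hand both to a multimodal LLM:

```python
# Generic multimodal RAG baseline sketch; `mllm(prompt, images)` is a
# hypothetical multimodal LLM call, embeddings are precomputed.
import torch
import torch.nn.functional as F

def multimodal_rag(query_emb, text_embs, image_embs, texts, images, mllm, k=3):
    """query_emb: (D,); text_embs: (Nt, D); image_embs: (Ni, D)."""
    q = F.normalize(query_emb, dim=-1)
    top_txt = torch.topk(F.normalize(text_embs, dim=-1) @ q, k).indices
    top_img = torch.topk(F.normalize(image_embs, dim=-1) @ q, k).indices
    context = "\n".join(texts[i] for i in top_txt.tolist())
    return mllm(f"Context:\n{context}\nAnswer the query.",
                [images[i] for i in top_img.tolist()])
```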
January 30, 2025 at 5:51 AM
I've seen several papers recently focused on smoothly integrating images into RAG (retrieval-augmented generation) results for multimodal LLMs
- MuKA: aclanthology.org/2025.coling-...
- ImageRef-VL: arxiv.org/abs/2501.12418
January 27, 2025 at 1:40 PM
How to edit knowledge in a Multimodal LLM without modifying unrelated knowledge (arxiv.org/abs/2412.12821)
December 23, 2024 at 12:42 PM
Papers don't stop coming! Decomposing hierarchical captions to train vision-language models improves the accuracy of several downstream tasks (arxiv.org/abs/2412.08110)
December 12, 2024 at 12:51 PM
When using CLIP (or similar VLMs) as an embedder for a large multimodal model (LMM), the accuracy of in-context learning can be improved by aligning CLIP's ranking (the way it assigns relevance scores) with the LMM's ranking.
arxiv.org/abs/2412.07619
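A minimal sketch of one way to express such an alignment objective (the paper's formulation may differ): soften both score vectors into distributions over the candidate pool and minimize their KL divergence:

```python
# Ranking alignment sketch: match the retriever's score distribution
# to the LMM's over the same N candidates.
import torch
import torch.nn.functional as F

def ranking_alignment_loss(clip_scores, lmm_scores, tau=0.1):
    """clip_scores, lmm_scores: (N,) relevance scores over N candidates."""
    log_p_clip = F.log_softmax(clip_scores / tau, dim=-1)
    p_lmm = F.softmax(lmm_scores / tau, dim=-1)          # target ranking
    return F.kl_div(log_p_clip, p_lmm, reduction="sum")  # KL(p_lmm || p_clip)

loss = ranking_alignment_loss(torch.randn(8, requires_grad=True), torch.randn(8))
```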
December 12, 2024 at 12:33 PM
A paper explaining how, in order to successfully train a CLIP-like contrastive VL model, the alignment between the image and text encoders should be maintained
arxiv.org/abs/2412.04616
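For reference, one common way to quantify that alignment is the mean cosine similarity of paired embeddings; a minimal sketch (the paper may define it differently):

```python
# Cross-modal alignment as mean cosine similarity over matched pairs.
import torch
import torch.nn.functional as F

def cross_modal_alignment(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
    """img_emb, txt_emb: (N, D) embeddings of N matched image-text pairs."""
    return F.cosine_similarity(img_emb, txt_emb, dim=-1).mean()

score = cross_modal_alignment(torch.randn(64, 512), torch.randn(64, 512))
```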
December 10, 2024 at 6:55 AM
How to make multimodal large language models more robust to composite images (infographics, diagrams, etc.)
arxiv.org/abs/2412.05243
December 9, 2024 at 6:50 AM
Richer text in contrastive vision-language pretraining improves downstream performance. This was kind of already known, but the paper describes patterns for effective text augmentations:
arxiv.org/abs/2412.00440
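As a generic illustration of what caption enrichment can look like (not the paper's specific patterns; `llm(prompt)` is a hypothetical text-generation call):

```python
# Enrich short captions before contrastive pretraining by asking a
# text model to add detail; purely illustrative.
from typing import Callable, List

def enrich_captions(captions: List[str], llm: Callable[[str], str]) -> List[str]:
    return [
        llm("Rewrite this image caption with more detail about objects, "
            f"attributes, and spatial relations: {c}")
        for c in captions
    ]
```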

Also, the reason image augmentations are not used is probably this:
arxiv.org/abs/2405.187...
December 4, 2024 at 6:27 AM
A method to improve the accuracy of composed image retrieval by generating the image you want to search for (in some cases, simply using the generated image may be even better LOL)

arxiv.org/pdf/2411.16752
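A schematic sketch of the idea, with hypothetical `generate_image` and `embed_image` helpers (not the paper's code): synthesize the target image from the reference image plus the modification text, then retrieve by visual similarity:

```python
# Composed retrieval via a generated query image.
import torch
import torch.nn.functional as F

def composed_retrieval(ref_img, edit_text, gallery_embs, generate_image, embed_image):
    """gallery_embs: (N, D) precomputed gallery embeddings.
    generate_image / embed_image: hypothetical generator and image encoder."""
    target = generate_image(ref_img, edit_text)   # e.g., a diffusion model
    q = F.normalize(embed_image(target), dim=-1)  # (D,) query embedding
    sims = F.normalize(gallery_embs, dim=-1) @ q  # cosine similarities
    return torch.argsort(sims, descending=True)   # ranked gallery indices
```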
December 2, 2024 at 1:11 PM