I prefer the comment in a Ghibli style 🙂
March 26, 2025 at 7:29 PM
In synthetic tasks, as well as in real-world long-range NLP tasks, DeciMamba is able to extrapolate to sequences that are orders of magnitude longer than the ones seen during training. It does so while requiring fewer computational resources and without any retraining!
February 25, 2025 at 7:11 AM
To mitigate this behavior, we propose DeciMamba (Decimating-Mamba), the first context extension method for Mamba. We interpret the selective Delta_t value as a token's importance score, and discard unimportant tokens in order to restore Mamba's receptive field.
February 25, 2025 at 7:11 AM
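A minimal sketch of the decimation step described in that post, under my own assumptions (the tensor shapes, keep ratio, and function name are hypothetical, not the paper's code): rank tokens by their selective Delta_t values and gather only the top-scoring ones before they move on through the layer.

```python
import torch

def decimate_tokens(hidden, delta_t, keep_ratio=0.5):
    """Hypothetical sketch: treat the selective Delta_t as a per-token
    importance score and keep only the highest-scoring tokens.

    hidden:  (batch, seq_len, dim)  token representations entering a Mamba layer
    delta_t: (batch, seq_len)       selective Delta_t values for the same tokens
    """
    batch, seq_len, dim = hidden.shape
    k = max(1, int(seq_len * keep_ratio))

    # Higher Delta_t ~ more important token (the interpretation in the post).
    keep_idx = delta_t.topk(k, dim=1).indices      # (batch, k)
    keep_idx, _ = keep_idx.sort(dim=1)             # preserve the original token order

    gather_idx = keep_idx.unsqueeze(-1).expand(-1, -1, dim)
    return hidden.gather(1, gather_idx)            # (batch, k, dim)
```

In the actual method the pooling presumably happens per layer and only when sequences exceed the training length; this sketch shows just the rank-and-gather step.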
Mamba is super efficient, making it ideal for long-context tasks. Yet we find that Mamba's extrapolation capabilities are limited by unintentionally restricted effective receptive fields (ERFs), a counter-intuitive result given that, in theory, each layer has an infinite receptive field.
February 25, 2025 at 7:11 AM
DeciMamba, the first context extension method for Mamba, is accepted to #ICLR2025!

What prevents Mamba from extrapolating to sequences that are significantly longer than those it was trained on?
Furthermore, can Mamba solve long-range NLP tasks using short-range training only?
February 25, 2025 at 7:11 AM
7/ Moreover, the results reveal potential dataset contamination. Our table compares performance on static datasets vs. LiveXiv: models that rank higher on static data than on LiveXiv show negative values in the difference column, suggesting possible test-set contamination.
February 7, 2025 at 8:23 PM
6/ Our results across arXiv domains reveal LMMs' performance and robustness. The following plot shows that older models like InstructBLIP perform poorly across domains, while newer ones like QWEN2-VL show strong robustness. Some models fall in between, showing high domain sensitivity.
February 7, 2025 at 8:23 PM
5/ Our first version includes 16K+ questions on tables (TQA) and figures (VQA), evaluating 17 LMMs. Using Item Response Theory, we estimate model scores on new data without full re-evaluation. Our approach cuts evaluation costs by 70%, requiring full evaluation of only 5 models.
February 7, 2025 at 8:23 PM
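The thread doesn't specify the IRT variant, so here is a minimal 2PL-style sketch of the idea with hypothetical function names: item parameters come from the fully evaluated models, and a new model's ability is then fit by maximum likelihood from its answers on a small subset of items.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def p_correct(theta, a, b):
    """2PL IRT: probability that a model with ability theta answers an item
    with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def estimate_ability(responses, a, b):
    """Maximum-likelihood ability for a new model, given its 0/1 responses on a
    small item subset whose parameters (a, b) were fit on fully evaluated models."""
    responses = np.asarray(responses, dtype=float)

    def neg_log_lik(theta):
        p = np.clip(p_correct(theta, a, b), 1e-9, 1 - 1e-9)
        return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))

    return minimize_scalar(neg_log_lik, bounds=(-4, 4), method="bounded").x
```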
4/ We ensure a high-quality dataset through filtering of the auto-generated questions. To reduce hallucinations, we validate image-question-answer triplets with an LMM and discard questions consistently answered correctly by an LLM without the images, ensuring a truly multi-modal dataset.
February 7, 2025 at 8:23 PM
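A hedged sketch of the "blind LLM" filter described above; `blind_llm` and the question-dict schema are placeholders, not the actual pipeline:

```python
def filter_blind_solvable(questions, blind_llm, n_trials=3):
    """Drop questions that a text-only LLM answers correctly every time
    without seeing the image, since they don't truly require visual input.

    questions: iterable of dicts with "question", "choices", "answer" keys (assumed schema)
    blind_llm: callable (question, choices) -> predicted answer, no image provided
    """
    kept = []
    for q in questions:
        blind_correct = all(
            blind_llm(q["question"], q["choices"]) == q["answer"]
            for _ in range(n_trials)
        )
        if not blind_correct:
            kept.append(q)
    return kept
```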
3/ To overcome all of the above, we propose LiveXiv, an automatic, large-scale, live multi-modal scientific benchmark. Based on arXiv papers, we generate VQA pairs using an LMM. With an efficient evaluation method, we aim to reduce the logistical burden of keeping evaluations feasible.
February 7, 2025 at 8:23 PM
1/ Large-scale training data has given LMMs exceptional abilities on many downstream tasks. However, current static datasets are at risk of test-data contamination due to this large-scale training, potentially reflecting inflated abilities on given tasks.
February 7, 2025 at 8:23 PM
5/ 📊 Results:
Across small-scale VLMs and dense caption datasets, KnowAda:
✅ Reduces hallucinations while preserving descriptiveness.
✅ Provides control over the hallucination-descriptiveness tradeoff.
✅ Outperforms baselines on automatic metrics and human evaluations.
January 26, 2025 at 3:44 PM
4/ 🎯 Our Evaluation Framework: DNLI
We introduce Decomposed NLI (DNLI), which breaks captions into atomic propositions for NLI evaluation.
This fine-grained approach aligns closely with human intuition, enabling a more detailed and accurate assessment of caption quality.
January 26, 2025 at 3:44 PM
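A rough sketch of how such a decomposed-NLI score could be computed; the `decompose` and `nli_entails` callables stand in for an LLM-based splitter and an off-the-shelf NLI model, and the choice of premise is my assumption, not necessarily the paper's exact setup:

```python
def dnli_score(reference, caption, decompose, nli_entails):
    """Fraction of the caption's atomic propositions entailed by the reference.

    decompose:   callable caption -> list of atomic propositions
    nli_entails: callable (premise, hypothesis) -> bool (entailment decision)
    """
    propositions = decompose(caption)
    if not propositions:
        return 0.0
    entailed = sum(nli_entails(reference, p) for p in propositions)
    return entailed / len(propositions)
```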
3/ 💡 The Solution: Knowledge Adapted (KnowAda) Fine-Tuning
Here’s how KnowAda works:
It identifies VLM knowledge gaps in each caption.
Then it adjusts or removes details tied to these gaps, ensuring captions align with the model’s existing knowledge.
January 26, 2025 at 3:44 PM
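A minimal sketch of that two-step idea, with all helper interfaces hypothetical: probe the VLM with questions derived from the caption's details, then rewrite the caption without the details it cannot verify.

```python
def knowada_adapt(image, caption, to_questions, vlm_answer, rewrite_without):
    """Sketch of knowledge-adapted caption rewriting (hypothetical interfaces).

    to_questions:    callable caption -> list of (question, expected_answer) pairs,
                     one per visual detail mentioned in the caption
    vlm_answer:      callable (image, question) -> the VLM's answer
    rewrite_without: callable (caption, unknown_details) -> caption with those
                     details removed or simplified
    """
    unknown = [
        (question, expected)
        for question, expected in to_questions(caption)
        if vlm_answer(image, question) != expected
    ]
    # Keep only the details the VLM can already ground in the image.
    return rewrite_without(caption, unknown)
```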
1/ TL;DR: We propose KnowAda, a data-centric fine-tuning approach that adapts captions to the VLM's knowledge, balancing descriptiveness and hallucinations in the fine-tuned VLM. We also introduce Decomposed NLI (DNLI), a framework for fine-grained dense-caption evaluation.
January 26, 2025 at 3:44 PM