rajagiryes.bsky.social
@rajagiryes.bsky.social
I prefer the comment in a Ghibli style 🙂
March 26, 2025 at 7:29 PM
Time to move to arxiv + gs
March 2, 2025 at 11:17 PM
This work was a great collaboration with Assaf Ben-Kish, Itamar Zimmerman, Shady Abu-Hussein, Nadav Cohen, Amir Globerson and Lior Wolf

Please see our paper for more results and details!
arxiv.org/abs/2406.14528

Code and models are available at github.com/assafbk/Deci...
DeciMamba: Exploring the Length Extrapolation Potential of Mamba
February 25, 2025 at 7:11 AM
On synthetic tasks, as well as on real-world long-range NLP tasks, DeciMamba extrapolates to sequences that are orders of magnitude longer than those seen during training, while requiring fewer computational resources and no retraining!
February 25, 2025 at 7:11 AM
To mitigate this behavior, we propose DeciMamba (Decimating-Mamba), the first context-extension method for Mamba. We interpret the selective Delta_t value as a token's importance score and discard unimportant tokens in order to restore Mamba's receptive field.
February 25, 2025 at 7:11 AM
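A minimal sketch of this decimation idea, assuming PyTorch tensors and a mean-pooling of Delta_t over channels; the keep_ratio parameter and the function below are illustrative, not the authors' implementation (see the linked repo for that).

# Hedged sketch: keep the top-k tokens per sequence according to their mean
# selective Delta_t ("importance") and drop the rest before the SSM scan.
import torch

def decimate_tokens(hidden, delta_t, keep_ratio=0.5):
    # hidden: (B, L, D) token states; delta_t: (B, L, D) selective step sizes
    importance = delta_t.mean(dim=-1)                      # (B, L): score per token
    k = max(1, int(keep_ratio * hidden.shape[1]))          # how many tokens survive
    kept_idx = importance.topk(k, dim=1).indices.sort(dim=1).values  # keep original order
    kept = torch.gather(hidden, 1,
                        kept_idx.unsqueeze(-1).expand(-1, -1, hidden.shape[-1]))
    return kept, kept_idx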
Mamba is super efficient, making it ideal for long-context tasks. Yet we find that Mamba's extrapolation capabilities are limited by an unintentionally restricted effective receptive field (ERF), a counter-intuitive result given that, in theory, each layer has an infinite receptive field.
February 25, 2025 at 7:11 AM
9/ Thanks to the whole team for bringing LiveXiv to life: Nimrod Shabtay, Felipe Maia Polo, @sdoveh.bsky.social, Wei Lin, Jehanzeb Mirza, @lchoshen.bsky.social, @m-yurochkin.bsky.social, @aarbelle.bsky.social, @leokarlin.bsky.social
February 7, 2025 at 8:23 PM
8/ In addition, we provide analyses based on the visual and textual content, dividing the questions and figures into predefined categories.
Check the paper for more details (arxiv.org/abs/2410.10783)
LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content
February 7, 2025 at 8:23 PM
7/ Moreover, the results reveal potential dataset contamination. Our table compares performance on static datasets vs. LiveXiv: models that rank higher on static data than on LiveXiv show negative values in the difference column, which suggests possible test-set contamination.
February 7, 2025 at 8:23 PM
6/ Our results across arXiv domains reveal LMMs' performance and robustness. The following plot shows that older models like InstructBLIP perform poorly across domains, while newer ones like QWEN2-VL show strong robustness. Some models fall in between, showing high domain sensitivity.
February 7, 2025 at 8:23 PM
5/ Our first version includes 16K+ questions on tables (TQA) and figures (VQA), evaluating 17 LMMs. Using Item Response Theory, we estimate model scores on new data without full re-evaluation. Our approach cuts evaluation costs by 70%, requiring full evaluation of only 5 models.
February 7, 2025 at 8:23 PM
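A hedged sketch of the Item Response Theory idea, using a simple Rasch (1PL) model rather than the paper's exact estimator: item difficulties are fit from the fully evaluated anchor models, and each remaining model's ability, and hence its expected full-benchmark score, is estimated from a small evaluated subset.

# Hedged sketch of IRT-based score estimation (Rasch / 1PL), not the paper's code.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_item_difficulty(anchor_correct):
    # anchor_correct: (n_anchor_models, n_items) binary right/wrong matrix.
    # Crude moment estimate: difficulty = logit of the item's error rate.
    p = anchor_correct.mean(axis=0).clip(0.05, 0.95)
    return np.log((1 - p) / p)

def estimate_ability(subset_correct, subset_difficulty):
    # 1-D grid search for the ability that maximizes the Rasch likelihood
    # on the small evaluated subset of questions.
    grid = np.linspace(-4, 4, 401)
    p = sigmoid(grid[:, None] - subset_difficulty[None, :])
    ll = (subset_correct * np.log(p) + (1 - subset_correct) * np.log(1 - p)).sum(axis=1)
    return grid[np.argmax(ll)]

def predict_full_score(ability, all_difficulty):
    # Expected accuracy over the full benchmark without evaluating it.
    return sigmoid(ability - all_difficulty).mean()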
4/ We ensure a high-quality dataset through filtering of the auto-generated questions. To reduce hallucinations, we validate image-question-answer triplets with an LMM and discard questions consistently answered correctly by an LLM without images, ensuring a truly multi-modal dataset.
February 7, 2025 at 8:23 PM
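A hedged sketch of that filtering step; lmm_answer and llm_answer below are hypothetical stand-ins for the actual model calls, and the repetition count is an assumption, not a detail from the paper.

# Hedged sketch of the two filters applied to each auto-generated QA pair.
def keep_question(image, question, answer, lmm_answer, llm_answer, n_trials=3):
    # Filter 1: an LMM that sees the image must reproduce the reference answer,
    # otherwise the auto-generated pair is likely a hallucination.
    if lmm_answer(image, question) != answer:
        return False
    # Filter 2: if a text-only LLM consistently answers correctly *without* the
    # image, the question is not truly multi-modal and is discarded.
    blind_correct = sum(llm_answer(question) == answer for _ in range(n_trials))
    return blind_correct < n_trials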
3/ To overcome all of the above, we propose LiveXiv, an automatic large-scale live multi-modal scientific benchmark. Based on arXiv papers, we generate VQA using an LMM, and with an efficient evaluation method we aim to reduce the logistical burden of keeping evaluation feasible.
February 7, 2025 at 8:23 PM
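For intuition only, a hypothetical end-to-end skeleton of such a live pipeline; every helper name below (fetch_new_papers, extract_figures, generate_vqa, passes_filters) is a placeholder, not the LiveXiv code.

# Hypothetical skeleton: collect newly published papers, turn their figures and
# tables into LMM-generated VQA, and keep only pairs that pass the
# multi-modality filters described in post 4/ above.
def build_live_benchmark(fetch_new_papers, extract_figures, generate_vqa, passes_filters):
    benchmark = []
    for paper in fetch_new_papers():
        for figure in extract_figures(paper):
            for qa in generate_vqa(figure):
                if passes_filters(figure, qa):
                    benchmark.append({"image": figure, **qa})
    return benchmark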
2/ Live benchmarks tackle test-set contamination by using evolving datasets that are unusable for LMM training. While they have proven successful in mitigating contamination, they demand significant engineering effort to maintain the dynamic datasets and an up-to-date leaderboard.
February 7, 2025 at 8:23 PM
1/ Large-scale training data has given LMMs exceptional abilities on many downstream tasks. However, current static evaluation datasets are at risk of test-data contamination due to this large-scale training, potentially reflecting inflated abilities on the evaluated tasks.
February 7, 2025 at 8:23 PM