Paul Gavrikov
@paulgavrikov.bsky.social
PostDoc Tübingen AI Center | Machine Learning & Computer Vision
paulgavrikov.github.io
Agree 100%! I think this paper does a great job of outlining issues in the original paper.
October 8, 2025 at 7:43 PM
If you think of texture as the material/surface property (which I think is the original perspective), then the ablation in this paper is insufficient to suppress the cue.
October 8, 2025 at 4:43 PM
I really liked the thoroughness of this paper, but I'm afraid the results build on a shaky definition of "texture". If you replace "texture" in the original paper with "local details", it's virtually the same finding.
October 8, 2025 at 4:43 PM
4) Models answer consistently for easy questions ("Is it day?": yes, "Is it night?": no) but fall back to guessing for hard tasks such as reasoning. Concerningly, some models even fall below random chance, hinting at shortcuts.
October 1, 2025 at 1:17 PM
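For readers curious what the consistency check looks like in practice, here is a minimal sketch, assuming answers to complementary yes/no question pairs have already been collected; the data layout and names are hypothetical, not the benchmark's actual schema:

def consistency_rate(pairs):
    # pairs: model answers (pos, neg) to complementary questions,
    # e.g. ("Is it day?", "Is it night?"). A consistent model gives
    # opposite answers to the two questions.
    consistent = sum(1 for pos, neg in pairs if pos != neg)
    return consistent / len(pairs)

def above_chance(accuracy, n_options=2):
    # A binary question has a 1/n_options random-guessing baseline;
    # accuracy persistently below it hints at a systematic shortcut.
    return accuracy > 1.0 / n_options

answers = [("yes", "no"), ("yes", "yes"), ("no", "yes")]
print(f"consistency: {consistency_rate(answers):.2f}")  # 0.67
print(above_chance(0.42))  # False: below the 50% binary baseline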
3) Similar trends for OCR. Our OCR questions contain constraints (e.g., the fifth word) that models often fail to consider. Common minor errors include a strong tendency to autocorrect typos or to hallucinate more common spellings, especially for non-Latin scripts and non-English text.
October 1, 2025 at 1:17 PM
2) Models cannot count in dense scenes, and performance degrades as the number of objects grows; they typically "undercount", and the errors are massive. Here is the distribution over all models:
October 1, 2025 at 1:17 PM
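For context, a minimal sketch of the signed counting error underlying a distribution like the one above; preds and targets here are made-up illustrative values, not benchmark data:

def signed_errors(preds, targets):
    # Negative values mean the model undercounted.
    return [p - t for p, t in zip(preds, targets)]

print(signed_errors(preds=[8, 40, 3], targets=[12, 95, 3]))
# [-4, -55, 0] -> heavy undercounting in the dense scenes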
1) Our benchmark is hard: the best model (o3) achieves an accuracy of 69.5% in total, but only 19.6% on the hardest split. We observe significant performance drops on some tasks.
October 1, 2025 at 1:17 PM
Our questions are built on top of a fresh dataset of 150 high-resolution, highly detailed scenes probing core vision skills in 6 categories: counting, OCR, reasoning, and activity, attribute, and global scene recognition. The ground truth is private, and our eval server is live!
October 1, 2025 at 1:17 PM
Joint work with Wei Lin, Jehanzeb Mirza, Soumya Jahagirdar, Muhammad Huzaifa, Sivan Doveh, James Glass, and Hilde Kuehne.
September 8, 2025 at 3:28 PM
Paper coming soon! In the meantime:
• Try your model: huggingface.co/spaces/paulg...
• Dataset: huggingface.co/datasets/pau...
• Code: github.com/paulgavrikov...
September 8, 2025 at 3:28 PM
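If you want to try your own model, a rough sketch of the loop might look like the following, assuming the Hugging Face datasets library. The dataset ID, field names, and my_vlm are hypothetical placeholders (the real links are truncated above), and the exact submission format is defined by the eval server:

import json
from datasets import load_dataset

# Hypothetical dataset ID; see the truncated Dataset link above for the real one.
ds = load_dataset("paulgavrikov/visualoverload", split="test")

predictions = {}
for example in ds:
    image = example["image"]        # assumed field: the high-res scene
    question = example["question"]  # assumed field: the question text
    predictions[example["id"]] = my_vlm(image, question)  # plug in your model

# Ground truth is private, so scoring happens server-side: save the
# predictions and upload them via the Hugging Face space linked above.
with open("predictions.json", "w") as f:
    json.dump(predictions, f)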
🤖 We tested 37 models. Results?
Even top VLMs break down on “easy” tasks in overloaded scenes.

Best model (o3):
• 19.8% accuracy (hardest split)
• 69.5% overall
September 8, 2025 at 3:28 PM
📊 VisualOverload =
• 2,720 Q–A pairs
• 6 vision tasks
• 150 fresh, high-res, royalty-free artworks
• Privately held ground-truth responses
September 8, 2025 at 3:28 PM
It was truly special reconnecting with old friends and making so many new ones. Beyond the conference halls, we had some unforgettable adventures — exploring the city, visiting the woodlands, and singing our hearts out at karaoke nights. 🎤🦁🌳
May 3, 2025 at 10:03 AM
Looking forward to meeting you!
April 24, 2025 at 1:45 AM