Webpage: vision-x-nyu.github.io/thinking-in-...
ArXiv: arxiv.org/pdf/2412.14171
Eval Code: github.com/vision-x-nyu...
VSI-Bench: huggingface.co/datasets/nyu...
[n/n]
We analyze the models' self-explanations to attribute VSI-Bench performance to visual-spatial capabilities, and find that spatial and linguistic intelligence are largely distinct. [5/n]
We evaluate VSI-Bench on open- and closed-source MLLMs and find that MLLMs exhibit competitive—though subhuman—visual-spatial intelligence. [4/n]
Video mirrors how humans perceive spaces continuously over time, and by repurposing 3D reconstruction datasets, these videos capture complete indoor scenes for testing. [3/n]