@anjaliwgupta.bsky.social
It was an honor and a pleasure to collaborate with and learn from @drfeifei.bsky.social, @saining.bsky.social, Jihan Yang, Shusheng Yang, and Rilyn Han! I believe this is just the beginning for visual-spatial intelligence (and my PhD 😉), and it underscores the importance of vision in MLLMs. [7/n]
December 23, 2024 at 10:51 PM
Prompting for “cognitive maps”, a concept Edward Tolman introduced in the 1940s for the unified internal representation of a spatial environment that brains build, we find that MLLMs show a local spatial bias and that explicitly remembering a space improves their relational distance ability. [6/n]
December 23, 2024 at 10:48 PM
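For concreteness, here is a minimal sketch of the two-stage “cognitive map” prompting described in the post above. The query_mllm helper, the 10 x 10 grid size, and the prompt wording are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of two-stage cognitive-map prompting.
# Assumptions: `query_mllm` is a hypothetical stand-in for whatever MLLM API
# you call with video frames; the grid size and wording are illustrative,
# not the paper's exact prompts.

COGMAP_PROMPT = (
    "These frames show a walkthrough of an indoor scene. "
    "Build a cognitive map of the space: output a 10 x 10 grid (as JSON) giving "
    "the approximate (row, col) cell of each of these objects: {objects}."
)

QUESTION_PROMPT = (
    "Using the cognitive map above, answer the question.\n"
    "Question: Which of these objects is closest to the {anchor}? Options: {options}\n"
    "Answer with the option letter only."
)

def ask_with_cognitive_map(query_mllm, frames, objects, anchor, options):
    """First elicit an explicit map of the scene, then ask the relative-distance question."""
    cog_map = query_mllm(frames, COGMAP_PROMPT.format(objects=", ".join(objects)))
    answer = query_mllm(
        frames,
        cog_map + "\n\n" + QUESTION_PROMPT.format(anchor=anchor, options="; ".join(options)),
    )
    return cog_map, answer
```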
What does it mean to “think in space”? We analyze spatial intelligence linguistically and visually.

By analyzing models’ self-explanations, we attribute VSI-Bench performance to specific visual-spatial capabilities and find that spatial and linguistic intelligence are clearly distinct. [5/n]
December 23, 2024 at 10:47 PM
VSI-Bench tests configuration, measurement estimation, and spatiotemporal abilities across 5k+ Video QA pairs and eight task types.

We evaluate open- and closed-source MLLMs on VSI-Bench and find that they exhibit competitive, though still subhuman, visual-spatial intelligence. [4/n]
December 23, 2024 at 10:47 PM
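As a rough illustration, here is what running a model over VSI-Bench-style video QA pairs could look like. The field names (question, options, answer, task_type, video_path) and the numerical scoring rule (credit within 10% relative error) are illustrative assumptions, not the benchmark's official schema or metric.

```python
# Sketch of an evaluation loop over VSI-Bench-style video QA pairs.
# Assumptions: each example is a dict with `question`, `options` (empty for
# numerical-estimation tasks), `answer`, `task_type`, and `video_path`; the
# numerical scoring rule below is a simplification, not the official metric.

def score_example(pred: str, example: dict) -> float:
    if example["options"]:  # multiple-choice task: exact match on the option letter
        return float(pred.strip().upper() == example["answer"].strip().upper())
    try:  # numerical-estimation task: credit predictions within 10% relative error
        rel_err = abs(float(pred) - float(example["answer"])) / max(abs(float(example["answer"])), 1e-6)
    except ValueError:
        return 0.0
    return float(rel_err <= 0.10)

def evaluate(model_fn, dataset):
    """model_fn(video_path, prompt) -> str. Returns mean accuracy per task type."""
    totals, scores = {}, {}
    for ex in dataset:
        prompt = ex["question"]
        if ex["options"]:
            prompt += "\nOptions: " + "; ".join(ex["options"]) + "\nAnswer with the option letter only."
        pred = model_fn(ex["video_path"], prompt)
        t = ex["task_type"]
        totals[t] = totals.get(t, 0) + 1
        scores[t] = scores.get(t, 0.0) + score_example(pred, ex)
    return {t: scores[t] / totals[t] for t in totals}
```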
We propose VSI-Bench, a 3D video-based visual-spatial intelligence benchmark designed for MLLMs.

Video mirrors how humans perceive spaces, continuously and over time, and by repurposing 3D reconstruction datasets, these videos cover, and let us test, complete indoor scenes. [3/n]
December 23, 2024 at 10:46 PM
What is visual-spatial intelligence? It entails perceiving and mentally manipulating spatial relationships, and it requires visual perception, temporal processing, linguistic intelligence (to understand questions), and spatial reasoning. [2/n]
December 23, 2024 at 10:46 PM