Webpage: vision-x-nyu.github.io/thinking-in-...
ArXiv: arxiv.org/pdf/2412.14171
Eval Code: github.com/vision-x-nyu...
VSI-Bench: huggingface.co/datasets/nyu...
[n/n]
We analyze the models' self-explanations to attribute VSI-Bench performance to visual-spatial capabilities, and find that spatial and linguistic intelligence are largely distinct. [5/n]
We evaluate VSI-Bench on open- and closed-source MLLMs and find that MLLMs exhibit competitive—though subhuman—visual-spatial intelligence. [4/n]
Video mirrors how humans perceive spaces continuously over time, and by repurposing 3D reconstruction datasets, these videos capture complete indoor scenes for testing. [3/n]