Lightnews — Scholar-powered news

Jiafei Duan

@djiafei.bsky.social

160 followers 330 following 19 posts

Robotics PhD student @uwcse | Graduate Student Researcher @allen_ai | Ex-@NVIDIA | @ASTARsg scholars | BEng from @ntueee. Research in robot learning and embodied AI

www.duanjiafei.com


Jiafei Duan

@djiafei.bsky.social

Paper link: arxiv.org/abs/2412.07755

SAT: Spatial Aptitude Training for Multimodal Language Models

Spatial perception is a fundamental component of intelligence. While many studies highlight that large multimodal language models (MLMs) struggle to reason about space, they only test for static spati...

arxiv.org

December 11, 2024 at 4:12 PM

Jiafei Duan

@djiafei.bsky.social

10/🧵Curious for more? Check out our paper for the full breakdown: "SAT: Spatial Aptitude Training for Multimodal Language Models" by @ARRay693 @ehsanik @anikembhavi @rosemhendrix @RanjayKrishna @KuoHaoZeng @kate_saenko_ @drbashkirova et al.

December 11, 2024 at 4:12 PM

Jiafei Duan

@djiafei.bsky.social

9/🧵The takeaway: Dynamic spatial QAs improve static QA performance too!
Mixing static & dynamic training data results in significant accuracy gains across all tasks. 📊

December 11, 2024 at 4:12 PM

Jiafei Duan

@djiafei.bsky.social

8/🧵Challenges MLMs face:
Even strong models perform near-randomly on SAT's dynamic tasks.
Egocentric movement and multiview reasoning remain tough nuts to crack.

December 11, 2024 at 4:12 PM

Jiafei Duan

@djiafei.bsky.social

7/🧵SAT enables five complex spatial tasks:
Egocentric Movement
Object Movement
Allocentric Perspective
Goal Aiming
Action Consequence
Each task tests unique dimensions of spatial cognition.🧠

December 11, 2024 at 4:12 PM

Jiafei Duan

@djiafei.bsky.social

6/🧵How does SAT generate data?
Uses ProcTHOR for 3D scenes.
Procedurally generates static & dynamic QAs.
Scalable, cost-effective, & adaptable for new tasks. 🏠

December 11, 2024 at 4:12 PM
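As a rough illustration of the procedural QA generation mentioned in post 6/, here is a minimal sketch of how static left/right questions could be derived from known object positions. It is not SAT's actual pipeline and does not call ProcTHOR; the `SceneObject` type, the `x` coordinate convention, and the `generate_static_qas` helper are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class SceneObject:
    """Hypothetical object record; not a ProcTHOR type."""
    name: str
    x: float  # assumed horizontal position in the camera frame (smaller = further left)


def make_left_right_qa(a: SceneObject, b: SceneObject) -> tuple[str, str]:
    """Build one static yes/no question from two object positions."""
    question = f"Is the {a.name} to the left of the {b.name}?"
    answer = "yes" if a.x < b.x else "no"
    return question, answer


def generate_static_qas(objects: list[SceneObject]) -> list[tuple[str, str]]:
    """Generate one QA pair for every ordered pair of distinct objects."""
    return [make_left_right_qa(a, b) for a, b in permutations(objects, 2)]


if __name__ == "__main__":
    scene = [SceneObject("mug", -0.5), SceneObject("lamp", 0.3), SceneObject("sofa", 1.2)]
    for question, answer in generate_static_qas(scene):
        print(question, answer)
```

Because the answers come from scene coordinates rather than human labels, a generator of this kind scales with the number of scenes, which is the property the post highlights.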

Jiafei Duan

@djiafei.bsky.social

5/🧵Here's the kicker: Fine-tuning on SAT makes the open-source LLaVA-13B model match or surpass proprietary giants like GPT4-V in spatial reasoning! 🎯

December 11, 2024 at 4:12 PM

Jiafei Duan

@djiafei.bsky.social

4/🧵 Results? SAT improves performance not only on its own dataset but also boosts zero-shot spatial reasoning:
+23% on CVBench
+9% on BLINK (harder benchmarks)
+18% on Visual Spatial Relations (VSR) dataset. 💪

December 11, 2024 at 4:12 PM

Jiafei Duan

@djiafei.bsky.social

3/🧵Example tasks SAT tackles:
Static: Is object X to the left of object Y?
Dynamic: How did the camera move between frames? Did the object get closer or further?
Perspective: What does object placement look like from point X?

December 11, 2024 at 4:12 PM
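To make the dynamic example in post 3/ concrete ("Did the object get closer or further?"), the sketch below labels the answer from 3D positions in two frames. It assumes ground-truth camera and object coordinates are available from a simulator; the function name `distance_change`, the tolerance `tol`, and the "same" label are assumptions for illustration, not part of SAT.

```python
import math

Point = tuple[float, float, float]


def distance_change(camera: Point, obj_before: Point, obj_after: Point,
                    tol: float = 1e-6) -> str:
    """Label whether an object moved closer to or further from the camera.

    Compares Euclidean camera-to-object distance in two frames.
    Differences within `tol` are labeled "same" (an assumed convention).
    """
    d_before = math.dist(camera, obj_before)
    d_after = math.dist(camera, obj_after)
    if d_after < d_before - tol:
        return "closer"
    if d_after > d_before + tol:
        return "further"
    return "same"


if __name__ == "__main__":
    camera = (0.0, 0.0, 0.0)
    print(distance_change(camera, (0.0, 0.0, 4.0), (0.0, 0.0, 2.0)))
```

The same comparison works when the camera moves instead of the object: pass the object's position expressed relative to each frame's camera pose.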

Jiafei Duan

@djiafei.bsky.social

2/🧵SAT introduces 218K question-answer pairs for 22K synthetic scenes created using a photorealistic physics engine. It goes beyond static benchmarks to tackle dynamic reasoning tasks like egocentric actions, object movement, & perspective-taking. 🔍

December 11, 2024 at 4:12 PM

Jiafei Duan

@djiafei.bsky.social

1/🧵 Why does spatial reasoning matter? 🌎 Cognitive science shows spatial reasoning is foundational to intelligence, impacting geometry, physics, and physical world reasoning. Yet, MLMs struggle with it, especially in dynamic real-world scenarios. Enter SAT! ⚒️

December 11, 2024 at 4:12 PM

Jiafei Duan

@djiafei.bsky.social

A scene from ManiSkill.
Prompt: Move the mobile robot to the table and place the red bowl onto the table.

December 10, 2024 at 6:51 PM

Jiafei Duan

@djiafei.bsky.social

I think text-2-video is not that bad; at least we see some good robot motion for humanoids, probably because the models were trained on a lot of human video. What is not good is image-2-video generation.

December 10, 2024 at 6:50 PM
