Jiafei Duan
djiafei.bsky.social
Robotics PhD student @uwcse|Graduate Student Researcher @allen_ai |Ex-@NVIDIA |@ASTARsg scholars|BEng from @ntueee. Research in robot learning and embodied AI

www.duanjiafei.com
10/🧵Curious for more? Check out our paper for the full breakdown: "SAT: Spatial Aptitude Training for Multimodal Language Models" by @ARRay693 @ehsanik @anikembhavi @rosemhendrix @RanjayKrishna @KuoHaoZeng @kate_saenko_ @drbashkirova et al.
December 11, 2024 at 4:12 PM
9/🧵The takeaway: Dynamic spatial QAs improve static QA performance too!
Mixing static & dynamic training data results in significant accuracy gains across all tasks. 📊
December 11, 2024 at 4:12 PM
8/🧵Challenges MLMs face:
Even strong models perform near-randomly on SAT's dynamic tasks.
Egocentric movement and multiview reasoning remain tough nuts to crack.
December 11, 2024 at 4:12 PM
7/🧵SAT enables five complex spatial tasks:
Egocentric Movement
Object Movement
Allocentric Perspective
Goal Aiming
Action Consequence
Each task tests unique dimensions of spatial cognition.🧠
December 11, 2024 at 4:12 PM
6/🧵How does SAT generate data?
Uses ProcTHOR for 3D scenes.
Procedurally generates static & dynamic QAs.
Scalable, cost-effective, & adaptable for new tasks. 🏠
December 11, 2024 at 4:12 PM
5/🧵Here's the kicker: Fine-tuning on SAT makes the open-source LLaVA-13B model match or surpass proprietary giants like GPT4-V in spatial reasoning! 🎯
December 11, 2024 at 4:12 PM
4/🧵 Results? SAT improves performance not only on its own dataset but also boosts zero-shot spatial reasoning:
+23% on CVBench
+9% on BLINK (harder benchmarks)
+18% on Visual Spatial Relations (VSR) dataset. 💪
December 11, 2024 at 4:12 PM
3/🧵Example tasks SAT tackles:
Static: Is object X to the left of object Y?
Dynamic: How did the camera move between frames? Did the object get closer or further?
Perspective: What does object placement look like from point X?
December 11, 2024 at 4:12 PM
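To make the static and dynamic question types above concrete, here is a minimal sketch of how such QA pairs could be derived from known object positions in a simulated scene. It is not SAT's actual pipeline: the `Obj` class, the camera-frame convention (+x right, +z forward), and both function names are assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Obj:
    """Hypothetical object record in camera coordinates (assumed convention)."""
    name: str
    x: float  # +x points to the camera's right
    z: float  # +z points forward, away from the camera


def static_left_of_qa(a: Obj, b: Obj) -> tuple[str, str]:
    """Static QA: compare horizontal positions in a single frame."""
    question = f"Is the {a.name} to the left of the {b.name}?"
    answer = "yes" if a.x < b.x else "no"
    return question, answer


def dynamic_distance_qa(before: Obj, after: Obj) -> tuple[str, str]:
    """Dynamic QA: compare the object's distance to the camera across two frames."""
    d_before = math.hypot(before.x, before.z)
    d_after = math.hypot(after.x, after.z)
    question = f"Did the {before.name} get closer or further?"
    answer = "closer" if d_after < d_before else "further"
    return question, answer


# Example usage with made-up coordinates.
mug = Obj("mug", x=-0.5, z=2.0)
lamp = Obj("lamp", x=0.4, z=2.0)
print(static_left_of_qa(mug, lamp))

mug_later = Obj("mug", x=0.0, z=1.0)
print(dynamic_distance_qa(Obj("mug", x=0.0, z=3.0), mug_later))
```

Because the answers come straight from simulator state, no human labeling is needed, which is the kind of property that makes procedural generation cheap to scale.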
2/🧵SAT introduces 218K question-answer pairs for 22K synthetic scenes created using a photorealistic physics engine. It goes beyond static benchmarks to tackle dynamic reasoning tasks like egocentric actions, object movement, & perspective-taking. 🔍
December 11, 2024 at 4:12 PM
1/🧵 Why does spatial reasoning matter? 🌎 Cognitive science shows spatial reasoning is foundational to intelligence, impacting geometry, physics, and physical world reasoning. Yet, MLMs struggle with it, especially in dynamic real-world scenarios. Enter SAT! ⚒️
December 11, 2024 at 4:12 PM
A scene from ManiSkill.
Prompt: Move the mobile robot to the table and place the red bowl onto the table.
December 10, 2024 at 6:51 PM
I think text-2-video is not that bad; at least we see some good robot motion for humanoids, probably because the models were trained on a lot of human video. What is not good is image-2-video generation.
December 10, 2024 at 6:50 PM