🖥️ But what about desktop software, where most real work happens?
UI-Vision fills this gap by providing a large-scale benchmark with diverse and dense annotations to systematically evaluate GUI agents.
📄 Paper: arxiv.org/abs/2503.15661
🌐 Website: uivision.github.io
🧵 Key takeaways 👇
✅ 83 open-source desktop apps across 6 domains
✅ 450 human demonstrations of computer-use workflows
✅ Dense, human-annotated bounding boxes for UI elements, plus rich action trajectories (see the sketch below)
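To make the annotation format concrete, here is a minimal sketch of what one annotated demonstration might look like. All field names and values are illustrative assumptions, not UI-Vision's actual release schema:

```python
# Hypothetical shape of one annotated demonstration. Field names and
# values are illustrative assumptions, not the benchmark's real format.
sample = {
    "app": "desktop_app_name",                # one of the 83 open-source apps
    "screenshot": "frames/000.png",
    "elements": [                             # dense bounding boxes (x1, y1, x2, y2)
        {"bbox": [12, 48, 96, 72], "label": "File menu"},
        {"bbox": [12, 76, 96, 100], "label": "Edit menu"},
    ],
    "trajectory": [                           # one human-demonstrated workflow
        {"action": "click", "target": [54, 60]},
        {"action": "type", "text": "report_draft"},
        {"action": "drag", "start": [200, 300], "end": [400, 300]},
    ],
}
```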
UI-Vision evaluates agents on three tasks:
🔹 Element Grounding – Identify a UI element from a text description (a scoring sketch follows this list)
🔹 Layout Grounding – Understand the UI layout structure & group related elements
🔹 Action Prediction – Predict the next action given a goal, past actions & the current screen state
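For intuition on how element grounding is commonly scored: many GUI grounding benchmarks count a prediction as correct when the predicted click point lands inside the ground-truth element's bounding box. A minimal sketch assuming that convention (the data is made up, and whether UI-Vision scores exactly this way is an assumption):

```python
# Point-in-box scoring convention common in GUI grounding benchmarks
# (assumed here for illustration, not taken from the paper).

def point_in_box(point, bbox):
    """True if (x, y) lies inside the (x1, y1, x2, y2) box."""
    x, y = point
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2

predictions  = [(54, 60), (300, 10)]                 # model click points (made up)
ground_truth = [(12, 48, 96, 72), (12, 76, 96, 100)]
hits = sum(point_in_box(p, b) for p, b in zip(predictions, ground_truth))
print(f"accuracy = {hits / len(predictions):.1%}")   # -> accuracy = 50.0%
```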
Element Grounding results:
🤖 Even top VLMs struggle with fine-grained GUI grounding.
📊 GUI agents like UI-TARS (25.5%) & UGround (23.2%) do better but still fall short.
⚠️ Small elements, dense UIs, and limited domain/spatial understanding are major hurdles.
Layout Grounding results:
🤖 Even top GUI agents miss functional regions.
🏆 Closed-source VLMs shine with stronger visual understanding.
📉 Cluttered UIs drag IoU down (a worked example follows below).
🚀 We’re the first to propose this task.
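To make the IoU metric concrete, here is a self-contained sketch of the standard intersection-over-union computation. The boxes are made up for illustration; this is the textbook definition, not UI-Vision's evaluation code:

```python
# Standard intersection-over-union for axis-aligned boxes (x1, y1, x2, y2).
# Example boxes are illustrative.

def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle; width/height clamp to 0 when the boxes are disjoint.
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1)
             - inter)
    return inter / union if union else 0.0

# A prediction that drifts onto a neighboring region in a cluttered UI
# scores low even when it looks "close":
print(round(iou((100, 40, 300, 80), (180, 40, 380, 80)), 3))  # 0.429
```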
Action Prediction results:
🖱️ Models struggle with click & drag actions due to poor grounding and limited motion understanding.
🏆 UI-TARS leads across all models!
🧠 Closed models (GPT-4o, Claude, Gemini) excel at planning but fail to localize.