Lightnews — Scholar-powered news

Xiangru (Edward) Jian

@edwardjian.bsky.social

7 followers 8 following 10 posts

CS PhD student at University of Waterloo. Visiting Researcher at ServiceNow Research. Working on Al and DB.

Posts Replies Media Videos

Xiangru (Edward) Jian

@edwardjian.bsky.social

Huge thanks to my wonderful coauthor
@shravannayak.bsky.social and the amazing team at @servicenowresearch.bsky.social and @mila-quebec.bsky.social.

March 24, 2025 at 5:08 PM

Xiangru (Edward) Jian

@edwardjian.bsky.social

We want UI-Vision to be the go-to benchmark for desktop GUI agents.
📢 Data, benchmarks & code coming soon!
💡Next: scaling training data & models for long-horizon tasks.
Let’s build, benchmark & push GUI agents forward 🚀

March 24, 2025 at 5:08 PM

Xiangru (Edward) Jian

@edwardjian.bsky.social

Interacting with desktop GUIs remains a challenge.
🖱️ Models struggle with click & drag actions due to poor grounding and limited motion understanding
🏆 UI-TARS leads across models!
🧠 Closed models (GPT-4o, Claude, Gemini) excel at planning but fail to localize.

March 24, 2025 at 5:08 PM

Xiangru (Edward) Jian

@edwardjian.bsky.social

Detecting functional UI regions is tough!
🤖 Even top GUI agents miss functional regions.
🏆 Closed-source VLMs shine with stronger visual understanding.
📉 Cluttered UIs bring down IoU.
🚀 We’re the first to propose this task.

March 24, 2025 at 5:08 PM

Xiangru (Edward) Jian

@edwardjian.bsky.social

Grounding UI elements is challenging!
🤖 Even top VLMs struggle with fine-grained GUI grounding.
📊 GUI agents like UI-TARS (25.5%) & UGround (23.2%) do better but still fall short.
⚠️ Small elements, dense UIs, and limited domain/spatial understanding are major hurdles.

March 24, 2025 at 5:08 PM

Xiangru (Edward) Jian

@edwardjian.bsky.social

We propose three key benchmark tasks to evaluate GUI Agents
🔹 Element Grounding – Identify a UI element from the text
🔹 Layout Grounding – Understand UI layout structure & group elements
🔹 Action Prediction – Predict the next action given a goal, past actions & screen state

March 24, 2025 at 5:08 PM

Xiangru (Edward) Jian

@edwardjian.bsky.social

UI-Vision consists of
✅ 83 open-source desktop apps across 6 domains
✅ 450 human demonstrations of computer-use workflows
✅ Human annotated dense bounding box annotations for UI elements and rich action trajectories

March 24, 2025 at 5:08 PM

Xiangru (Edward) Jian

@edwardjian.bsky.social

Most GUI benchmarks focus on web or mobile.
🖥️ But what about desktop software, where most real work happens?
UI-Vision fills this gap by providing a large-scale benchmark with diverse and dense annotations to systematically evaluate GUI agents.

March 24, 2025 at 5:08 PM

Add to Home Screen

Light up
your news

Add to Home Screen

Light upyour news

Sign in to Lightnews

Sign up to start reading

Connect Bluesky

Connect with Bluesky

Light up
your news