Trust me, this is the last time we're making a benchmark without a hidden test set.
Muhammad Sohail Danish, Muhammad Akhtar Munir, Syed Roshaan Ali Shah, Kartik Kuckreja, Fahad Khan, Paolo Fraccaro, Alexandre Lacoste, Salman Khan
Follow for updates!
#ICCV2025 #VLMs #AI4EO #RemoteSensing #GeospatialAI #MachineLearning #Benchmarking
📄 Paper: arxiv.org/pdf/2411.19325
🌐 Website: the-ai-alliance.github.io/GEO-Bench-VLM
💻 Code: github.com/The-AI-Allia...
📦 Dataset: huggingface.co/datasets/aia...
* GPT-4o crushes it on classification
* LLaVA-OneVision is best at counting
* EarthDial leads in event detection
BUT…
❌ Most VLMs fail on:
* Temporal reasoning
* Non-optical imagery
* Dense object scenes
→ A task-diverse benchmark for geospatial VLM performance
→ 31 fine-grained tasks
→ 8 categories:
scene understanding, classification, localization, counting, events, captions, segmentation, and more!
VLMs like GPT-4o & LLaVA have wowed us on general vision tasks.
But how do they perform on geospatial challenges like satellite imagery, temporal reasoning, or dense object scenes?
Turns out… we didn’t really know. Until now.
Consider submitting your work! 🔗https://realm-workshop.github.io
Organizers:
@shikharmurty.bsky.social @ehsk0.bsky.social @xhluca.bsky.social @alex-lacoste.bsky.social @hanna-nlp.bsky.social @gneubig.bsky.social
🏆 Claude-3.5-Sonnet is insanely good on WorkArena L2
🪨 WorkArena L3 is insanely hard
🤖 o1-mini is quite good across many benchmarks
💲 o1 is very expensive :)
See the leaderboard:
huggingface.co/spaces/Servi...
📃https://arxiv.org/abs/2412.05467
Or our open-source tools:
🤖https://github.com/ServiceNow/AgentLab
💪https://github.com/ServiceNow/BrowserGym
🎯https://github.com/ServiceNow/WorkArena
github.com/ServiceNow/B...
🚀 Easy large-scale parallel agent experiments
🔧 Building blocks for crafting agents over BrowserGym
🤖 Unified LLM API for seamless integration
🔁 Reproducibility features for consistent results
🏆 Unified Leaderboard across multiple benchmarks