Alexandre Lacoste
alex-lacoste.bsky.social
MegaSenior Research Scientist at ServiceNow Research, formerly at Google. WebAgents, Remote Sensing, Climate Change. Opinions are my own.
What is your guess? Why is GPT-5 shining so much on WorkArena in contrast to other benchmarks?

Trust me, this is the last time we're making a benchmark without a hidden test set.
August 21, 2025 at 6:23 PM
🙌 Huge thanks to the team:
Muhammad Sohail Danish, Muhammad Akhtar Munir, Syed Roshaan Ali Shah, Kartik Kuckreja, Fahad Khan, Paolo Fraccaro, Alexandre Lacoste, Salman Khan

Follow for updates!
#ICCV2025 #VLMs #AI4EO #RemoteSensing #GeospatialAI #MachineLearning #Benchmarking
July 2, 2025 at 12:47 PM
🔍 What we found:
* GPT-4o crushes it on classification
* LLaVA-OneVision is best at counting
* EarthDial leads in event detection

BUT…
❌ Most VLMs fail on:
* Temporal reasoning
* Non-optical imagery
* Dense object scenes
July 2, 2025 at 12:47 PM
🧪 We built GEOBench-VLM
→ A task-diverse benchmark for geospatial VLM performance
→ 31 fine-grained tasks
→ 8 categories:
scene understanding, classification, localization, counting, events, captions, segmentation, and more!
July 2, 2025 at 12:47 PM
🌍 Why GEOBench-VLM?
VLMs like GPT-4o & LLaVA have wowed us on general vision tasks.
But how do they perform on geospatial challenges like satellite imagery, temporal reasoning, or dense object scenes?
Turns out… we didn’t really know. Until now.
July 2, 2025 at 12:47 PM
Reposted by Alexandre Lacoste
Got ideas to share and want to learn about the latest progress?

Consider submitting your work! 🔗 https://realm-workshop.github.io

Organizers:
@shikharmurty.bsky.social @ehsk0.bsky.social @xhluca.bsky.social @alex-lacoste.bsky.social @hanna-nlp.bsky.social @gneubig.bsky.social
January 23, 2025 at 2:29 PM
Notable findings:
🏆 Claude-3.5-Sonnet is insanely good on WorkArena L2
🪨 WorkArena L3 is insanely hard
🤖 o1-mini is quite good across many benchmarks
💲 o1 is very expensive :)

See the leaderboard:
huggingface.co/spaces/Servi...
December 12, 2024 at 5:55 PM
Read our paper:
📃 https://arxiv.org/abs/2412.05467
Or try our open-source tools:
🤖 https://github.com/ServiceNow/AgentLab
💪 https://github.com/ServiceNow/BrowserGym
🎯 https://github.com/ServiceNow/WorkArena
December 12, 2024 at 5:55 PM
🔍 Analyse your agent's behavior using AgentLab-XRay, a custom UI allowing you to navigate all your experiments.
December 3, 2024 at 9:02 PM
Seamless integration with 10 different web agent benchmarks, provided by BrowserGym:
github.com/ServiceNow/B...
December 3, 2024 at 9:02 PM
AgentLab: github.com/ServiceNow/AgentLab/
🚀 Easy large-scale parallel agent experiments
🔧 Building blocks for crafting agents over BrowserGym
🤖 Unified LLM API for seamless integration
🔁 Reproducibility features for consistent results
🏆 Unified Leaderboard across multiple benchmarks
December 3, 2024 at 9:02 PM