Lightnews — Scholar-powered news

Alexandre Lacoste

@alex-lacoste.bsky.social

130 followers 180 following 20 posts

MegaSenior Research Scientist at ServiceNow Research, Former Google. WebAgents, Remote Sensing, Climate Change, Opinions are my own

Alexandre Lacoste

@alex-lacoste.bsky.social

What is your guess? Why is GPT-5 shining so much on WorkArena in contrast to other benchmarks?

Trust me, this is the last time we're making a benchmark without a hidden test set.

August 21, 2025 at 6:23 PM

Alexandre Lacoste

@alex-lacoste.bsky.social

🙌 Huge thanks to the team:
Muhammad Sohail Danish, Muhammad Akhtar Munir, Syed Roshaan Ali Shah, Kartik Kuckreja, Fahad Khan, Paolo Fraccaro, Alexandre Lacoste, Salman Khan

Follow for updates!
#ICCV2025 #VLMs #AI4EO #RemoteSensing #GeospatialAI #MachineLearning #Benchmarking

July 2, 2025 at 12:47 PM

Alexandre Lacoste

@alex-lacoste.bsky.social

📎 Resources

📄 Paper: arxiv.org/pdf/2411.19325

🌐 Website: the-ai-alliance.github.io/GEO-Bench-VLM

💻 Code: github.com/The-AI-Allia...

📦 Dataset: huggingface.co/datasets/aia...

July 2, 2025 at 12:47 PM

Alexandre Lacoste

@alex-lacoste.bsky.social

🔍 What we found:
* GPT-4o crushes it on classification
* LLaVA-OneVision is best at counting
* EarthDial leads in event detection

BUT…
❌ Most VLMs fail on:
* Temporal reasoning
* Non-optical imagery
* Dense object scenes

July 2, 2025 at 12:47 PM

Alexandre Lacoste

@alex-lacoste.bsky.social

🧪 We built GEOBench-VLM
→ A task-diverse benchmark for geospatial VLM performance
→ 31 fine-grained tasks
→ 8 categories:
scene understanding, classification, localization, counting, events, captions, segmentation, and more!

July 2, 2025 at 12:47 PM

Alexandre Lacoste

@alex-lacoste.bsky.social

🌍 Why GEOBench-VLM?
VLMs like GPT-4o & LLaVA have wowed us on general vision tasks.
But how do they perform on geospatial challenges like satellite imagery, temporal reasoning, or dense object scenes?
Turns out… we didn’t really know. Until now.

July 2, 2025 at 12:47 PM

Reposted by Alexandre Lacoste

Nouha Dziri

@nouhadziri.bsky.social

Got ideas to share and want to learn about the latest progress?

Consider submitting your work! 🔗https://realm-workshop.github.io

Organizers:
@shikharmurty.bsky.social @ehsk0.bsky.social @xhluca.bsky.social @alex-lacoste.bsky.social @hanna-nlp.bsky.social @gneubig.bsky.social

January 23, 2025 at 2:29 PM

Alexandre Lacoste

@alex-lacoste.bsky.social

Notable findings:
🏆 Claude-3.5-Sonnet is insanely good on WorkArena L2
🪨 WorkArena L3 is insanely hard
🤖 o1-mini is quite good across many benchmarks
💲 o1 is very expensive :)

See the leaderboard:
huggingface.co/spaces/Servi...

December 12, 2024 at 5:55 PM

Alexandre Lacoste

@alex-lacoste.bsky.social

Visit our paper
📃https://arxiv.org/abs/2412.05467
Or our open-source tools:
🤖https://github.com/ServiceNow/AgentLab
💪https://github.com/ServiceNow/BrowserGym
🎯https://github.com/ServiceNow/WorkArena

December 12, 2024 at 5:55 PM

Alexandre Lacoste

@alex-lacoste.bsky.social

🔍 Analyse your agent's behavior using AgentLab-XRay, a custom UI allowing you to navigate all your experiments.

December 3, 2024 at 9:02 PM

Alexandre Lacoste

@alex-lacoste.bsky.social

Seamless integration with 10 different web agent benchmarks provided by BrowserGym
github.com/ServiceNow/B...

December 3, 2024 at 9:02 PM

Alexandre Lacoste

@alex-lacoste.bsky.social

AgentLab: github.com/ServiceNow/AgentLab/
🚀 Easy large-scale parallel agent experiments
🔧 Building blocks for crafting agents over BrowserGym
🤖 Unified LLM API for seamless integration
🔁 Reproducibility features for consistent results
🏆 Unified Leaderboard across multiple benchmarks

December 3, 2024 at 9:02 PM
