We are thrilled to release #AgentLab, a new open-source package for developing and evaluating web agents. This builds on the new #BrowserGym package which supports 10 different benchmarks, including #WebArena.
🔥While some benchmarks show modest gains, GPT-5 is crushing WorkArena L2🔥
➡️ 69.4% avg success vs. ~40% for next best🤯
➡️ Complex tasks, up to 100 steps, 5–20 min for humans
[#ICCV2025] Our paper "GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks" is accepted at ICCV 2025 in Honolulu, Hawaii! 🌺
Let's dive into what makes it exciting: 🧵
📢We're thrilled to announce REALM: The first Workshop for Research on Agent Language Models 🤖 #ACL2025NLP in Vienna 🎻
We have an exciting lineup of speakers
🗓️ Submit your work by *March 1st*
@aclmeeting.bsky.social
Consider submitting your work! 🔗https://realm-workshop.github.io
Organizers:
@shikharmurty.bsky.social @ehsk0.bsky.social @xhluca.bsky.social @alex-lacoste.bsky.social @hanna-nlp.bsky.social @gneubig.bsky.social
medium.com/@carolynduby...
In this TMLR paper, we take an in-depth look at #BrowserGym and #AgentLab. We also present some unexpected performance results from Claude 3.5 Sonnet.
NeurIPS 2024
with ServiceNow and IMean.ai
as we explore the cutting edge of WebAgent development!
📅 Date: Dec 13th 6:00pm PST
📍 Location: 15-min walk from NeurIPS (see details after RSVP)
🎉 RSVP Here: lu.ma/rw9x9vc6
An open, transparent multimodal dataset designed for:
📄 Documents
🌐 Web content
🖥️ GUI understanding
👨‍💻 Code generation from images
We’re also launching BigDocs-Bench:
➡️ Document, Web, GUI Visual reasoning
➡️ Converting images into JSON, Markdown, LaTeX, SVG, and more!
I am missing a lot, and many are not on bsky yet, so if I missed you or someone you know, please send me a DM with the link to a relevant paper and I will update the starter pack!