tlsdc.bsky.social
@tlsdc.bsky.social
Reposted
Notable findings:
🏆 Claude-3.5-Sonnet is insanely good on WorkArena L2
🪨 WorkArena L3 is insanely hard
🤖 o1-mini is quite good across many benchmarks
💲 o1 is very expensive :)

See the leaderboard:
huggingface.co/spaces/Servi...
December 12, 2024 at 5:55 PM
Reposted
Check out our paper:
📃 https://arxiv.org/abs/2412.05467
Or our open-source tools:
🤖 https://github.com/ServiceNow/AgentLab
💪 https://github.com/ServiceNow/BrowserGym
🎯 https://github.com/ServiceNow/WorkArena
December 12, 2024 at 5:55 PM
Reposted
We’re really excited to release this large collaborative effort to unify web agent benchmarks under one roof.

In this TMLR paper, we take an in-depth look at #BrowserGym and #AgentLab. We also present some unexpected performance results from Claude-3.5-Sonnet.
December 12, 2024 at 5:55 PM
Reposted
🧵-1
We are thrilled to release #AgentLab, a new open-source package for developing and evaluating web agents. It builds on the new #BrowserGym package, which supports 10 different benchmarks, including #WebArena.
December 3, 2024 at 9:02 PM