tlsdc.bsky.social
@tlsdc.bsky.social
Reposted
Notable findings:
🏆Claude-3.5-Sonnet is insanely good on WorkArena L2
🪨 WorkArena L3 is insanely hard
🤖o1-mini is quite good across many benchmarks
💲o1 is very expensive :)

See the leaderboard:
huggingface.co/spaces/Servi...
December 12, 2024 at 5:55 PM
Reposted
Visit our paper
📃https://arxiv.org/abs/2412.05467
Or our open-source tools:
🤖https://github.com/ServiceNow/AgentLab
💪https://github.com/ServiceNow/BrowserGym
🎯https://github.com/ServiceNow/WorkArena
December 12, 2024 at 5:55 PM