Our new paper shows how monitoring and intervention can prevent agents from going rogue, boosting performance by up to 20%. We're also releasing a new multi-agent environment 🕵️♂️
Our new paper shows how monitoring and intervention can prevent agents from going rogue, boosting performance by up to 20%. We're also releasing a new multi-agent environment 🕵️♂️
Preprint: arxiv.org/abs/2410.00752
Leaderboard: testgeneval.github.io/leaderboard....
Preprint: arxiv.org/abs/2410.00752
Leaderboard: testgeneval.github.io/leaderboard....
In this TMLR paper, we dive in-depth into #BrowserGym and #AgentLab. We also present some unexpected performances from Claude 3.5-Sonnet
In this TMLR paper, we dive in-depth into #BrowserGym and #AgentLab. We also present some unexpected performances from Claude 3.5-Sonnet
We are thrilled to release #AgentLab, a new open-source package for developing and evaluating web agents. This builds on the new #BrowserGym package which supports 10 different benchmarks, including #WebArena.
We are thrilled to release #AgentLab, a new open-source package for developing and evaluating web agents. This builds on the new #BrowserGym package which supports 10 different benchmarks, including #WebArena.