Yixiao Song
yixiaosong.bsky.social
Yixiao Song
@yixiaosong.bsky.social
Introducing 🐻 BEARCUBS 🐻, a “small but mighty” dataset of 111 QA pairs designed to assess computer-using web agents in multimodal interactions on the live web!
✅ Humans achieve 85% accuracy
❌ OpenAI Operator: 24%
❌ Anthropic Computer Use: 14%
❌ Convergence AI Proxy: 13%
March 12, 2025 at 2:00 PM