Current agents struggle with:
🚨 Selecting reliable sources
🚨 Escaping dead loops
🚨 Engaging in multimodal interactions
🚨 Navigating the web in real-time
Current agents struggle with:
🚨 Selecting reliable sources
🚨 Escaping dead loops
🚨 Engaging in multimodal interactions
🚨 Navigating the web in real-time
Not great ...
🥴 The best multimodal web agent, OpenAI’s Operator, scores 24.3% accuracy.
🤯 OpenAI’s Deep Research outperforms all (35.1%), without computer-use abilities!
Not great ...
🥴 The best multimodal web agent, OpenAI’s Operator, scores 24.3% accuracy.
🤯 OpenAI’s Deep Research outperforms all (35.1%), without computer-use abilities!
🔹Benchmarks computer-using agents @OpenAI Operator, @AnthropicAI Computer Use, and @convergence_ai_ Proxy
🔹Evaluates complex text-based & multimodal interactions
🔹Will be updated regularly with new questions
📜 arxiv.org/abs/2503.07919
🌐 bear-cubs.github.io/
🔹Benchmarks computer-using agents @OpenAI Operator, @AnthropicAI Computer Use, and @convergence_ai_ Proxy
🔹Evaluates complex text-based & multimodal interactions
🔹Will be updated regularly with new questions
📜 arxiv.org/abs/2503.07919
🌐 bear-cubs.github.io/
✅ Humans achieve 85% accuracy
❌ OpenAI Operator: 24%
❌ Anthropic Computer Use: 14%
❌ Convergence AI Proxy: 13%
✅ Humans achieve 85% accuracy
❌ OpenAI Operator: 24%
❌ Anthropic Computer Use: 14%
❌ Convergence AI Proxy: 13%