Lightnews — Scholar-powered news

Yixiao Song

@yixiaosong.bsky.social

Yixiao Song

@yixiaosong.bsky.social

❌ No agent excels at video, images, or interactive web content.

Current agents struggle with:
🚨 Selecting reliable sources
🚨 Escaping dead loops
🚨 Engaging in multimodal interactions
🚨 Navigating the web in real-time

March 12, 2025 at 2:00 PM

Yixiao Song

@yixiaosong.bsky.social

🐻 BEARCUBS 🐻 questions aren't easy! Humans achieve 84.7% accuracy. How well do web agents perform? 🤔

Not great ...
🥴 The best multimodal web agent, OpenAI’s Operator, scores 24.3% accuracy.
🤯 OpenAI’s Deep Research outperforms all (35.1%), without computer-use abilities!

March 12, 2025 at 2:00 PM

Yixiao Song

@yixiaosong.bsky.social

BEARCUBS 👇
🔹Benchmarks computer-using agents @OpenAI Operator, @AnthropicAI Computer Use, and @convergence_ai_ Proxy
🔹Evaluates complex text-based & multimodal interactions
🔹Will be updated regularly with new questions

📜 arxiv.org/abs/2503.07919
🌐 bear-cubs.github.io/

March 12, 2025 at 2:00 PM

Yixiao Song

@yixiaosong.bsky.social

Introducing 🐻 BEARCUBS 🐻, a “small but mighty” dataset of 111 QA pairs designed to assess computer-using web agents in multimodal interactions on the live web!
✅ Humans achieve 85% accuracy
❌ OpenAI Operator: 24%
❌ Anthropic Computer Use: 14%
❌ Convergence AI Proxy: 13%

March 12, 2025 at 2:00 PM
