Lightnews — Scholar-powered news

Yixiao Song

@yixiaosong.bsky.social

We continuously update BEARCUBS with challenging questions. If you're developing web agents, use BEARCUBS to benchmark their real-world performance! 🚀

Work done with Katherine Thai, @chautmpham.bsky.social, @yapeichang.bsky.social, Mazin Nadaf & @miyyer.bsky.social

🌐 bear-cubs.github.io/

March 12, 2025 at 2:00 PM

Yixiao Song

@yixiaosong.bsky.social

Future computer-use agents should be enhanced with:

💡 Stronger multimodal reasoning (videos, maps, real-time data)
🔍 More reliable source selection
🗺️ Smarter and more efficient search strategies
📜 Transparent and interpretable browsing trajectories

March 12, 2025 at 2:00 PM

Yixiao Song

@yixiaosong.bsky.social

❌ No agent excels at video, images, or interactive web content.

Current agents struggle with:
🚨 Selecting reliable sources
🚨 Escaping dead loops
🚨 Engaging in multimodal interactions
🚨 Navigating the web in real-time

March 12, 2025 at 2:00 PM

Yixiao Song

@yixiaosong.bsky.social

🐻 BEARCUBS 🐻 questions aren't easy! Humans achieve 84.7% accuracy. How well do web agents perform? 🤔

Not great ...
🥴 The best multimodal web agent, OpenAI’s Operator, scores 24.3% accuracy.
🤯 OpenAI’s Deep Research outperforms all (35.1%), without computer-use abilities!

March 12, 2025 at 2:00 PM

Yixiao Song

@yixiaosong.bsky.social

Why a new web agent benchmark? Cuz popular ones👇
1️⃣ Use simulations (e.g., WebArena), missing real-world complexity
2️⃣ Have limited multimodal testing, relying on HTML (Mind2Web) or specific skills (e.g., maps)
3️⃣ Are nearing performance saturation—Operator hits 87% on WebVoyager

March 12, 2025 at 2:00 PM

Yixiao Song

@yixiaosong.bsky.social

BEARCUBS 👇
🔹Benchmarks computer-using agents @OpenAI Operator, @AnthropicAI Computer Use, and @convergence_ai_ Proxy
🔹Evaluates complex text-based & multimodal interactions
🔹Will be updated regularly with new questions

📜 arxiv.org/abs/2503.07919
🌐 bear-cubs.github.io/

March 12, 2025 at 2:00 PM
