jatinganhotra.dev swebencharena.com
Top agents: 73% → 11%
This isn't about making things harder - it's about measuring what matters 🎯
jatinganhotra.dev/blog/swe-age...
Claude 4 Opus on full benchmark: 73.2% ✅
Claude 4 Opus on Frontier subset: 11.6% 😬
This isn't just harder - it's revealing what agents ACTUALLY can't do
Each subset targets different evaluation needs - from maximum sensitivity (Frontier) to real-world complexity (MultiFile)
Performance drops from 73% to as low as 10%!
The distribution is shocking:
- 52 problems: ZERO agents can solve
- 26 problems: Only 1-2 agents succeed
- 156 problems: 61+ agents solve easily
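
A breakdown like this can be reproduced from per-agent results with a few lines of Python. This is only a rough sketch: the function names, the `results` layout (agent name mapped to the set of problem IDs it solved) and the toy data are assumptions for illustration, not the actual analysis code. The default thresholds (1-2 solvers, 61+ solvers) come from the numbers above.

```python
from collections import Counter


def count_solvers(results, problems):
    """Count how many agents solved each problem.

    results:  dict mapping agent name -> set of solved problem IDs (assumed layout)
    problems: list of all problem IDs in the benchmark
    """
    counts = {p: 0 for p in problems}
    for solved in results.values():
        for p in solved:
            if p in counts:
                counts[p] += 1
    return counts


def bucket_problems(counts, rare_max=2, easy_min=61):
    """Group problems by solver count: unsolved, rare, easy, or other."""
    buckets = Counter()
    for n in counts.values():
        if n == 0:
            buckets["unsolved"] += 1   # zero agents solve it
        elif n <= rare_max:
            buckets["rare"] += 1       # only 1..rare_max agents solve it
        elif n >= easy_min:
            buckets["easy"] += 1       # easy_min or more agents solve it
        else:
            buckets["other"] += 1      # everything in between
    return dict(buckets)


# Toy data (hypothetical): 3 agents, 3 problems, thresholds scaled down to match.
toy_results = {"agent_a": {"p1", "p2"}, "agent_b": {"p1"}, "agent_c": {"p1"}}
toy_counts = count_solvers(toy_results, ["p1", "p2", "p3"])
toy_buckets = bucket_problems(toy_counts, rare_max=1, easy_min=3)
print(toy_buckets)
```

On real leaderboard data the defaults would apply as-is; the toy run lowers them only because it has three agents.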