Lightnews — Scholar-powered news

Jatin Ganhotra

@jatinganhotra.dev

100 followers 420 following 52 posts

Researcher at @IBMResearch #NLProc #ConvAI #Agents #SWE-Agent || RTs ≠ endorsements. Views personal, not of employers/institutions.

jatinganhotra.dev swebencharena.com

Jatin Ganhotra

@jatinganhotra.dev

Fascinating finding: when you remove the 156 problems that 61+ of the 83 agents solve, performance drops dramatically.

Top agents: 73% → 11%

This isn't about making things harder - it's about measuring what matters 🎯

jatinganhotra.dev/blog/swe-age...

June 17, 2025 at 7:47 PM
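
A minimal sketch of the comparison described in this post: drop the problems that most agents solve, then recompute one agent's resolve rate on what remains. The function names, the toy data, and the threshold of 3 are hypothetical and only illustrate the idea; the post's own cutoff is 61+ agents, and the actual analysis may differ.

```python
def resolve_rate(solved, problem_ids):
    """Fraction of problem_ids that appear in the agent's solved set."""
    if not problem_ids:
        return 0.0
    return sum(1 for pid in problem_ids if pid in solved) / len(problem_ids)


def drop_easy(problem_ids, solve_counts, easy_threshold):
    """Keep only problems solved by fewer than easy_threshold agents."""
    return [pid for pid in problem_ids if solve_counts.get(pid, 0) < easy_threshold]


# Toy data (hypothetical): five problems and how many agents solved each.
problems = ["p1", "p2", "p3", "p4", "p5"]
solve_counts = {"p1": 3, "p2": 3, "p3": 3, "p4": 1, "p5": 0}
agent_solved = {"p1", "p2", "p3", "p4"}  # problems one agent resolved

full_rate = resolve_rate(agent_solved, problems)
hard_subset = drop_easy(problems, solve_counts, easy_threshold=3)
hard_rate = resolve_rate(agent_solved, hard_subset)

print(f"full: {full_rate:.0%}, without easy problems: {hard_rate:.0%}")
# prints "full: 80%, without easy problems: 50%"
```

The same agent looks much weaker once the commonly solved problems are gone, which is the effect the post reports at benchmark scale.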

Jatin Ganhotra

@jatinganhotra.dev

5/ The results are eye-opening:

Claude 4 Opus on full benchmark: 73.2% ✅
Claude 4 Opus on Frontier subset: 11.6% 😬

This isn't just harder - it's revealing what agents ACTUALLY can't do

June 6, 2025 at 8:05 PM

Jatin Ganhotra

@jatinganhotra.dev

4/ Solution: 4 targeted subsets that reveal true agent capabilities

Each subset targets different evaluation needs - from maximum sensitivity (Frontier) to real-world complexity (MultiFile)

Performance drops from 73% to as low as 10%!

June 6, 2025 at 8:05 PM

Jatin Ganhotra

@jatinganhotra.dev

3/ I analyzed all 500 problems against 83 different SWE-agents

The distribution is shocking:
- 52 problems: ZERO agents can solve
- 26 problems: Only 1-2 agents succeed
- 156 problems: 61+ agents solve easily

June 6, 2025 at 8:05 PM
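
The distribution above can be sketched as a count of how many agents resolve each problem, followed by bucketing. This is an illustrative sketch, not the author's code: the input layout (agent name mapped to a set of resolved problem IDs), the bucket names, and the toy threshold of 3 are assumptions. The post uses 1-2 agents for the rare bucket and 61+ agents for the easy one.

```python
from collections import Counter


def count_solvers(results):
    """results maps agent name -> set of problem IDs that agent resolved."""
    counts = Counter()
    for solved in results.values():
        counts.update(solved)
    return counts


def bucket_problems(problem_ids, counts, easy_threshold=61, rare_max=2):
    """Group problems by how many agents resolved them."""
    buckets = {"unsolved": [], "rare": [], "middle": [], "easy": []}
    for pid in problem_ids:
        n = counts.get(pid, 0)
        if n == 0:
            buckets["unsolved"].append(pid)
        elif n >= easy_threshold:
            buckets["easy"].append(pid)
        elif n <= rare_max:
            buckets["rare"].append(pid)
        else:
            buckets["middle"].append(pid)
    return buckets


# Toy data (hypothetical): three agents, four problems.
results = {
    "agent_a": {"p1", "p2"},
    "agent_b": {"p1", "p3"},
    "agent_c": {"p1"},
}
problems = ["p1", "p2", "p3", "p4"]

# easy_threshold=3 only because the toy data has three agents.
buckets = bucket_problems(problems, count_solvers(results), easy_threshold=3)
for name, ids in buckets.items():
    print(name, len(ids))
```

With real evaluation results, the sizes of the `unsolved`, `rare`, and `easy` buckets would correspond to the three counts listed in the post.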
