Join researchers and developers already evaluating patches → swebencharena.com
#AI #SoftwareEngineering #CodeQuality #AIEvaluation #SWEBenchArena
🚫 PR Arena: Tracks merge rates, not code quality
🚫 Yupp AI: Known models, not blind
🚫 SWE Arena: General coding, not SWE tasks
✅ SWE-Bench-Arena: Blind quality evaluation of real bug fixes
• Simplicity
• Readability
• Performance
• Maintainability
• Correctness
No bias. Just quality assessment.
What quality issues have you noticed with AI-generated code?
#AIEvaluation #SWEBenchArena #CodeQuality #AI #SoftwareEngineering
🎓 AI researchers
👩‍💻 Professional developers
📚 Academic teams
🚀 Startup engineers
Your input shapes the future of AI code evaluation standards.
• Real GitHub issues from actual projects
• Side-by-side patch comparison
• Blind evaluation (you don't know which is AI vs human)
• Multi-dimensional quality assessment
Early results are fascinating - some AI solutions are surprisingly elegant, others create hidden technical debt 📊
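For intuition, here's a rough sketch of what one blind, multi-dimensional comparison could look like as a data record. The field names and aggregation are hypothetical, not the platform's actual schema:

from dataclasses import dataclass, field

DIMENSIONS = ["simplicity", "readability", "performance", "maintainability", "correctness"]

@dataclass
class BlindComparison:
    issue_url: str        # the real GitHub issue being fixed
    patch_a: str          # shown only as "A"; whether it's AI or human stays hidden
    patch_b: str          # shown only as "B"
    votes: dict = field(default_factory=dict)  # dimension -> "A", "B", or "tie"

    def winner(self) -> str:
        # Naive aggregate: whichever patch wins more quality dimensions.
        a = sum(v == "A" for v in self.votes.values())
        b = sum(v == "B" for v in self.votes.values())
        return "A" if a > b else "B" if b > a else "tie"

The point of the blind setup is that sources are revealed only after voting, so the aggregate reflects quality rather than brand.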
Instead of just "does it work?", we ask:
✅ Is it maintainable?
✅ Will teams understand it?
✅ Does it follow best practices?
✅ Is it unnecessarily complex?
🔗 jatinganhotra.dev/blog/swe-age...
📊 Includes full leaderboard analysis, complexity breakdown, and takeaways.
RT if you're building SWE agents — or trying to understand their real limits.
It's a wake-up call.
🧠 Current AI systems cannot effectively combine visual + structural code understanding.
And that's a serious problem for real-world software workflows.
Multimodal tasks often require multi-file edits and focus on JavaScript-based, user-facing applications rather than Python backends.
The combination of visual reasoning + frontend complexity is devastating.
📸 90.6% of instances in SWE-bench Multimodal contain visual content.
When images are present, solve rates drop from ~100% to ~25% across all top-performing agents.
But when the same models are tested on SWE-bench Multimodal, scores fall to ~30%.
Dataset available now:
🤗 huggingface.co/datasets/jatinganhotra/SWE-bench_Verified-discriminative
Stop optimizing for saturated benchmarks. Start measuring real progress.
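If you want to explore it yourself, a minimal loading sketch with the Hugging Face datasets library (subset/config names may differ; check the dataset card):

from datasets import load_dataset

# Pulls the discriminative SWE-bench Verified subsets from the Hugging Face Hub.
ds = load_dataset("jatinganhotra/SWE-bench_Verified-discriminative")
print(ds)  # inspect the available splits and instance fields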
Claude 4 Opus on full benchmark: 73.2% ✅
Claude 4 Opus on Frontier subset: 11.6% 😬
This isn't just harder - it's revealing what agents ACTUALLY can't do
Each subset targets different evaluation needs - from maximum sensitivity (Frontier) to real-world complexity (MultiFile)
Performance drops from 73% to as low as 10%!
The distribution is shocking:
- 52 problems: ZERO agents can solve
- 26 problems: Only 1-2 agents succeed
- 156 problems: 61+ agents solve easily
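As a toy illustration (hypothetical data layout, not the released code), here's how you could bucket problems by how many agents solve them:

# results: problem_id -> set of agents that resolved it (illustrative data only)
results = {
    "repo-a__issue-101": set(),                                # no agent solves it
    "repo-b__issue-202": {"agent_1"},                          # only one agent succeeds
    "repo-c__issue-303": {f"agent_{i}" for i in range(70)},    # nearly everyone solves it
}

unsolved  = [p for p, agents in results.items() if not agents]
frontier  = [p for p, agents in results.items() if 1 <= len(agents) <= 2]
saturated = [p for p, agents in results.items() if len(agents) >= 61]

Problems solved by 61+ agents carry almost no ranking signal; the unsolved and frontier buckets are where leaderboards can still be told apart.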
When everyone gets the same questions right, you can't tell who's actually better @anthropic.com
It's like ranking students when everyone scores 95%+ on the easy questions