jatinganhotra.dev swebencharena.com
🍰Reinforcement Learning environments for LLMs
🐎Speculative and non-autoregressive generation for LLMs
interested/curious? DM or email ramon.astudillo@ibm.com
Claude 4 Sonnet hits 72.7% on SWE-Bench, but industry data shows code clones rose 48% (8.3% to 12.3%) and refactoring rates dropped from 25% to 10% since AI adoption.
(GitClear: gitclear.com/ai_assistant_code_quality_2025_research)
Recent data shows concerning trends since AI adoption:
• 48% increase in code cloning
• Refactoring dropped from 25% to 10%
• Developers report "missing context" as #1 issue
Are we optimizing for the wrong metrics? 🧵
AI agents collapse under visual complexity.
A 73.2% performance drop when images are introduced in SWE-bench Multimodal.
Here's why this matters — and what it tells us about the future of AI in software engineering:
🧵👇
SWE-Bench Verified shows 73% success rates, but focusing on discriminative subsets reveals a different story: 11%
What really challenges AI agents? Analysis: jatinganhotra.dev/blog/swe-age...
Top agents: 73% → 11%
This isn't about making things harder - it's about measuring what matters 🎯
jatinganhotra.dev/blog/swe-age...
SWE-Bench Verified has driven amazing progress, but with most agents solving the same 350+ problems, we need new targets @ofirpress.bsky.social
Enter: discriminative subsets that highlight genuine challenges 🧵
I've compiled all my analysis here: 🧵
SWE-Agents can now generate code to resolve GitHub issues—but how do we ensure these fixes are robust and reliable? Introducing **Otter & Otter++**, two innovative test generation approaches leveraging LLMs with self-reflective action planning. 🦦🔍
Do you work on RAG? Are you interested in Multi-Turn conversations? Very excited to share the new MTRAG benchmark we have released!
Data: github.com/ibm/mt-rag-b...
Paper: arxiv.org/abs/2501.03468
jatinganhotra.github.io/blog/swe-age...