#swebench
Can you please point to any examples of such benchmarks that satisfy your criteria?

I think SWEBench comes close for Software engineering, but I don't know much about other areas.
December 15, 2024 at 11:24 AM
@swyx https://x.com/swyx/status/1911844000858079716 #x-swyx

why does 4.1 do so well on SWEbench but not on GPQA?

I think there are some interesting insights on reasoning models to be had by diffing through
- where GPT4.1 is near-o1-level (but 7.5x cheaper)
- ...
April 14, 2025 at 6:30 PM
Claude Opus 4.5 just crushed the SWE‑bench, topping 7 of 8 languages and beating Sonnet 4.5 by 15%. From Java to Python, it’s the new multilingual coding champ. Dive into the details! #ClaudeOpus45 #SWEbench #AIcoding

🔗 aidailypost.com/news/claude-...
November 26, 2025 at 8:53 AM
💡 Summary by GPT3:

Aide is an open-source, AI-native IDE driven by an agentic framework that runs on swebench-lite. The IDE enables AI-assisted code editing with features such as proactive fixes, developer control, quick invocation, deep reasoning, and fast edits. It is available for download on Windows, macOS, and Linux, and users can give the developers feedback to shape the product's future.
November 10, 2024 at 1:43 PM
SOTA on swebench-verified: relearning the bitter lesson
https://aide.dev/blog/sota-bitter-lesson
January 8, 2025 at 11:35 PM
o3 evals on FrontierMath being done pass@1 is quite honestly the most disturbing aspect of this. Why aren't more people talking about it? And also that we're likely close to being able to have end-to-end AI-driven AI research, given that SWEBench score??
December 21, 2024 at 1:08 AM
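For context on what pass@1 means in the post above: the standard unbiased pass@k estimator (from the HumanEval paper) can be sketched as below; pass@1 reduces to the empirical pass rate c/n.

```python
# Unbiased pass@k estimator (Chen et al., 2021):
# n samples drawn, c of them pass, k is the sampling budget.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(10, 3, 1), 4))  # 0.3, i.e. the empirical rate c/n
```

With k=1 the formula collapses to 1 − (n−c)/n = c/n, which is why pass@1 is the strictest (and cheapest) setting to report.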
How do AI software engineering agents work?🤔🤖

Find the answer, along with valuable insights from the creators of SWE-bench & SWE-agent, in this article⬇️

newsletter.pragmaticengineer.com/p/ai-coding-...

Great read! 👏 @gergely.pragmaticengineer.com @hejelin.bsky.social

#AI #SWEbench #SWEagent
How do AI software engineering agents work?
Coding agents are the latest promising Artificial Intelligence (AI) tool, and an impressive step up from LLMs. This article is a deep dive into them, with the creators of SWE-bench and SWE-agent.
newsletter.pragmaticengineer.com
August 6, 2024 at 9:58 AM
Google launched Gemini 3 Flash - Pro-level intelligence at Flash speed for $0.50/1M tokens

78% on SWE-bench (beating Gemini 3 Pro!), handles 100 function calls simultaneously, 99.7% on AIME 2025

and it's stronger than 3 Pro at SWE-bench!?
x.com/OfficialLog...
December 18, 2025 at 2:40 AM
🆕 AWS updates Amazon Q Developer's agent, excelling on SWTBench and SWEBench. It speeds up development, provides reliable suggestions, and cuts down on debugging, letting developers innovate. Available in all AWS Regions, access via '/dev' in Visual Studio Code or JetBrains IDE.

#AWS
Amazon Q Developer releases state of the art agent for feature development
Today, AWS announces the update of Amazon Q Developer's software development agent. This new agent achieves state-of-the-art performance on industry benchmark SWTBench Verified (49%) and sits among the top ranking models on SWEBench Verified (66%). The agent has access to tools for planning and reasoning that use the capacity of advanced models to their fullest. By running in a dedicated environment with built-in access to all the functionalities of a modern IDE, the agent is now able to generate multiple candidate solutions for a given problem, select the most promising one, and return higher quality code to the developer.

With this new agent, developers can further accelerate their development team velocity. The update to the agent translates to more reliable suggestions and reduced debugging time for developers. This allows developers to focus on higher-level design and innovation, while the agent handles more routine coding tasks with increased accuracy.

The new software development agent for Amazon Q Developer is available in all AWS Regions where Amazon Q is supported. Getting started with the software development agent is simple. Developers can begin using it immediately by typing '/dev' in the Q chat window in Visual Studio Code or JetBrains integrated development environment (IDE) where the Amazon Q Developer plugin is installed. To learn more about Amazon Q, visit the Amazon Q product page or refer to the agent documentation.
aws.amazon.com
April 21, 2025 at 5:40 PM
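The AWS announcement above describes a generate-then-select loop: draft several candidate patches, score each, and return the best. A minimal best-of-n sketch of that pattern (the `generate_patch` and `score_patch` helpers are hypothetical stand-ins, not AWS APIs):

```python
# Hypothetical best-of-n agent loop; helpers are illustrative placeholders.

def generate_patch(issue: str, attempt: int) -> str:
    # placeholder for an LLM call that drafts a candidate patch
    return f"patch-{attempt}-for-{issue}"

def score_patch(patch: str) -> float:
    # placeholder scorer; in practice, run the repo's test suite in a sandbox
    return sum(ord(ch) for ch in patch) % 100 / 100.0

def best_of_n(issue: str, n: int = 5) -> str:
    candidates = [generate_patch(issue, a) for a in range(n)]
    return max(candidates, key=score_patch)

best = best_of_n("fix-issue-123", n=3)
```

The interesting engineering is entirely in the scorer: running tests in an isolated environment is what lets the harness pick a winner without human review.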
The latest from @deepseek_ai, deepseek-chat-v3.1, is live in Cline.

It boasts a SWEBench score of 66%, closely rivaling Sonnet 4's 72.7%.

Its context window is 164k, and it's more affordable at
$0.56/$1.68 per M tokens.
August 21, 2025 at 5:22 PM
GPT‑5.2 Thinking is the new collaborative AI that can code, reason and ship full‑stack web apps end‑to‑end. See how it tackles SWE‑Bench with long‑context and agentic workflows. Dive in! #GPT52Thinking #SWEbench #FullStackAI

🔗 aidailypost.com/news/gpt-52-...
December 16, 2025 at 1:09 PM
The SPICE pipeline now auto‑labels SWE‑Bench data, cutting the cost of labeling 1,000 instances from ~$100,000 to $5.10. It also provides the SPICE Bench set with 6,802 labeled cases from 291 projects. https://getnews.me/spice-introduces-automated-labeling-for-swe-bench-datasets/ #spice #swebench
September 20, 2025 at 11:59 AM
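A quick sanity check of the savings claimed in the SPICE post (figures taken from the post itself; the script only does the arithmetic):

```python
# Figures from the post: ~$100,000 manual vs $5.10 automated for 1,000 instances.
manual_total, auto_total, n = 100_000.0, 5.10, 1_000

per_manual = manual_total / n         # $100.00 per instance
per_auto = auto_total / n             # ~$0.0051 per instance
reduction = manual_total / auto_total # ~19,608x cheaper

print(f"${per_manual:.2f} -> ${per_auto:.4f} per instance "
      f"({reduction:,.0f}x cheaper)")
```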
Noticed yesterday that Claude.ai doesn’t regenerate full artifacts anymore. It seems to make a series of edits instead. Wondering if @anthropic.com is using the edit tool described in their SWEBench blog: www.anthropic.com/research/swe...
Raising the bar on SWE-bench Verified with Claude 3.5 Sonnet
A post for developers about the new Claude 3.5 Sonnet and the SWE-bench eval
www.anthropic.com
December 26, 2024 at 9:12 AM
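The edit-based behavior that post describes can be sketched with a str_replace-style tool, in the spirit of the one Anthropic's SWE-bench post discusses. This is an illustrative stand-in, not Anthropic's implementation: the model emits a unique old snippet plus its replacement, and the harness applies the edit instead of regenerating the whole file.

```python
# Illustrative str_replace-style edit tool (not Anthropic's actual code).

def str_replace(text: str, old: str, new: str) -> str:
    count = text.count(old)
    if count == 0:
        raise ValueError("old snippet not found")
    if count > 1:
        raise ValueError("old snippet is not unique; include more context")
    return text.replace(old, new, 1)

doc = "def add(a, b):\n    return a - b\n"
doc = str_replace(doc, "return a - b", "return a + b")
```

Requiring the old snippet to be unique is what makes the edit safe to apply mechanically; it also explains why such tools ask the model to include surrounding context lines.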
evaluation awareness. To achieve this, we construct a diverse benchmark of 1,000 prompts and transcripts from 61 distinct datasets. These span public benchmarks (e.g., MMLU, SWEBench), real-world deployment interactions, and agent trajectories from [3/7 of https://arxiv.org/abs/2505.23836v1]
June 2, 2025 at 6:01 AM
How we use SWEBench and Terminal Bench to evaluate Warp's coding abilities 👇
July 21, 2025 at 11:03 PM
trae-agent with claude 4 hit 75% on swebench-hard recently
July 20, 2025 at 3:20 PM
Qodo Command Enters AI Coding Agent Wars With 71.2% SWE-Bench Score

#AI #SWEbench #Qodo #OpenAI #Anthropic #GPT5 #Coding

winbuzzer.com/2025/08/12/q...
August 12, 2025 at 2:45 PM
🤖 Autonomous coding agents hit 77.2% on SWE-bench, showing real progress. Local models now tackle real RAG tasks on consumer hardware, but rising trust risks mean oversight matters more than ever.

https://aiconnectnews.com/en/2025/10/agentic-coding-hits-77-on-swe-bench #agentic #swebench
Agentic Coding Hits 77.2% on SWE-bench as Trust Risks Rise
The edge shifts to practical local RAG while legal and dependency risks mount.
aiconnectnews.com
October 14, 2025 at 8:31 AM
DeepSWE: A coding revolution powered by reinforcement learning

Together AI introduces DeepSWE, an open-source software agent that uses reinforcement learning (RL) to reach 59% on the SWEBench benchmark, setting a new direction for the development of AI agents.
aisight.pl
July 4, 2025 at 7:21 PM
SOTA on swebench-verified: (re)learn the bitter lesson

Aide is now SOTA on swebench-verified, solving 62.2% of benchmark issues. We do this by scaling our inference agent at test time and relearning the bitter lesson. > The biggest lesson to be learned from 70 years of AI research is that general methods that use computation are the most effective, and by a large margin. In the midst of this exploration, we also developed a…
top4all.net
January 8, 2025 at 10:39 PM