I think SWEBench comes close for software engineering, but I don't know much about other areas.
Coding Agent Based on Qwen3-32B and Achieves 59% on
SWEBench >> Comment below! #industry40 #AI #mhealth #IoT #healthtech
why does 4.1 do so well on SWEbench but not on GPQA?
I think there are some interesting insights on reasoning models to be had by diffing through:
- where GPT4.1 is near-o1-level (but 7.5x cheaper)
- ...
Anthropic Launches Claude Opus 4.5 with 80.9% SWE-bench Score and 66% Price Drop
#AI #Anthropic #Claude #GenerativeAI #LLM #AgenticAI #AICoding #SoftwareDevelopment #AIModels #Opus45 #SWEbench #AIEfficiency #Developers
🔗 aidailypost.com/news/claude-...
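Aide is an open-source, AI-native IDE powered by an agentic framework that runs on swebench-lite. The IDE enables AI-assisted code editing with features such as proactive fixes, developer control, fast invocation, deep reasoning, and fast edits. It is available for download on Windows, MacOS, and Linux, and users can give feedback to the developers to help shape the product's future.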
https://aide.dev/blog/sota-bitter-lesson
Find the answer, along with valuable insights from the creators of SWE-bench & SWE-agent, in this article⬇️
newsletter.pragmaticengineer.com/p/ai-coding-...
Great read! 👏 @gergely.pragmaticengineer.com @hejelin.bsky.social
#AI #SWEbench #SWEagent
78% on SWE-bench (beating Gemini 3 Pro!), handles 100 function calls simultaneously, 99.7% on AIME 2025
and it's stronger than 3 Pro at SweBench!?
x.com/OfficialLog...
#AWS
It boasts a SWEBench score of 66%, closely rivaling Sonnet 4's 72.7%.
Its context window is 164k, and it is more affordable at
$0.56/$1.68 per M tokens.
🔗 aidailypost.com/news/gpt-52-...
bayes.net/swebench-hack/
#AI #SWEbench #Qodo #OpenAI #Anthropic #GPT5 #Coding
winbuzzer.com/2025/08/12/q...
https://aiconnectnews.com/en/2025/10/agentic-coding-hits-77-on-swe-bench #agentic #swebench
Together AI introduces DeepSWE, an open software agent that, thanks to reinforcement learning (RL), achieves 59% on the SWEBench test, setting a new direction for the development of AI agents.
Aide is now SOTA on swebench-verified, solving 62.2% of benchmark issues. We do this by scaling our inference agent at test time and relearning the bitter lesson. > The biggest lesson to be learned from 70 years of AI research is that general…