#swebench
Can you please point to any examples of such benchmarks that satisfy your criteria?

I think SWEBench comes close for Software engineering, but I don't know much about other areas.
December 15, 2024 at 11:24 AM
@swyx https://x.com/swyx/status/1911844000858079716 #x-swyx

why does 4.1 do so well on SWEbench but not on GPQA?

I think there are some interesting insights on reasoning models to be had by diffing through
- where GPT4.1 is near-o1-level (but 7.5x cheaper)
- ...
April 14, 2025 at 6:30 PM
Claude Opus 4.5 just crushed the SWE‑bench, topping 7 of 8 languages and beating Sonnet 4.5 by 15%. From Java to Python, it’s the new multilingual coding champ. Dive into the details! #ClaudeOpus45 #SWEbench #AIcoding

🔗 aidailypost.com/news/claude-...
November 26, 2025 at 8:53 AM
💡 Summary by GPT3:

Aide is an open-source, AI-native IDE driven by an agentic framework that runs on swebench-lite. The IDE enables AI-assisted code editing with features such as proactive fixes, developer control, quick invocation, deep reasoning, and fast edits. It is available for download on Windows, macOS, and Linux, and users can give the developers feedback to shape the product's future.
November 10, 2024 at 1:43 PM
SOTA on swebench-verified: relearning the bitter lesson
https://aide.dev/blog/sota-bitter-lesson
January 8, 2025 at 11:35 PM
o3 evals on FrontierMath being done pass@1 is quite honestly the most disturbing aspect of this. Why aren't more people talking about it? And also that we're likely close to being able to have end-to-end AI-driven AI research, given that SWEBench score??
December 21, 2024 at 1:08 AM
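For context on what pass@1 means in the post above: the standard unbiased pass@k estimator (from the HumanEval paper) can be sketched as below; pass@1 reduces to the empirical pass rate c/n.

```python
# Unbiased pass@k estimator (Chen et al., 2021):
# n samples drawn, c of them pass, k is the sampling budget.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(10, 3, 1), 4))  # 0.3, i.e. the empirical rate c/n
```

With k=1 the formula collapses to 1 − (n−c)/n = c/n, which is why pass@1 is the strictest (and cheapest) setting to report.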
How do AI software engineering agents work?🤔🤖

Find the answer, along with valuable insights from the creators of SWE-bench & SWE-agent, in this article⬇️

newsletter.pragmaticengineer.com/p/ai-coding-...

Great read! 👏 @gergely.pragmaticengineer.com @hejelin.bsky.social

#AI #SWEbench #SWEagent
How do AI software engineering agents work?
Coding agents are the latest promising Artificial Intelligence (AI) tool, and an impressive step up from LLMs. This article is a deep dive into them, with the creators of SWE-bench and SWE-agent.
newsletter.pragmaticengineer.com
August 6, 2024 at 9:58 AM
Google launched Gemini 3 Flash - Pro-level intelligence at Flash speed for $0.50/1M tokens

78% on SWE-bench (beating Gemini 3 Pro!), handles 100 function calls simultaneously, 99.7% on AIME 2025

and it's stronger than 3 Pro at SWE-bench!?
x.com/OfficialLog...
December 18, 2025 at 2:40 AM
🆕 AWS updates Amazon Q Developer's agent, excelling on SWTBench and SWEBench. It speeds up development, provides reliable suggestions, and cuts down on debugging, letting developers innovate. Available in all AWS Regions, access via '/dev' in Visual Studio Code or JetBrains IDE.

#AWS
Amazon Q Developer releases state of the art agent for feature development
Today, AWS announces the update of Amazon Q Developer's software development agent. This new agent achieves state-of-the-art performance on industry benchmark SWTBench Verified (49%) and sits among the top ranking models on SWEBench Verified (66%). The agent has access to tools for planning and reasoning that use the capacity of advanced models to their fullest. By running in a dedicated environment with built-in access to all the functionalities of a modern IDE, the agent is now able to generate multiple candidate solutions for a given problem, select the most promising one, and return higher quality code to the developer.

With this new agent, developers can further accelerate their development team velocity. The update to the agent translates to more reliable suggestions and reduced debugging time for developers. This allows developers to focus on higher-level design and innovation, while the agent handles more routine coding tasks with increased accuracy.

The new software development agent for Amazon Q Developer is available in all AWS Regions where Amazon Q is supported. Getting started with the software development agent is simple. Developers can begin using it immediately by typing '/dev' in the Q chat window in Visual Studio Code or JetBrains integrated development environment (IDE) where the Amazon Q Developer plugin is installed. To learn more about Amazon Q, visit the Amazon Q product page or refer to the agent documentation.
aws.amazon.com
April 21, 2025 at 5:40 PM
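The AWS announcement above describes a generate-then-select loop: draft several candidate patches, score each, and return the best. A minimal best-of-n sketch of that pattern (the `generate_patch` and `score_patch` helpers are hypothetical stand-ins, not AWS APIs):

```python
# Hypothetical best-of-n agent loop; helpers are illustrative placeholders.

def generate_patch(issue: str, attempt: int) -> str:
    # placeholder for an LLM call that drafts a candidate patch
    return f"patch-{attempt}-for-{issue}"

def score_patch(patch: str) -> float:
    # placeholder scorer; in practice, run the repo's test suite in a sandbox
    return sum(ord(ch) for ch in patch) % 100 / 100.0

def best_of_n(issue: str, n: int = 5) -> str:
    candidates = [generate_patch(issue, a) for a in range(n)]
    return max(candidates, key=score_patch)

best = best_of_n("fix-issue-123", n=3)
```

The interesting engineering is entirely in the scorer: running tests in an isolated environment is what lets the harness pick a winner without human review.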
The latest from @deepseek_ai, deepseek-chat-v3.1, is live in Cline.

It boasts a SWEBench score of 66%, closely rivaling Sonnet 4's 72.7%.

Its context window is 164k, and it's more affordable at
$0.56/$1.68 per M tokens.
August 21, 2025 at 5:22 PM
GPT‑5.2 Thinking is the new collaborative AI that can code, reason and ship full‑stack web apps end‑to‑end. See how it tackles SWE‑Bench with long‑context and agentic workflows. Dive in! #GPT52Thinking #SWEbench #FullStackAI

🔗 aidailypost.com/news/gpt-52-...
December 16, 2025 at 1:09 PM
The SPICE pipeline now auto‑labels SWE‑Bench data, cutting the cost of labeling 1,000 instances from ~$100,000 to $5.10. It also provides the SPICE Bench set with 6,802 labeled cases from 291 projects. https://getnews.me/spice-introduces-automated-labeling-for-swe-bench-datasets/ #spice #swebench
September 20, 2025 at 11:59 AM
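A quick sanity check of the savings claimed in the SPICE post (figures taken from the post itself; the script only does the arithmetic):

```python
# Figures from the post: ~$100,000 manual vs $5.10 automated for 1,000 instances.
manual_total, auto_total, n = 100_000.0, 5.10, 1_000

per_manual = manual_total / n         # $100.00 per instance
per_auto = auto_total / n             # ~$0.0051 per instance
reduction = manual_total / auto_total # ~19,608x cheaper

print(f"${per_manual:.2f} -> ${per_auto:.4f} per instance "
      f"({reduction:,.0f}x cheaper)")
```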
Noticed yesterday that Claude.ai doesn’t regenerate full artifacts anymore. It seems to make a series of edits instead. Wondering if @anthropic.com is using the edit tool described in their SWEBench blog: www.anthropic.com/research/swe...
Raising the bar on SWE-bench Verified with Claude 3.5 Sonnet
A post for developers about the new Claude 3.5 Sonnet and the SWE-bench eval
www.anthropic.com
December 26, 2024 at 9:12 AM
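The edit-based behavior that post describes can be sketched with a str_replace-style tool, in the spirit of the one Anthropic's SWE-bench post discusses. This is an illustrative stand-in, not Anthropic's implementation: the model emits a unique old snippet plus its replacement, and the harness applies the edit instead of regenerating the whole file.

```python
# Illustrative str_replace-style edit tool (not Anthropic's actual code).

def str_replace(text: str, old: str, new: str) -> str:
    count = text.count(old)
    if count == 0:
        raise ValueError("old snippet not found")
    if count > 1:
        raise ValueError("old snippet is not unique; include more context")
    return text.replace(old, new, 1)

doc = "def add(a, b):\n    return a - b\n"
doc = str_replace(doc, "return a - b", "return a + b")
```

Requiring the old snippet to be unique is what makes the edit safe to apply mechanically; it also explains why such tools ask the model to include surrounding context lines.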
evaluation awareness. To achieve this, we construct a diverse benchmark of 1,000 prompts and transcripts from 61 distinct datasets. These span public benchmarks (e.g., MMLU, SWEBench), real-world deployment interactions, and agent trajectories from [3/7 of https://arxiv.org/abs/2505.23836v1]
June 2, 2025 at 6:01 AM
How we use SWEBench and Terminal Bench to evaluate Warp's coding abilities 👇
July 21, 2025 at 11:03 PM
trae-agent with claude 4 hit 75% on swebench-hard recently
July 20, 2025 at 3:20 PM
Qodo Command Enters AI Coding Agent Wars With 71.2% SWE-Bench Score

#AI #SWEbench #Qodo #OpenAI #Anthropic #GPT5 #Coding

winbuzzer.com/2025/08/12/q...
August 12, 2025 at 2:45 PM
🤖 Autonomous coding agents hit 77.2% on SWE-bench, showing real progress. Local models now tackle real RAG tasks on consumer hardware, but rising trust risks mean oversight matters more than ever.

https://aiconnectnews.com/en/2025/10/agentic-coding-hits-77-on-swe-bench #agentic #swebench
Agentic Coding Hits 77.2% on SWE-bench as Trust Risks Rise
The edge shifts to practical local RAG while legal and dependency risks mount.
aiconnectnews.com
October 14, 2025 at 8:31 AM
DeepSWE: A coding revolution powered by reinforcement learning

Together AI introduces DeepSWE, an open-source software agent that uses reinforcement learning (RL) to reach 59% on the SWEBench benchmark, setting a new direction for the development of AI agents.
aisight.pl
July 4, 2025 at 7:21 PM
SOTA on swebench-verified: (re)learn the bitter lesson

Aide is now SOTA on swebench-verified, solving 62.2% of benchmark issues. We do this by scaling our inference agent at test time and relearning the bitter lesson. > The biggest lesson to be learned from 70 years of AI research is that general methods that use computation are the most effective, and by a large margin. In the midst of this exploration, we also developed a…
top4all.net
January 8, 2025 at 10:39 PM