Lightnews — Scholar-powered news

klieret.bsky.social

@klieret.bsky.social

This analysis was conducted with mini-swe-agent. It's open source and the documentation tells you exactly how to reproduce our numbers. github.com/SWE-agent/mi...

GitHub - SWE-agent/mini-swe-agent: The 100 line AI agent that solves GitHub issues or helps you in your command line. Radically simple, no huge configs, no giant monorepo—but scores 68% on SWE-bench v...

The 100 line AI agent that solves GitHub issues or helps you in your command line. Radically simple, no huge configs, no giant monorepo—but scores 68% on SWE-bench verified! - SWE-agent/mini-swe-agent

github.com

September 30, 2025 at 2:49 PM

klieret.bsky.social

@klieret.bsky.social

You can find all of the trajectories here: klieret.short.gy/mini-traject...

Docent

AI-powered evaluation framework

klieret.short.gy

September 30, 2025 at 2:49 PM

klieret.bsky.social

@klieret.bsky.social

By varying the agent step limit, you can get some control over the cost, giving you a curve of average cost vs SWE-bench score. But clearly it's quite expensive even with conservative limits.

September 30, 2025 at 2:49 PM
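
A rough sketch of how such a cost-vs-score curve could be computed from recorded runs. Everything here is an assumption for illustration: the `Run` record, the flat per-step cost, and the rule that a run cut off by the limit counts as unresolved. None of it is taken from mini-swe-agent's own analysis code.

```python
from dataclasses import dataclass

@dataclass
class Run:
    steps: int            # steps the agent took without a limit (hypothetical data)
    cost_per_step: float  # flat USD cost per step (simplifying assumption)
    resolved: bool        # whether the run solved its instance

def cost_score_curve(runs, limits):
    """Return (limit, average cost, resolve rate) for each step limit."""
    points = []
    for limit in limits:
        total_cost = 0.0
        solved = 0
        for run in runs:
            # A run is truncated at the limit, so it never pays for more steps
            total_cost += min(run.steps, limit) * run.cost_per_step
            # A truncated run is counted as unresolved
            if run.resolved and run.steps <= limit:
                solved += 1
        points.append((limit, total_cost / len(runs), solved / len(runs)))
    return points

# Made-up runs: successes finish early, failures keep going
runs = [
    Run(20, 0.01, True),
    Run(40, 0.01, True),
    Run(150, 0.01, False),
    Run(250, 0.01, False),
]

for limit, avg_cost, rate in cost_score_curve(runs, [25, 50, 100, 250]):
    print(f"limit={limit:>3}  avg cost=${avg_cost:.4f}  resolved={rate:.0%}")
```

In this toy data the resolve rate stops improving after a limit of 50, while the average cost keeps growing because the failed runs continue to spend steps.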

klieret.bsky.social

@klieret.bsky.social

Sonnet 4.5 takes significantly more steps to solve instances than Sonnet 4, making it more expensive to run in practice.

September 30, 2025 at 2:49 PM

klieret.bsky.social

@klieret.bsky.social

github.com/SWE-agent/mi...

GitHub - SWE-agent/mini-swe-agent: The 100 line AI agent that solves GitHub issues or helps you in your command line. Radically simple, no huge configs, no giant monorepo—but scores 68% on SWE-bench v...

The 100 line AI agent that solves GitHub issues or helps you in your command line. Radically simple, no huge configs, no giant monorepo—but scores 68% on SWE-bench verified! - SWE-agent/mini-swe-agent

github.com

September 11, 2025 at 3:05 AM

klieret.bsky.social

@klieret.bsky.social

You can find lots of other models evaluated under the same settings at swebench.com (bash-only leaderboard). You can find our agent implementation at github.com/SWE-agent/mi...

SWE-bench Leaderboards

swebench.com

August 21, 2025 at 10:34 PM

klieret.bsky.social

@klieret.bsky.social

The effective cost per instance comes somewhat close to gpt-5-mini. Will have more thorough comparison soon.

August 21, 2025 at 10:34 PM

klieret.bsky.social

@klieret.bsky.social

Evaluating on the 500 SWE-bench verified instances cost around $18. In terms of the steps taken to solve a problem, deepseek v3.1 chat maxes out later than other models.

August 21, 2025 at 10:34 PM
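
For reference, the per-instance figure implied by those two numbers can be worked out directly. This is only the arithmetic on the rounded "around $18" total, so the result is approximate.

```python
total_cost_usd = 18.0  # "around $18", as stated in the post
num_instances = 500    # SWE-bench verified instances, as stated in the post

# Average spend per instance over the whole evaluation
cost_per_instance = total_cost_usd / num_instances
print(f"about ${cost_per_instance:.3f} per instance")
```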

klieret.bsky.social

@klieret.bsky.social

This is evaluated with mini-swe-agent (common-sense prompts, no tools other than bash, some 100 lines of code for the agent class): github.com/SWE-agent/mi.... We're still working on evaluating some other open source models (including GLM)

August 21, 2025 at 10:34 PM

klieret.bsky.social

@klieret.bsky.social

Blog post: www.swebench.com/SWE-bench/bl...

mini-SWE-agent roulette mode: Randomly switching between models at every step can boost performance - SWE-bench

www.swebench.com

August 20, 2025 at 6:02 PM
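
The idea in the blog post title, picking a model at random at every step, can be sketched roughly as follows. This is not mini-swe-agent's code: `run_roulette`, `fake_query`, the model names, and the "SUBMIT" stop signal are all hypothetical stand-ins, and the real LM call is replaced by a stub.

```python
import random

def run_roulette(models, query, task, max_steps, rng):
    """Run an agent loop that picks a model uniformly at random each step."""
    messages = [task]
    used = []
    for _ in range(max_steps):
        model = rng.choice(models)      # the "roulette": a fresh draw every step
        used.append(model)
        reply = query(model, messages)  # all models share one message history
        messages.append(reply)
        if reply == "SUBMIT":
            break
    return messages, used

def fake_query(model, messages):
    # Stand-in for a real LM call: submits once the history holds 4 messages
    if len(messages) >= 4:
        return "SUBMIT"
    return f"{model}: action for step {len(messages)}"

messages, used = run_roulette(
    ["model-a", "model-b"], fake_query, "fix the bug", 10, random.Random(0)
)
print(used)
```

The point of the sketch is that every model sees the same shared history, so each step can build on actions proposed by a different model.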

klieret.bsky.social

@klieret.bsky.social

Our minimal agent: github.com/SWE-agent/mi...

August 20, 2025 at 6:02 PM

klieret.bsky.social

@klieret.bsky.social

Evaluated with our open source minimal agent github.com/SWE-agent/mi... that tests LMs in a bare-bones shell environment. Agent is implemented in just some 100 lines! We'll add the results to our swe-bench (bash-only) leaderboard shortly: swebench.com

GitHub - SWE-agent/mini-swe-agent: The 100 line AI agent that solves GitHub issues or helps you in your command line. Radically simple, no crazy configs, no giant monorepo—but scores 68% on SWE-bench ...

The 100 line AI agent that solves GitHub issues or helps you in your command line. Radically simple, no crazy configs, no giant monorepo—but scores 68% on SWE-bench verified! - SWE-agent/mini-swe-a...

github.com

August 8, 2025 at 3:20 PM

klieret.bsky.social

@klieret.bsky.social

GPT-5-* is also much faster at getting to its peak, so definitely don't let it run longer than 50 steps for cost efficiency.

August 8, 2025 at 3:20 PM

klieret.bsky.social

@klieret.bsky.social

Agents succeed fast, but fail slowly, so the average cost per instance depends on the step limits. But one thing is clear: GPT-5 is cheaper than Sonnet 4, and GPT-5 mini is incredibly cost efficient!

August 8, 2025 at 3:20 PM

klieret.bsky.social

@klieret.bsky.social

More results in the morning! Run the agent yourself: github.com/SWE-agent/mi...

August 8, 2025 at 4:13 AM

klieret.bsky.social

@klieret.bsky.social

Everything open source at: github.com/SWE-agent/mi...

GitHub - SWE-agent/mini-swe-agent: The 100 line AI agent that solves GitHub issues or helps you in your command line. Radically simple, no crazy configs, no giant monorepo—but scores 65% on SWE-bench ...

The 100 line AI agent that solves GitHub issues or helps you in your command line. Radically simple, no crazy configs, no giant monorepo—but scores 65% on SWE-bench verified! - SWE-agent/mini-swe-a...

github.com

August 7, 2025 at 9:22 PM

klieret.bsky.social

@klieret.bsky.social

We also made some adjustments to our prompt, in particular the following line: "This workflows should be done step-by-step so that you can iterate on your changes and any possible problems." Without it, gpt-5 often tries to solve everything in one step, then quits.

August 7, 2025 at 9:22 PM

klieret.bsky.social

@klieret.bsky.social

Guide to running with gpt-5: mini-swe-agent.com/latest/quick... The extra step is necessary because gpt-5 prices aren't registered in litellm yet. Also expect long run times (the 5 steps above took more than 5 minutes).

Quick start - mini-SWE-agent documentation

mini-swe-agent.com

August 7, 2025 at 9:22 PM
