klieret.bsky.social
@klieret.bsky.social
This analysis was conducted with mini-swe-agent. It's open source and the documentation tells you exactly how to reproduce our numbers. github.com/SWE-agent/mi...
The 100 line AI agent that solves GitHub issues or helps you in your command line. Radically simple, no huge configs, no giant monorepo—but scores 68% on SWE-bench verified! - SWE-agent/mini-swe-agent
github.com
September 30, 2025 at 2:49 PM
You can find all of the trajectories here: klieret.short.gy/mini-traject...
Docent
AI-powered evaluation framework
klieret.short.gy
September 30, 2025 at 2:49 PM
By varying the agent step limit, you can get some control over the cost, giving you a curve of average cost vs SWE-bench score. But clearly it's quite expensive even with conservative limits.
September 30, 2025 at 2:49 PM
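A rough sketch of how such a cost vs score curve could be computed from finished runs. It is not the code behind the plot. It assumes each run records its step count, outcome, and total cost, and that cost accrues evenly per step, which is a simplification since later steps carry more context. The `Run` class, the `cost_score_curve` function, and the sample data are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Run:
    steps: int       # steps the agent took before stopping (must be > 0)
    resolved: bool   # whether the instance was solved
    cost: float      # total cost of the full run in USD


def cost_score_curve(runs, limits):
    """Return (limit, average cost, resolved fraction) for each step limit."""
    points = []
    for limit in limits:
        total_cost = 0.0
        solved = 0
        for run in runs:
            # Assumption: cost is spread evenly over steps, so a run cut off
            # at `limit` pays a proportional share of its full cost.
            used = min(run.steps, limit)
            total_cost += run.cost * used / run.steps
            # A run only counts as solved if it finished within the limit.
            if run.resolved and run.steps <= limit:
                solved += 1
        points.append((limit, total_cost / len(runs), solved / len(runs)))
    return points


# Made-up runs, not measured data.
RUNS = [
    Run(steps=10, resolved=True, cost=0.10),
    Run(steps=20, resolved=True, cost=0.20),
    Run(steps=100, resolved=False, cost=1.00),
    Run(steps=50, resolved=False, cost=0.50),
]

for limit, avg_cost, score in cost_score_curve(RUNS, [10, 50, 100]):
    print(f"limit={limit:>3}  avg cost=${avg_cost:.3f}  resolved={score:.0%}")
```

In this toy data, raising the limit from 50 to 100 steps raises the average cost but solves nothing new.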
Sonnet 4.5 takes significantly more steps to solve instances than Sonnet 4, making it more expensive to run in practice.
September 30, 2025 at 2:49 PM
You can find lots of other models evaluated under the same settings at swebench.com (bash-only leaderboard). You can find our agent implementation at github.com/SWE-agent/mi...
August 21, 2025 at 10:34 PM
The effective cost per instance comes fairly close to that of gpt-5-mini. We'll have a more thorough comparison soon.
August 21, 2025 at 10:34 PM
Evaluating on the 500 SWE-bench verified instances cost around $18. In terms of steps taken per problem, deepseek v3.1 chat plateaus later than other models.
August 21, 2025 at 10:34 PM
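For scale, the per-instance figure implied by those two numbers works out as below. The $18 total is quoted as "around", so the result is approximate too; the variable names are just for illustration.

```python
TOTAL_COST_USD = 18.0   # "around $18" from the post, so approximate
NUM_INSTANCES = 500     # size of SWE-bench verified

cost_per_instance = TOTAL_COST_USD / NUM_INSTANCES
print(f"~${cost_per_instance:.3f} per instance")
```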
This is evaluated with mini-swe-agent (common-sense prompts, no tools other than bash, roughly 100 lines of code for the agent class): github.com/SWE-agent/mi.... We're still evaluating some other open-source models (including GLM).
August 21, 2025 at 10:34 PM
Evaluated with our open-source minimal agent github.com/SWE-agent/mi..., which tests LMs in a bare-bones shell environment. The agent is implemented in roughly 100 lines! We'll add the results to our SWE-bench (bash-only) leaderboard shortly: swebench.com
August 8, 2025 at 3:20 PM
GPT-5-* is also much faster at getting to its peak, so definitely don't let it run longer than 50 steps for cost efficiency.
August 8, 2025 at 3:20 PM
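One way to pick a step limit from a score vs step-limit curve is to take the smallest limit whose score is already within some tolerance of the best score. The sketch below shows that idea only. The function name, the tolerance, and the curve values are invented for illustration and are not measurements from these runs.

```python
def smallest_limit_near_peak(curve, tolerance=1.0):
    """Return the smallest step limit whose score is within `tolerance`
    (in percentage points) of the best score on the curve.

    `curve` is a list of (step_limit, score_percent) pairs.
    """
    peak = max(score for _, score in curve)
    for limit, score in sorted(curve):
        if score >= peak - tolerance:
            return limit


# Invented curve, not measured data.
CURVE = [(10, 30.0), (25, 55.0), (50, 64.5), (100, 64.8), (250, 65.0)]

print(smallest_limit_near_peak(CURVE))  # 50 for this invented curve
```

A tighter tolerance pushes the chosen limit up, so the choice depends on how much score you are willing to trade for cost.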
Agents succeed fast, but fail slowly, so the average cost per instance depends on the step limits. But one thing is clear: GPT-5 is cheaper than Sonnet 4, and GPT-5 mini is incredibly cost efficient!
August 8, 2025 at 3:20 PM
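To see why failures dominate the bill, you can split total spend by outcome. This is a minimal sketch assuming each run is recorded as an outcome plus a cost; `cost_breakdown` and the sample numbers are hypothetical, not taken from the evaluation.

```python
def cost_breakdown(runs):
    """Summarize spend for a list of (resolved, cost_usd) pairs."""
    total = sum(cost for _, cost in runs)
    failed = sum(cost for resolved, cost in runs if not resolved)
    return {
        "avg_cost": total / len(runs),   # average cost per instance
        "failed_share": failed / total,  # fraction of spend on failed runs
    }


# Hypothetical runs: short cheap successes, long expensive failures.
SAMPLE = [(True, 0.10), (True, 0.20), (False, 1.00), (False, 0.50)]

summary = cost_breakdown(SAMPLE)
print(f"avg=${summary['avg_cost']:.2f}, failed share={summary['failed_share']:.0%}")
```

Because failed runs tend to use up the whole step budget, a tighter limit cuts their cost the most.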
We also adjusted our prompt, in particular by adding the following line: "This workflows should be done step-by-step so that you can iterate on your changes and any possible problems." Without it, gpt-5 often tries to solve everything in one step, then quits.
August 7, 2025 at 9:22 PM
Guide to running with gpt-5: mini-swe-agent.com/latest/quick... The extra step is necessary because gpt-5 prices aren't registered in litellm yet. Also expect long run times (the 5 steps above took more than 5 minutes).
Quick start - mini-SWE-agent documentation
mini-swe-agent.com
August 7, 2025 at 9:22 PM