klieret.bsky.social
@klieret.bsky.social
By varying the agent step limit, you can get some control over the cost, giving you a curve of average cost vs SWE-bench score. But clearly it's quite expensive even with conservative limits.
September 30, 2025 at 2:49 PM
Sonnet 4.5 takes significantly more steps to solve instances than Sonnet 4, making it more expensive to run in practice
September 30, 2025 at 2:49 PM
We evaluated Anthropic's Sonnet 4.5 with our minimal agent. New record on SWE-bench verified: 70.6%! Same price/token as Sonnet 4, but takes more steps, ending up being more expensive. Cost analysis details & link to full trajectories in 🧵
September 30, 2025 at 2:49 PM
The effective cost per instance comes somewhat close to gpt-5-mini. Will have more thorough comparison soon.
August 21, 2025 at 10:34 PM
Evaluating on the 500 SWE-bench verified instances cost around $18. With respect to the steps taken to solve a problem, deepseek v3.1 chat maxes out later than other models
August 21, 2025 at 10:34 PM
Deepseek v3.1 chat scores 53.8% on SWE-bench verified with mini-SWE-agent. Tends to take more steps to solve problems than others (flattens out after some 125 steps). As a result effective cost is somewhere near GPT-5 mini. Details in 🧵
August 21, 2025 at 10:34 PM
What if your agent uses a different LM at every turn? We let mini-SWE-agent randomly switch between GPT-5 and Sonnet 4 and it scored higher on SWE-bench than with either model separately. Read more in the SWE-bench blog 🧵
August 20, 2025 at 6:02 PM
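The per-turn model switching described in the post above can be sketched roughly as follows. This is a minimal illustration, not mini-SWE-agent's actual code: `run_agent`, `pick_model`, `fake_query`, the model labels, and the `"DONE"` stop signal are all hypothetical names made up for this example, and the stub stands in for a real LM call.

```python
import random

# Labels only; these are placeholders, not real API model identifiers.
MODELS = ["gpt-5", "sonnet-4"]


def pick_model(rng):
    """Choose the model for the next turn uniformly at random."""
    return rng.choice(MODELS)


def run_agent(task, query_model, max_steps=5, seed=0):
    """Run a toy agent loop that may use a different model at every turn.

    query_model(model, history) is a caller-supplied function (hypothetical)
    that returns the model's reply as a string.
    """
    rng = random.Random(seed)
    history = [("user", task)]
    used = []
    for _ in range(max_steps):
        model = pick_model(rng)
        used.append(model)
        reply = query_model(model, history)
        # Both models share one history, so each sees the other's steps.
        history.append((model, reply))
        if reply == "DONE":
            break
    return history, used


def fake_query(model, history):
    """Stand-in for a real LM call: finishes once the history has 4 entries."""
    return "DONE" if len(history) >= 4 else f"{model} step"


history, used = run_agent("fix the failing test", fake_query)
print(used)
```

The only point of the sketch is that the model choice happens inside the loop, once per turn, while the conversation history is shared across models.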
GPT-5-* is also much faster at getting to its peak, so definitely don't let it run longer than 50 steps for cost efficiency.
August 8, 2025 at 3:20 PM
Agents succeed fast, but fail slowly, so the average cost per instance depends on the step limits. But one thing is clear: GPT-5 is cheaper than Sonnet 4, and GPT-5 mini is incredibly cost efficient!
August 8, 2025 at 3:20 PM
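A rough sketch of why the step limit drives average cost when successes are short and failures are long. All numbers below are made up for illustration, and the constant cost per step is a simplifying assumption (real cost grows with context length); `Run` and `evaluate_at_limit` are hypothetical names, not part of any SWE-bench tooling.

```python
from dataclasses import dataclass


@dataclass
class Run:
    steps: int            # steps the agent took with no limit imposed
    solved: bool          # whether that unrestricted run solved the instance
    cost_per_step: float  # assumed constant USD cost per step (simplification)


def evaluate_at_limit(runs, step_limit):
    """Return (resolved_rate, avg_cost) if every run were cut off at step_limit."""
    solved = 0
    total_cost = 0.0
    for r in runs:
        used = min(r.steps, step_limit)
        total_cost += used * r.cost_per_step
        # A run only counts as solved if it finished within the limit.
        if r.solved and r.steps <= step_limit:
            solved += 1
    return solved / len(runs), total_cost / len(runs)


# Illustrative data: two quick successes, two long failures.
runs = [
    Run(steps=20, solved=True, cost_per_step=0.01),
    Run(steps=30, solved=True, cost_per_step=0.01),
    Run(steps=150, solved=False, cost_per_step=0.01),
    Run(steps=200, solved=False, cost_per_step=0.01),
]

for limit in (25, 50, 250):
    rate, cost = evaluate_at_limit(runs, limit)
    print(f"limit={limit:>3}  resolved={rate:.0%}  avg_cost=${cost:.2f}")
```

In this toy data, raising the limit from 50 to 250 solves nothing new, yet the average cost more than doubles because only the failing runs keep going.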
We evaluated the new GPT models with a minimal agent on SWE-bench verified. GPT-5 scores 65%, mini 60%, nano 35%. Still behind Opus 4 (68%), on par with Sonnet 4 (65%). But a lot cheaper, especially mini! Complete cost breakdown + details in 🧵
August 8, 2025 at 3:20 PM
Play with gpt-5 in our minimal agent (guide in the 🧵)! gpt-5 really wants to solve anything in one shot, so some prompting adjustments are needed to have it behave like a proper agent. Still likes to cram a lot into a single step. Full evals tomorrow!
August 7, 2025 at 9:22 PM