Lightnews — Scholar-powered news

klieret.bsky.social

@klieret.bsky.social

By varying the agent step limit, you can get some control over the cost, giving you a curve of average cost vs SWE-bench score. But clearly it's quite expensive even with conservative limits.

September 30, 2025 at 2:49 PM
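A rough sketch of how the cost vs. score curve in the post above could be computed from finished trajectories. Everything here is an assumption for illustration: the `Run` record, the `cost_score_curve` helper, the constant cost per step (real cost per step tends to grow with context length), and the made-up data. It is not mini-SWE-agent's actual format or analysis code.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One agent trajectory (hypothetical record for this sketch)."""
    steps: int            # steps the agent took before stopping on its own
    solved: bool          # whether the final patch resolved the instance
    cost_per_step: float  # simplifying assumption: constant USD cost per step


def cost_score_curve(runs, limits):
    """For each step limit, return (limit, average cost, fraction solved)."""
    curve = []
    for limit in limits:
        total_cost = 0.0
        solved = 0
        for run in runs:
            # A run longer than the limit is cut off: we pay for `limit`
            # steps and count the instance as unsolved.
            used = min(run.steps, limit)
            total_cost += used * run.cost_per_step
            if run.solved and run.steps <= limit:
                solved += 1
        curve.append((limit, total_cost / len(runs), solved / len(runs)))
    return curve


# Made-up data: successes finish early, failures run long.
runs = [
    Run(steps=20, solved=True, cost_per_step=0.01),
    Run(steps=30, solved=True, cost_per_step=0.01),
    Run(steps=100, solved=False, cost_per_step=0.01),
    Run(steps=200, solved=False, cost_per_step=0.01),
]
for limit, avg_cost, score in cost_score_curve(runs, [25, 50, 250]):
    print(f"limit={limit:>3}  avg cost=${avg_cost:.4f}  score={score:.0%}")
```

In this toy data, raising the limit from 50 to 250 more than doubles the average cost without solving anything new, since only the failing runs keep going.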

klieret.bsky.social

@klieret.bsky.social

Sonnet 4.5 takes significantly more steps to solve instances than Sonnet 4, making it more expensive to run in practice

September 30, 2025 at 2:49 PM

klieret.bsky.social

@klieret.bsky.social

We evaluated Anthropic's Sonnet 4.5 with our minimal agent. New record on SWE-bench verified: 70.6%! Same price/token as Sonnet 4, but takes more steps, ending up being more expensive. Cost analysis details & link to full trajectories in 🧵

September 30, 2025 at 2:49 PM

klieret.bsky.social

@klieret.bsky.social

The effective cost per instance comes somewhat close to gpt-5-mini. Will have more thorough comparison soon.

August 21, 2025 at 10:34 PM

klieret.bsky.social

@klieret.bsky.social

Evaluating on the 500 SWE-bench verified instances cost around $18. With respect to the steps taken to solve a problem, deepseek v3.1 chat maxes out later than other models

August 21, 2025 at 10:34 PM

klieret.bsky.social

@klieret.bsky.social

Deepseek v3.1 chat scores 53.8% on SWE-bench verified with mini-SWE-agent. Tends to take more steps to solve problems than others (flattens out after some 125 steps). As a result effective cost is somewhere near GPT-5 mini. Details in 🧵

August 21, 2025 at 10:34 PM

klieret.bsky.social

@klieret.bsky.social

What if your agent uses a different LM at every turn? We let mini-SWE-agent randomly switch between GPT-5 and Sonnet 4 and it scored higher on SWE-bench than with either model separately. Read more in the SWE-bench blog 🧵

August 20, 2025 at 6:02 PM
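A toy sketch of the per-turn model switching described in the post above, assuming a uniform random choice between two models at every step. The names `run_agent`, `query_model`, and `fake_query` and the model labels are hypothetical placeholders, not mini-SWE-agent's API or real model IDs. The LM call is replaced by a stub so the loop runs on its own.

```python
import random

MODELS = ["gpt-5", "sonnet-4"]  # placeholder labels, not real API model IDs


def run_agent(task, query_model, max_steps=50, rng=None):
    """Run a toy agent loop that picks a model at random on each turn.

    `query_model(model, history)` is a hypothetical callable returning
    (reply, done). It stands in for a real LM call.
    """
    rng = rng or random.Random()
    history = [task]
    used = []
    for _ in range(max_steps):
        model = rng.choice(MODELS)  # fresh draw at every turn
        used.append(model)
        reply, done = query_model(model, history)
        history.append(reply)
        if done:
            break
    return history, used


def fake_query(model, history):
    # Stub in place of an LM: signals completion on the third turn.
    return f"{model}: step {len(history)}", len(history) >= 3


history, used = run_agent("fix the bug", fake_query, rng=random.Random(0))
print(used)  # which model handled each turn
```

Both models see the same shared history, so each turn continues from whatever the other model did before it.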

klieret.bsky.social

@klieret.bsky.social

GPT-5-* is also much faster at getting to its peak, so definitely don't let it run longer than 50 steps for cost efficiency.

August 8, 2025 at 3:20 PM

klieret.bsky.social

@klieret.bsky.social

Agents succeed fast, but fail slowly, so the average cost per instance depends on the step limits. But one thing is clear: GPT-5 is cheaper than Sonnet 4, and GPT-5 mini is incredibly cost efficient!

August 8, 2025 at 3:20 PM

klieret.bsky.social

@klieret.bsky.social

We evaluated the new GPT models with a minimal agent on SWE-bench verified. GPT-5 scores 65%, mini 60%, nano 35%. Still behind Opus 4 (68%), on par with Sonnet 4 (65%). But a lot cheaper, especially mini! Complete cost breakdown + details in 🧵

August 8, 2025 at 3:20 PM

klieret.bsky.social

@klieret.bsky.social

Play with gpt-5 in our minimal agent (guide in the 🧵)! gpt-5 really wants to solve anything in one shot, so some prompting adjustments are needed to have it behave like a proper agent. Still likes to cram in a lot into a single step. Full evals tomorrow!

August 7, 2025 at 9:22 PM
