Statistically, tasks whose gold solutions required more lines of change were harder, while repository size and popularity had only marginal effects. Qualitatively, agent performance aligned poorly with the difficulty perceived by human experts!
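For context, here is a minimal sketch of the kind of analysis behind a finding like this: correlating task properties with per-task success rates. The CSV file and column names (agent_runs.csv, task_id, success, gold_diff_lines, repo_size_kloc) are hypothetical placeholders, not RExBench's actual data layout.

```python
import pandas as pd
from scipy.stats import spearmanr

# One row per (task, agent run): whether the run succeeded plus task properties.
runs = pd.read_csv("agent_runs.csv")  # hypothetical file

# Collapse the runs for each task into a success rate, keeping per-task properties.
per_task = runs.groupby("task_id").agg(
    success_rate=("success", "mean"),
    gold_diff_lines=("gold_diff_lines", "first"),
    repo_size_kloc=("repo_size_kloc", "first"),
)

# Rank correlation between each task property and the success rate.
for predictor in ("gold_diff_lines", "repo_size_kloc"):
    rho, p = spearmanr(per_task[predictor], per_task["success_rate"])
    print(f"{predictor}: rho={rho:.2f}, p={p:.3f}")
```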
We provided two levels of human-written hints. L1: information localization (e.g., which files to edit) & L2: step-by-step guidance. With hints, the best agent's performance improved to 39%, showing that substantial human guidance is still needed.
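To make the two hint levels concrete, here is a hypothetical illustration of what they might contain for one task; the file paths and steps are invented for the example and are not taken from an actual RExBench task.

```python
# Hypothetical hint contents for a single task; paths and steps are invented.
L1_HINT = {
    "level": 1,  # information localization
    "files_to_edit": ["model/attention.py", "scripts/train.py"],
}

L2_HINT = {
    "level": 2,  # step-by-step guidance
    "steps": [
        "Add a config flag that switches on the new ablation.",
        "Modify the attention module to zero out the targeted heads.",
        "Re-run training with the original hyperparameters and log accuracy.",
    ],
}
```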
The best-performing agents (OpenHands + Claude 3.7 Sonnet and Claude Code) had only a 25% average success rate across 3 runs. But we were still impressed that the top agents achieved end-to-end success on several tasks!
We introduce RExBench, a benchmark that tests whether a coding agent can implement a novel experiment based on existing research and code.
Finding: Most agents we tested had a low success rate, but there is promise!
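To give a sense of what end-to-end success on such a task involves, below is a minimal sketch of how an experiment-implementation attempt could be checked automatically: apply the agent's patch to a frozen copy of the research repo, run the experiment, and compare the produced numbers to the gold outcome. The commands, file names, and tolerance are assumptions for illustration, not the actual RExBench harness.

```python
import json
import subprocess
from pathlib import Path

def check_task(repo: Path, agent_patch: Path, gold_results: Path, tol: float = 1e-3) -> bool:
    """Apply an agent's patch, run the experiment, and compare metrics to the gold outcome."""
    # Apply the agent-produced diff to the frozen repository snapshot.
    subprocess.run(["git", "apply", str(agent_patch)], cwd=repo, check=True)

    # Run the task's experiment (the actual command is task-specific).
    subprocess.run(["python", "run_experiment.py"], cwd=repo, check=True)

    # Count the attempt as successful if every gold metric is reproduced within tolerance.
    produced = json.loads((repo / "results.json").read_text())
    expected = json.loads(gold_results.read_text())
    return all(abs(produced[k] - expected[k]) <= tol for k in expected)
```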